[dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop

Matan Azrad matan at mellanox.com
Sat Feb 3 20:42:14 CET 2018


Hi Adrien

> From: Adrien Mazarguil , Sent: Friday, February 2, 2018 9:53 PM
> Hi Matan,
> 
> On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> > From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM
> > > Hi Matan,
> > >
> > > On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> > > > Hi Adrien
> <snip>
> > > > I don't know what any application does but for me it is a mistake
> > > > to stop all event processes in dev_stop(), Maybe for other
> > > > application
> > > maintainers too.
> > >
> > > Just like you, I don't know either what all the applications ever
> > > written for DPDK expect out of dev_stop(). What's for sure is that
> > > currently, LSC/RMV don't occcur afterward, the same way these events
> > > do not occur before dev_start().
> >
> > Why not? RMV event can occur any time after probe.
> 
> LSC as well (keep in mind this patch modifies the behavior for both events).
> RMV events may also occur before application has a chance to register a
> handler for it, in which case this approach fails to solve the problem it's
> supposed to solve. Mitigate all you want, the application still can't rely on that
> event only.

Again, the application should register to event when it want to be notified about it by callback and unregister to it when it doesn't want to.
So, if application still didn't has chance to register to it, its ok not to get notification about it.
Obviously, Application can check both RMV and LCS statuses after the first registration.
 

> > > Any application possibly relying on this fact will break. In such a
> > > situation, a conservative approach is better.
> >
> > If an application should fail to get event in stopped state it may fail in the
> previous code too:
> > The interrupt run from host thread so the next race may occur:
> > dev_start() : master thread.
> > Context switch.
> > RMV interrupt started to run callbacks: host thread.
> > Context switch.
> > dev_stop(): master thread.
> > Start reconfiguration: master thread.
> > Context switch.
> > Callback running.
> >
> > So, the only thing which can disable callback running after dev_stop() is
> callback unregistration before it.
> 
> After dev_stop() returns, new events cannot be triggered by the PMD which
> is what matters.

No, from application point of view it can happen even if the PMD will stop to trigger it from dev_stop() (as the old code did).
This is exactly what the above scenario says.
 
 >Obviously a callback that already started to run before that
> will eventually have to complete. What's your point?

It can even start after dev_stop() in spite of the PMD will stop to trigger the event, read the above scenario again.

So, my point is:
If application is not safe to get notification after dev_stop() it may fail in both old and new versions.
So every PMD which stops to trigger event from dev_stop() because of your reasons makes a mistake and fails for its purpose. 

 
> There's a race only if an application performs multiple simultaneous control
> operations on the underlying device, but this has always been unsafe (not
> only during RMV) because there are no locks, it's documented as such.

So, it is probably safe.

> > > > > Setting up RMV/LSC callbacks is not the only configuration an
> > > > > application usually performs before calling dev_start(). Think
> > > > > about setting up flow rules, MAC addresses, VLANs, and so on,
> > > > > this on multiple ports before starting them up all at once.
> > > > > Previously it could be done in an unspecified order, now they
> > > > > have to take special care
> > > for RMV/LSC.
> > > >
> > > > Or maybe there callbacks code are already safe for it.
> > > > Or they manages the unregister\register calls in the right places.
> > >
> > > That's my point, these "maybes" don't argue in favor of changing things.
> >
> > What I'm saying is that callbacks should be safe or not registered in the right
> time.
> 
> I understand that, though it's not a valid counter argument :)

With the above scenario it is :)

> > > > > Many devops are only safe when called while a device is stopped.
> > > > > It's even documented in rte_ethdev.h.
> > > > >
> > > >
> > > > And?
> > >
> > > ...And applications therefore often do all their configuration in an
> > > unspecified order while a port is stopped as a measure of safety. No
> > > extra care is taken for RMV/LSC. This uncertainty can be addressed
> > > by not modifying the current behavior.
> >
> > Or they expect to get interrupt and the corner case will come later if we will
> not change it.
> 
> Look, we're throwing opposite use cases at each other in order to make a
> point, and I don't see an end to this since we're both stubborn. Let's thus
> assume applications use a bit of both.
> 
> Now we're left with a problem, before this patch neither use cases were
> broken.

No, again, The above scenario shows it can be broken even before this patch.
 
> Now it's applied, mine is broken so let's agree something needs to
> be done. Either all affected applications need to be updated, or we can
> simply revert this and properly fix fail-safe instead.

I think no, applications which will fail because of this patch may fail before this patch.
So, probably applications should be safe for this patch.

> <snip>
> > > > So, at least for RMV event, we need the notification also in stopped
> state.
> > >
> > > You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
> > > implementing this call benefit from an automatic is_removed() check
> > > on all remaining devops whenever some error occur.
> > > In short, an application will get notified simply by getting
> > > dev_start() (or any other callback) return -EIO and not being able to use
> the device.
> >
> > Yes, but between dev_stop to dev_start may not be any ethdev API calling.
> 
> So what, if an application is not using the device, why does it need to know
> it's been removed?

Data path will not check synchronously the removal status - this is must come by asynchronic  notification.

 >If it's that important, why can't it run its own periodic
> rte_eth_dev_is_removed() probe?

Because ETHDEV have event for it, why to do similar mechanism again?

> > > PMDs that do not implement is_removed() (or in addition to it) could
> > > also artificially trigger a RMV event after dev_start() is called.
> > > As long as the PMD remains quiet while the device is stopped, it's fine.
> >
> > How can the PMD do it after dev_start()? Initiate alarm in dev start function
> to do it later And entering into race again?
> 
> What race? All the PMD needs to to is trigger an event by registering one
> with immediate effect, it won't make any difference to the application. If it
> performs racy control operations from the handler, it's never been a PMD's
> problem.

See the race scenario above, it should be similar.

> Anyway I just provided this idea as a counter argument, it's not really
> needed; relying on rte_eth_dev_is_removed() is the safest approach in my
> opinion.

Not always.
We need them both to be really hotplug aware from all sides.

> > I think it is not worth the effort and this patch is doing the right thing(also
> some other PMDs) .
> 
> Well, it's probably too late to revert this patch for 18.02 since one would also
> have to come up with the proper fix for fail-safe, however that doesn't mean
> breaking things and expecting applications to deal with it because it's never
> been properly documented is OK either.

Nothing breaking for my opinion.

> I'm post-NACKing this patch, it will have to be reverted and a fix submitted
> for the upcoming 18.02 stable branch. In the meantime, it's not a fix for
> mlx4 and as such must not be backported.

I can't agree with you about it now.
But you know, You are the maintainer :) do whatever you want.

Anyway, I think we are all agree that all PMDs should do the same thing and documentation somewhere must be added.
My opinion, as you probably know, is to continue to trigger events even after dev_stop().

Thanks,
Matan. 
> --
> Adrien Mazarguil
> 6WIND


More information about the dev mailing list