[dpdk-dev] [RFC] hot plug failure handle mechanism

Thomas Monjalon thomas at monjalon.net
Thu Jun 14 23:37:17 CEST 2018


Hi,

I am sorry, it is very hard to be sure we understand your thoughts
correctly. I like the proposal, but I want to be sure
my understanding was not biased by what I would like to read :)
So I will try to reword it below. Please confirm it matches your intent.

Hot unplug can happen when a hardware device is removed physically,
or when the software disables it. In both cases, the datapath will fail.
When the unplug is detected, we must stop and close the related instance
of the driver.
The detection can be done with hotplug monitoring (like uevent)
- this is RTE_DEV_EVENT_REMOVE - or by handling the failure
in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
Between the unplug event and its detection, we need to manage
any related failure. That's why you propose a sigbus handler
which will avoid the crash, and can be used to detect the unplug.
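
To make the sigbus idea concrete, here is a minimal sketch, not the
proposed implementation. It assumes Linux, a single tracked region
standing in for a device BAR, and a file-backed mapping that is truncated
to simulate the removal. The names bar_base, bar_len, unplug_seen and
demo_run are hypothetical. Note that mmap() is not in the POSIX list of
async-signal-safe functions; calling it from the handler is an assumption
that holds in practice on Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping: one mapped region standing in for a BAR. */
static volatile uint8_t *bar_base;
static size_t bar_len;
static volatile sig_atomic_t unplug_seen;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)bar_base;

	(void)ctx;
	if (bar_base == NULL || addr < base || addr >= base + bar_len) {
		/* Not our region: restore the default action and return,
		 * so the access faults again and the failure is not hidden. */
		signal(sig, SIG_DFL);
		return;
	}
	/* Replace the dead mapping with zero pages at the same address. */
	if (mmap((void *)base, bar_len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		_exit(1);
	unplug_seen = 1; /* lets the control path detect the unplug */
}

/* Returns 0 if the faulting read survived and the unplug was flagged. */
int demo_run(void)
{
	char path[] = "/tmp/hotunplug-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa;
	uint8_t v;
	void *p;
	int fd;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0)
		return -1;
	bar_len = (size_t)page;
	p = mmap(NULL, bar_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return -1;
	bar_base = p;
	bar_base[0] = 0xAB;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, NULL) != 0)
		return -1;

	/* Simulate the removal: the mapping no longer has backing store. */
	if (ftruncate(fd, 0) != 0)
		return -1;
	/* Faults once; the handler remaps and the load is retried. */
	v = bar_base[0];

	close(fd);
	munmap((void *)bar_base, bar_len);
	return (unplug_seen == 1 && v == 0) ? 0 : -1;
}
```

Calling demo_run() from a main() exercises the whole path: the read
after the truncation raises SIGBUS, the handler swaps in zero memory, and
the read then completes instead of crashing the process.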

Please confirm this is what you thought.
If it is not, do you agree with this description, or am I missing something?

I would like to be sure the sigbus handler will not hide any other
unrelated failure.
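
One way to address this concern, sketched under assumptions: the handler
only claims a fault when a bus recognizes the failing address, and reports
everything else as unhandled so the caller can fall back to the default
action. The struct layout and the names mem_range, bus, range_bus_handle,
dispatch_sigbus and demo_dispatch are hypothetical and are not existing
DPDK API; only the idea of a per-bus hot-unplug op comes from the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapped device region. */
struct mem_range {
	uintptr_t start;
	size_t len;
};

struct bus;

/* Hypothetical bus op: 0 if the address belongs to this bus and was
 * handled, -1 if the address is unknown to this bus. */
typedef int (*hot_unplug_handler_t)(struct bus *bus, uintptr_t fail_addr);

struct bus {
	const char *name;
	const struct mem_range *ranges;
	size_t nb_ranges;
	hot_unplug_handler_t handle_hot_unplug;
	int handled_count;
};

/* Example op: claim the fault only if it falls inside a known range. */
static int range_bus_handle(struct bus *bus, uintptr_t addr)
{
	size_t i;

	for (i = 0; i < bus->nb_ranges; i++) {
		const struct mem_range *r = &bus->ranges[i];

		if (addr >= r->start && addr - r->start < r->len) {
			bus->handled_count++; /* a real op would remap here */
			return 0;
		}
	}
	return -1;
}

/* Ask each bus in turn; -1 means nobody owns the address, so the
 * caller should treat the signal as an unrelated failure. */
int dispatch_sigbus(struct bus **buses, size_t nb_buses, uintptr_t addr)
{
	size_t i;

	for (i = 0; i < nb_buses; i++) {
		struct bus *b = buses[i];

		if (b->handle_hot_unplug != NULL &&
		    b->handle_hot_unplug(b, addr) == 0)
			return 0;
	}
	return -1;
}

/* Small driver: one bus owning the range [0x1000, 0x2000). */
int demo_dispatch(uintptr_t addr)
{
	static const struct mem_range ranges[] = { { 0x1000, 0x1000 } };
	struct bus pci = { "pci", ranges, 1, range_bus_handle, 0 };
	struct bus *buses[] = { &pci };

	return dispatch_sigbus(buses, 1, addr);
}
```

With this shape, a fault at an address inside a device range is absorbed,
while a fault anywhere else returns -1 and can be left to kill the process
as it would without the handler.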


07/06/2018 04:14, Guo, Jia:
> 
> On 6/6/2018 8:54 PM, Bruce Richardson wrote:
> > +Tech-board as I think that this should have more input at the design stage
> > ahead of any code patches being pushed.
> >
> > On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> >> hi,bruce
> >>
> >>
> >> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> >>> On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> >>> <snip>
> >>>>      The hot plug failure handling mechanism should work as below:
> >>>>
> >>>>      1.      Add a new bus op "handle_hot-unplug" to the bus to handle bus
> >>>>      read/write errors. It is bus-specific, and each kind of bus can
> >>>>      implement its own logic.
> >>>>
> >>>>      2.      Implement the PCI bus specific op "pci_handle_hot_unplug". In
> >>>>      this function, based on the failure address, remap the memory
> >>>>      which belongs to the corresponding device that was unplugged.
> >>>>
> >>>>      3.      Implement a new sigbus handler, and register it when device
> >>>>      event monitoring starts. Once an MMIO sigbus error occurs, it will
> >>>>      trigger the above hot plug failure handling mechanism, so that
> >>>>      an app which is processing packets does not break and crash,
> >>>>      and can go on to clean up, fail-safe, or do other work.
> >>>>
> >>>>      4.      Also introduce the solution in testpmd to show an example
> >>>>      of the whole procedure, like this:
> >>>>
> >>>>      device unplug -> failure handle -> stop forwarding -> stop port ->
> >>>>      close port -> detach port.
> >>>>
> >>> Hi Jeff,
> >>>
> >>> so if I understand this correctly the proposal is that we need two parallel
> >>> solutions to handle safe removal of a device.
> >>>
> >>> 1. We need a solution to support unplugging of the device at the bus level,
> >>>      ie. remove the device from the list of devices and to make access to
> >>>      that device invalid.
> >>> 2. Since the removal of the device from the software lists is not going to
> >>>      be instantaneous, we need a mechanism to handle any accesses to the
> >>>      device from the data path until such time as the removal is complete. To
> >>>      support that, you propose to add a sigbus handler which will
> >>>      automatically replace any mmio bar mappings with some other memory that is
> >>>      ok to access - presumably zero memory or similar.
> >>>
> >>> Is this understanding correct?
> >> i think you are correct about that.
> >>
> >>> Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> >>> to the bus needs to come from the kernel. What is planned to be used there?
> >> about point #1, i should clarify that we will use the device event
> >> monitor mechanism to detect the hot unplug event.
> >> the monitor is enabled by the app (or fail-safe driver), and the app
> >> (fail-safe driver) registers the event callback. Once the hot unplug is
> >> detected, the user's callback is triggered to let the app (fail-safe
> >> driver) know about the event and manage the process; it will call APIs
> >> to stop the device and detach the device from the bus.
> > Ok. If there is no failsafe driver, and the app does not set up a handler,
> > does nothing happen when we get a removal event? Will the app just crash?
> 
> when the device event monitor is enabled by the app, the handler is set up
> automatically; the app or fail-safe driver does not need to, and cannot,
> set it up directly.
> so if the app wants to process this hot plug event, all it needs to do is
> enable the hot plug event monitor and register its own callback,
> then the app will not crash when a hotplug event occurs.
> 
> >>> You also talk about using testpmd as a reference for this, but you don't
> >>> explain how an application can be notified of a device removal, or even why
> >>> that is necessary. Since all applications should now be using the proper
> >>> macros when iterating device lists, and not just assuming devices 0-N are
> >>> valid, what changes would you see a normal app having to make to be
> >>> hotplug-safe?
> >> an app or the fail-safe driver could use this mechanism, but at this
> >> stage i will first use testpmd as a reference.
> >> as in the reply above, testpmd should enable the device event mechanism
> >> to monitor device removal, and register a callback.
> >> the device bdf list is managed by the bus and the hotplug failure handler
> >> will be processed by the eal layer, so the app would be hotplug-safe.
> >>
> >> is there anything i missed clarifying? please shout. and i think i will
> >> detail more later.
> > This is becoming clearer now, thanks. Just the one question above I have at
> > this point.
> > Given how long-running this issue of hotplug is, I'm hoping others on the
> > technical board can also review this proposal.
> >
> > Regards,
> > /Bruce
> 
> 






