[dpdk-dev] [RFC] hot plug failure handle mechanism
Bruce Richardson
bruce.richardson at intel.com
Wed Jun 6 14:54:52 CEST 2018
+Tech-board as I think that this should have more input at the design stage
ahead of any code patches being pushed.
On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> Hi Bruce,
>
>
> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> > On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> > <snip>
> > > The hot plug failure handle mechanism should work as below:
> > >
> > > 1. Add a new bus op “handle_hot-unplug” to the bus to handle bus
> > >    read/write errors. It is bus-specific, and each kind of bus can
> > >    implement its own logic.
> > >
> > > 2. Implement the PCI bus specific op “pci_handle_hot_unplug”. In this
> > >    function, based on the failure address, remap the memory which
> > >    belongs to the corresponding device that was unplugged.
> > >
> > > 3. Implement a new sigbus handler and register it when device event
> > >    monitoring starts. Once an MMIO sigbus error occurs, it will trigger
> > >    the above hot plug failure handle mechanism. That keeps an app that
> > >    is doing packet processing from breaking and crashing, so it can go
> > >    on to clean up, fail-safe, or other work.
> > >
> > > 4. Also introduce the solution using testpmd to show an example of
> > >    the whole procedure, like this:
> > >
> > >    device unplug -> failure handle -> stop forwarding -> stop port ->
> > >    close port -> detach port.
> > >
> > Hi Jeff,
> >
> > So, if I understand this correctly, the proposal is that we need two parallel
> > solutions to handle safe removal of a device.
> >
> > 1. We need a solution to support unplugging of the device at the bus level,
> >    i.e. remove the device from the list of devices and make access to
> >    that device invalid.
> > 2. Since the removal of the device from the software lists is not going to
> > be instantaneous, we need a mechanism to handle any accesses to the
> > device from the data path until such time as the removal is complete. To
> >    support that, you propose to add a sigbus handler which will
> >    automatically replace any MMIO BAR mappings with some other memory that is
> >    ok to access - presumably zeroed memory or similar.
> >
> > Is this understanding correct?
>
> I think you are correct about that.
>
> > Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> > to the bus needs to come from the kernel. What is planned to be used there?
>
> Regarding point #1, I should clarify that we will use the device event
> monitor mechanism to detect the hot unplug event.
> The monitor is enabled by the app (or fail-safe driver), and the app (or
> fail-safe driver) registers the event callback. Once a hot unplug is
> detected, the user's callback can be triggered to let the app (or fail-safe
> driver) know about the event and manage the process: it will call APIs to
> stop the device and detach it from the bus.
Ok. If there is no failsafe driver, and the app does not set up a handler,
does nothing happen when we get a removal event? Will the app just crash?
>
> > You also talk about using testpmd as a reference for this, but you don't
> > explain how an application can be notified of a device removal, or even why
> > that is necessary. Since all applications should now be using the proper
> > macros when iterating device lists, and not just assuming devices 0-N are
> > valid, what changes would you see a normal app having to make to be
> > hotplug-safe?
>
> An app or the fail-safe driver could use this mechanism, but at this
> stage I will first use testpmd as a reference.
> As in the reply above, testpmd should enable the device event mechanism to
> monitor device removal and register a callback.
> The device BDF list is managed by the bus, and the hotplug failure handler
> will be processed by the EAL layer, so the app would then be hotplug-safe.
>
> Is there anything I have missed? Please shout. I think I will give more
> detail later.
This is becoming clearer now, thanks. Just the one question above I have at
this point.
Given how long-running this issue of hotplug is, I'm hoping others on the
technical board can also review this proposal.
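
For anyone on the board coming to this fresh, the sigbus part of the proposal
(remap the region on a fault so that the data path keeps running) can be
sketched roughly as below. This is a Linux-only toy under stated assumptions,
not the proposed patch: a temp file stands in for the device BAR, truncating
that file stands in for the unplug, and bar_region, sigbus_handler and
simulate_unplug are names invented for illustration. It also calls mmap()
from a signal handler, which POSIX does not list as async-signal-safe even
though it is a plain system call on Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical record of one mapped device region (stand-in for a BAR). */
struct bar_region {
	void *addr;
	size_t len;
};

static struct bar_region g_bar;          /* a single device, for brevity */
static volatile sig_atomic_t g_remapped; /* number of handler remaps */

/*
 * If the fault address is inside the tracked region, overlay the region
 * with zero-filled anonymous memory so the faulting access can be retried.
 */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t fault = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)g_bar.addr;

	(void)ctx;
	if (fault < base || fault >= base + g_bar.len) {
		/* Not ours: restore the default action and let it re-fault. */
		signal(sig, SIG_DFL);
		return;
	}
	if (mmap(g_bar.addr, g_bar.len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		_exit(3);
	g_remapped++;
}

static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Map a one-page file, make its backing vanish, then read through it. */
static int simulate_unplug(void)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile uint32_t *reg;
	uint32_t val;
	FILE *f;
	int fd;

	assert(page > 0);
	f = tmpfile();
	if (f == NULL)
		return -1;
	fd = fileno(f);
	if (ftruncate(fd, page) != 0 || install_sigbus_handler() != 0) {
		fclose(f);
		return -1;
	}
	g_bar.len = (size_t)page;
	g_bar.addr = mmap(NULL, g_bar.len, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	if (g_bar.addr == MAP_FAILED) {
		fclose(f);
		return -1;
	}
	reg = g_bar.addr;
	*reg = 0xdeadbeef;               /* "device" is present and works */

	if (ftruncate(fd, 0) != 0) {     /* emulate the unplug */
		fclose(f);
		return -1;
	}
	val = *reg;  /* faults; the handler remaps and the read is retried */

	munmap(g_bar.addr, g_bar.len);
	fclose(f);
	return (g_remapped == 1 && val == 0) ? 0 : -1;
}
```

The read after the "unplug" comes back from the zero-filled page instead of
killing the process, which buys time for the orderly stop and detach to
complete.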
Regards,
/Bruce