[dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support

Thomas Monjalon thomas at monjalon.net
Thu Mar 12 08:34:15 CET 2020


12/03/2020 04:25, Kalesh Anakkur Purayil:
> Hi Thomas,
> 
> On Wed, Mar 11, 2020 at 6:49 PM Thomas Monjalon <thomas at monjalon.net> wrote:
> 
> > 22/01/2020 11:16, Kalesh A P:
> > > From: Kalesh AP <kalesh-anakkur.purayil at broadcom.com>
> > >
> > > This patch adds support for recovery event in rte_eth_event framework.
> > > FW error and FW reset conditions would be managed by PMD. Driver uses
> >
> > "Driver"? THE driver? :)
> >
> > > RTE_ETH_EVENT_INTR_RESET event to notify the applications about the
> > > FW reset or error.
> >
> > Which drivers do that?
> >
> [Kalesh]: Second patch in this series implements this behavior in bnxt PMD.
> Error recovery is a new feature added in bnxt PMD in 19.11. This change is
> needed to support error recovery functionality.
> 
> >
> > > In such cases, the PMD would need a recovery event to notify the
> > > application that it has recovered from the FW reset or FW error.
> >
> > Sorry, I don't understand. You said the application is notified of any
> > error. But the PMD can recover from this error? So what is the error in
> > the end? If the error is recovered, why notify the application?
> >
> [Kalesh] : Let me give you some insight on this.
> 
> The error recovery solution is a protocol implemented between firmware and
> bnxt PMD to recover from the fatal errors without a system reboot. There is
> an alarm thread which constantly monitors the health of the firmware and
> initiates a recovery when needed.
> 
> There are two scenarios here:
> 
> 1. Hardware or firmware encountered an error which firmware detected.
> Firmware is in operational status here. In this case, firmware can reset
> the chip and notify the driver about the reset.
> 2. Hardware or firmware encountered an error but firmware is dead/hung.
> Firmware is not in operational status. In this case, the only possible way
> to recover the adapter is through the host driver (bnxt PMD).
> 
> In both cases, bnxt PMD reinitializes with the FW after the reset.
> During that recovery process, the data path is halted and any control path
> operation would fail. So, bnxt PMD has to notify the application about this
> reset/error event to prevent any activity from the application during this
> time.

I think you are changing the meaning of the reset event.
It was described like this:
RTE_ETH_EVENT_INTR_RESET,
            /**< reset interrupt event, sent to VF on PF reset */

Please update this description as well.

Of course, we'll need approval from other PMD maintainers
to accept the new recovery API.
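
To make the flow under discussion concrete, here is a minimal sketch of
how an application might gate its control path between the reset event
and the proposed recovery event. It is plain C with no DPDK dependency.
All names in it (port_event, PORT_EVENT_RECOVERED, port_state,
port_event_cb, port_control_op) are hypothetical stand-ins, not the
ethdev API and not what the patch implements.

```c
#include <stdbool.h>

/* Hypothetical event kinds; stand-ins for the ethdev event enum. */
enum port_event {
    PORT_EVENT_RESET,     /* models RTE_ETH_EVENT_INTR_RESET: FW reset/error */
    PORT_EVENT_RECOVERED  /* models the proposed recovery event (assumed name) */
};

/* Per-port state the application keeps (hypothetical). */
struct port_state {
    bool recovering;      /* true between the reset and recovery events */
};

/* Event callback: only records whether the port is usable. */
static void port_event_cb(struct port_state *st, enum port_event ev)
{
    switch (ev) {
    case PORT_EVENT_RESET:
        st->recovering = true;   /* stop issuing data/control path calls */
        break;
    case PORT_EVENT_RECOVERED:
        st->recovering = false;  /* PMD has reinitialized with the FW */
        break;
    }
}

/* Any control path operation checks the flag before touching the port. */
static int port_control_op(const struct port_state *st)
{
    if (st->recovering)
        return -1;               /* would fail anyway during recovery */
    return 0;                    /* safe to proceed */
}
```

In a real application the callback would presumably be registered
through the ethdev event callback mechanism, and the flag would need to
be safe to read from the data path threads; both are left out of this
sketch.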



