[dpdk-dev] Question about hardware error handling policy

Thomas Monjalon thomas at monjalon.net
Fri Jul 23 14:51:55 CEST 2021


23/07/2021 14:33, Ferruh Yigit:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
> > 22/07/2021 15:50, fengchengwen:
> >> Hi, all
> >>
> >>     I notice that ethdev supports the dev_reset op, which can be used to
> >> recover from errors, but only 13+ drivers implement it.
> 
> 'rte_eth_dev_reset()' can be used to reset the device config to defaults; it
> does not have to be used for error recovery.
> 
> >>     There is also an event for reset, RTE_ETH_EVENT_INTR_RESET, and only 6
> >> drivers support it (most of them are VF).
> >>
> >>     This provides users with two ways to handle hardware errors:
> >>     a. the driver reports RTE_ETH_EVENT_INTR_RESET, and the application
> >>     performs the reset.
> >>     b. the application detects errors (the detection method is unclear) and
> >>     calls the reset op to recover.
> >>
> >>     According to the design of this API, error handling is assigned to the
> >> application, and the driver is only responsible for reporting events. This
> >> simplifies the driver design (for example, the driver does not need to maintain
> >> mutex locks).
> >>
> >>     As we know, many modern NICs come with firmware, have PCIe interfaces,
> >> and support SR-IOV. The possible hardware errors include firmware reboot,
> >> PF reset, VF reset, and FLR, but these errors (particularly firmware/PF) are
> >> not addressed in most drivers.
> >>
> >>     Question 1: what do we think of these errors (particularly firmware/PF)? Do
> >> we think that the probability is very low and that there is no need to deal with
> >> them?
> > 
> > Even rare errors must be managed.
> > 
> 
> +1
> 
> >>     Question 2: I prefer to put error handling in the application layer, because
> >> doing it in the driver can make the driver complex, but there is no app to
> >> register the INTR_RESET event handler. I think we can build a standard handler
> >> in testpmd. What do you think?
> > 
> > Absolutely. Like any ethdev API, it must be tested with testpmd.
> > 
> 
> Testpmd registers for the RESET event, but when the event is received all it
> does is print a log, so there is no logic behind it.
> 
> If the intention is to add error handling logic to testpmd, my concern is that
> it would be too complex or too device specific.

It shows a problem in the API.
We don't have a clear generic recovery process.

> And if there is something to clean up, recover, etc. at the application level,
> it makes sense for the application to receive the event and act on it. But if
> the reset/recovery can be handled in the PMD, transparently if possible, I
> think that is the better choice.
> 
> Another thing is that I am not sure whether what applications should do on the
> reset event is clear, or the same for all PMDs, which is not good.

Indeed we should improve this area,
and implement the recovery logic in testpmd.
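
To make option (a) from the original question concrete, here is a minimal
sketch of an application-side handler. It is only an illustration under
stated assumptions, not what testpmd does today. The event callback merely
records that a reset is pending, and the control thread later performs
stop, reset, reconfigure and start. All stub_* functions, on_intr_reset() and
recover_pending_ports() are hypothetical names: the stubs stand in for
rte_eth_dev_stop(), rte_eth_dev_reset() and the application's own
configure/start sequence, and on_intr_reset() stands in for a callback
registered with rte_eth_dev_callback_register() for RTE_ETH_EVENT_INTR_RESET.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PORTS 4

/* Per-port flag set by the event callback and consumed by the control
 * thread. A real multi-threaded application would use an atomic here. */
static volatile bool reset_pending[MAX_PORTS];

/* Call counters so the stubbed device operations can be observed. */
static int stub_stop_calls, stub_reset_calls, stub_start_calls;

/* Hypothetical stand-in for rte_eth_dev_stop(). */
static int stub_dev_stop(uint16_t port)
{
    (void)port;
    stub_stop_calls++;
    return 0;
}

/* Hypothetical stand-in for rte_eth_dev_reset(). */
static int stub_dev_reset(uint16_t port)
{
    (void)port;
    stub_reset_calls++;
    return 0;
}

/* Hypothetical stand-in for the application's configure + start sequence. */
static int stub_dev_reconfigure_and_start(uint16_t port)
{
    (void)port;
    stub_start_calls++;
    return 0;
}

/* Stand-in for the reset event callback: it only records the request and
 * does no device operation in the callback context. */
static int on_intr_reset(uint16_t port)
{
    assert(port < MAX_PORTS);
    reset_pending[port] = true;
    return 0;
}

/* Run from the control thread. Returns the number of ports recovered, or
 * -1 on the first failure (the flag stays set so the caller can retry). */
static int recover_pending_ports(void)
{
    int recovered = 0;

    for (uint16_t port = 0; port < MAX_PORTS; port++) {
        if (!reset_pending[port])
            continue;
        if (stub_dev_stop(port) != 0 ||
            stub_dev_reset(port) != 0 ||
            stub_dev_reconfigure_and_start(port) != 0)
            return -1;
        reset_pending[port] = false;
        recovered++;
    }
    return recovered;
}
```

Deferring the work to the control thread keeps the callback trivial and keeps
all control-path calls on one thread. Which steps are actually required after
the reset may differ per PMD, which is exactly the gap discussed above.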



