[dpdk-dev] Question about hardware error handling policy

fengchengwen fengchengwen at huawei.com
Fri Jul 23 04:18:08 CEST 2021


On 2021/7/22 23:46, Thomas Monjalon wrote:
> 22/07/2021 15:50, fengchengwen:
>> Hi, all
>>
>>     I notice ethdev support dev_reset ops, which could be used to recover from
>> errors, and only 13+ drivers support this function.
>>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
>> drivers support it (most of them are VF).
>>
>>     This provides users with two ways to handle hardware errors:
>>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>>     b. application detect errors (the detection method is unclear), and call
>>     reset ops to recover.
>>
>>     According to the design of this API, error handling is assigned to the
>> application, and the driver is only responsible for reporting events. This
>> simplifies the driver design (for example, the driver does not need to maintain
>> mutex locks).
>>
>>     As we know, many modern NICs come with firmware, have PCIE interfaces,
>> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
>> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
>> most drivers.
>>
>>     Question 1: what do we think of these errors(particularly firmware/PF)? Do
>> we think that the probability is very low and that there is no need to deal with
>> them?
> 
> Even rare errors must be managed.

Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code, and found:
1) i40e PF driver, it only show logs when detect global reset and other error:
	if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
		PMD_DRV_LOG(INFO, "ICR0: global reset requested");
	if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
		PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
	if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
		PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");
   @Beilei Why not report RESET_EVENT in these cases ? or these error are very rarely
   so just report it. And also, the application may still send or recv packet, These
   resets, if not handled correctly, do not cause the hardware/driver to hang ?

2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which will
remove the device.
   @Matan Why not report RESET_EVENT in these cases ? so the application can be
recovered quickly.

> 
>>     Question 2: I prefer to put error handling in the application layer, because
>> doing it in the driver can make the driver complex, but there is no app to
>> register the INTR_RESET event handler. I think we can build a standard handler
>> in testpmd, What do you think?
> 
> Absolutely. As any ethdev API, it must be tested with testpmd.
> 
> 
> .
> 


More information about the dev mailing list