[PATCH v11 2/5] ethdev: support proactive error handling mode

fengchengwen fengchengwen at huawei.com
Tue Oct 11 16:48:37 CEST 2022


Hi Andrew,

On 2022/10/10 16:47, Andrew Rybchenko wrote:
> On 10/9/22 12:10, Chengwen Feng wrote:
>> From: Kalesh AP <kalesh-anakkur.purayil at broadcom.com>
>>
>> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
>> to recover from the errors. In this process, the PMD sets the data path
>> pointers to dummy functions (which will prevent the crash), and also
>> make sure the control path operations failed with retcode -EBUSY.
>
> Could you explain why passive mode is not good. Why is
> proactive better? What are the benefits? IMHO, it would
> be simpler to have just one error recovery mode.


I don't think either mode is inherently good or bad. To a large extent,
the choice is determined by the hardware and software design of the NIC.
Take the hns3 driver as an example:

During error recovery, multiple handshakes are required between the
driver and the firmware, and the handshakes are subject to timeouts.

If passive mode is chosen, the application may not register the callback
(we found that, among many DPDK-based open-source projects, only ovs-dpdk
registers for the reset event), so the recovery will fail. Furthermore,
even if the callback is registered, the recovery process involves
multiple handshakes, which may take too long to complete. Imagine
multiple ports reporting a reset at the same time. (This can happen:
consider a PF reset where the PF has multiple VFs under it.) In this
case, many VFs report the event, but the event callbacks are executed
sequentially (because there is only one interrupt thread). As a result,
the later VFs cannot be processed in time, and their resets may fail.
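
To make the serialization problem concrete, here is a minimal model. It
is not hns3 code: the function name, the per-callback cost and the
timeout value are all made up for illustration, and it assumes every VF
reports the reset at the same instant.

```c
/*
 * Hypothetical model (not driver code): n_vfs ports report a reset at
 * the same instant, a single interrupt thread runs their callbacks one
 * after another, each callback takes cb_ms, and the firmware handshake
 * of every port times out timeout_ms after the report.
 */
static int count_vfs_missing_deadline(int n_vfs, int cb_ms, int timeout_ms)
{
    int missed = 0;
    int elapsed_ms = 0;

    for (int vf = 0; vf < n_vfs; vf++) {
        /* The callback for this VF only finishes after all earlier ones. */
        elapsed_ms += cb_ms;
        if (elapsed_ms > timeout_ms)
            missed++; /* handshake window has already closed */
    }
    return missed;
}
```

With 8 VFs, 100 ms per callback and a 500 ms timeout, the last three VFs
finish at 600, 700 and 800 ms and miss the window, even though each
callback on its own is well within the limit.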


In conclusion, proactive mode is a viable troubleshooting method in
engineering practice.


>>
>> The above error handling mode is known as
>> RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE (proactive error handling mode).
>>
>> In some service scenarios, application needs to be aware of the event
>> to determine whether to migrate services. So three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
>> detected an error and the recovery is being started. Upon receiving the
>> event, the application should not invoke any control path APIs until
>> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
>> it recovers successful from the error, the PMD already re-configures the
>> port, and the effect is the same as that of the restart operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> recovers failed from the error, the port should not usable anymore. The
>> application should close the port.
>>
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil at broadcom.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur at broadcom.com>
>> Signed-off-by: Chengwen Feng <fengchengwen at huawei.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde at broadcom.com>
>
> The code itself LGTM. I just want to understand why we need it.
> It should be proved in the description.
>
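
For reference, the event semantics described in the commit message can
be sketched from the application side roughly as below. This is only an
illustration under assumptions: the enum and function names are local
stand-ins so that it builds without DPDK headers, and the per-port state
is hypothetical. A real application would presumably drive this from a
callback registered with rte_eth_dev_callback_register() for the three
RTE_ETH_EVENT_* values.

```c
#include <stdbool.h>

/* Local stand-ins for the three events named in the commit message. */
enum recovery_event {
    EV_ERR_RECOVERING,
    EV_RECOVERY_SUCCESS,
    EV_RECOVERY_FAILED,
};

/* Hypothetical per-port state kept by the application. */
enum port_state {
    PORT_OK,         /* control path APIs may be called */
    PORT_RECOVERING, /* PMD is recovering: do not call control path APIs */
    PORT_DEAD,       /* recovery failed: the port should be closed */
};

/* Advance the per-port state when one of the three events arrives. */
static enum port_state on_recovery_event(enum port_state cur,
                                         enum recovery_event ev)
{
    if (cur == PORT_DEAD)
        return PORT_DEAD; /* a failed port is not usable anymore */

    switch (ev) {
    case EV_ERR_RECOVERING:
        return PORT_RECOVERING;
    case EV_RECOVERY_SUCCESS:
        return PORT_OK; /* PMD has already re-configured the port */
    case EV_RECOVERY_FAILED:
        return PORT_DEAD;
    }
    return cur;
}

/* The application checks this before invoking any control path API. */
static bool control_path_allowed(enum port_state s)
{
    return s == PORT_OK;
}
```

The application does no recovery work itself here; it only tracks when
it must stay off the control path and when it should give up on the
port.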

