[PATCH v8 1/4] ethdev: support device error recovery notification
Ray Kinsella
mdr at ashroe.eu
Thu Jun 23 17:58:33 CEST 2022
Chengwen Feng <fengchengwen at huawei.com> writes:
> From: Kalesh AP <kalesh-anakkur.purayil at broadcom.com>
>
> Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try
> to recover from the errors. In this process, the PMD sets the data path
> pointers to dummy functions (which will prevent the crash), and also
> make sure the control path operations failed with retcode -EBUSY.
>
> Also in this process, from the perspective of application, services are
> affected. For example, the Rx/Tx burst APIs cannot receive and send
> packets, and the control plane API return failure.
>
> In some service scenarios, application needs to be aware of the event
> to determine whether to migrate services. So three events were
> introduced:
>
> 1. RTE_ETH_EVENT_ERR_RECOVERING: the PMD must trigger this event to
> notify the application that it detected a hardware or firmware error
> and tries to recover.
> 2. RTE_ETH_EVENT_RECOVER_SUCCESS: the PMD must trigger this event to
> notify the application that it has recovered from the error. And PMD
> already re-configures the port to the state prior to the error.
> 3. RTE_ETH_EVENT_RECOVER_FAILED: the PMD must trigger this event to
> notify the application that it has failed to recover from the error.
> The port may not be usable anymore.
>
> Note: the error recovery of these events is mainly performed by the
> PMD. Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
> performed by the application. The PMD must ensure that the above two
> error handling methods cannot be used at the same time.
>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil at broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur at broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen at huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde at broadcom.com>
> ---
> doc/guides/prog_guide/poll_mode_drv.rst | 32 +++++++++++++++++++++++++
> doc/guides/rel_notes/release_22_07.rst | 11 +++++++++
> lib/ethdev/rte_ethdev.h | 6 +++++
> 3 files changed, 49 insertions(+)
>
> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 9d081b1cba..6398917485 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -627,3 +627,35 @@ by application.
> The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
> the application to handle reset event. It is duty of application to
> handle all synchronization before it calls rte_eth_dev_reset().
> +
> +Error Recovery Notification
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Some PMDs (e.g. hns3) could detect hardware or firmware errors, and try to
> +recover from the errors. In this process, the PMD sets the data path pointers
> +to dummy functions (which will prevent the crash), and also make sure the
> +control path operations failed with retcode -EBUSY.
> +
> +Also in this process, from the perspective of application, services are
> +affected. For example, the Rx/Tx burst APIs cannot receive and send packets,
> +and the control plane API return failure.
> +
> +In some service scenarios, application needs to be aware of the event to
> +determine whether to migrate services. So three events was introduced.
> +
> +The PMD must trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> +application that it detected a hardware or firmware error and tries to recover.
> +
> +The PMD must trigger RTE_ETH_EVENT_RECOVER_SUCCESS event to notify the
> +application that it has recovered from the error. And PMD already re-configures
> +the port to the state prior to the error.
> +
> +The PMD must trigger RTE_ETH_EVENT_RECOVER_FAILED event to notify the
> +application that it has failed to recover from the error. The port may not be
> +usable anymore.
> +
> +.. note::
> + The error recovery of these events is mainly performed by the PMD.
> + Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
> + performed by the application. The PMD must ensure that the above two
> + error handling methods cannot be used at the same time.
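The PMD-side behaviour described in this section (data path pointers swapped to dummy functions, control path failing with -EBUSY) can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in for illustration: `struct mock_port`, `burst_fn`, `real_burst`, `dummy_burst`, `mock_recovery_begin`, `mock_recovery_end` and `mock_ctrl_op` are not DPDK names, and the sketch does not claim to show how hns3 or any other driver implements recovery.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are not DPDK types. */
struct pkt;
typedef uint16_t (*burst_fn)(void *queue, struct pkt **pkts, uint16_t n);

struct mock_port {
	burst_fn rx_burst;   /* current data path entry points */
	burst_fn tx_burst;
	burst_fn saved_rx;   /* real entry points, kept while recovering */
	burst_fn saved_tx;
	int recovering;      /* non-zero while error recovery is in progress */
};

/* Stand-in for a working burst function: pretends all n packets moved. */
static uint16_t real_burst(void *queue, struct pkt **pkts, uint16_t n)
{
	(void)queue; (void)pkts;
	return n;
}

/* Dummy burst function: touches no hardware and moves no packets. */
static uint16_t dummy_burst(void *queue, struct pkt **pkts, uint16_t n)
{
	(void)queue; (void)pkts; (void)n;
	return 0;
}

/* Entering recovery: park the real pointers and install the dummies. */
static void mock_recovery_begin(struct mock_port *p)
{
	assert(p != NULL);
	p->saved_rx = p->rx_burst;
	p->saved_tx = p->tx_burst;
	p->rx_burst = dummy_burst;
	p->tx_burst = dummy_burst;
	p->recovering = 1;
}

/* Recovery succeeded: restore the real data path pointers. */
static void mock_recovery_end(struct mock_port *p)
{
	assert(p != NULL);
	p->rx_burst = p->saved_rx;
	p->tx_burst = p->saved_tx;
	p->recovering = 0;
}

/* Any control path operation is refused while recovery is in progress. */
static int mock_ctrl_op(const struct mock_port *p)
{
	assert(p != NULL);
	return p->recovering ? -EBUSY : 0;
}
```

In this sketch, an application polling `rx_burst` during recovery simply sees zero packets rather than dereferencing a device in an undefined state, which is the "prevent the crash" property the text refers to.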
> diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
> index 6fc044edaa..b237bd3303 100644
> --- a/doc/guides/rel_notes/release_22_07.rst
> +++ b/doc/guides/rel_notes/release_22_07.rst
> @@ -108,6 +108,17 @@ New Features
>
> Added an API which can get the device type of vDPA device.
>
> +* **Added error recover notification.**
> +
> + Added error recover notification to application including:
> +
> + * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
> + that the port is recovering from an error.
> + * Added new event: ``RTE_ETH_EVENT_RECOVER_SUCCESS`` for the PMD to report
> + that the port recover successful from an error.
RTE_ETH_EVENT_RECOVERY_SUCCESS
> + * Added new event: ``RTE_ETH_EVENT_RECOVER_FAILED`` for the PMD to report
RTE_ETH_EVENT_RECOVERY_FAILED
> + that the prot recover failed from an error
to report that port recovery failed
> +
> * **Updated Amazon ena driver.**
>
> The new driver version (v2.7.0) includes:
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 045ee64747..6998f6f0be 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
> * @see rte_eth_rx_avail_thresh_set()
> */
> RTE_ETH_EVENT_RX_AVAIL_THRESH,
> + /** Port recovering from a hardware or firmware error */
> + RTE_ETH_EVENT_ERR_RECOVERING,
> + /** Port recovers successful from the error */
> + RTE_ETH_EVENT_RECOVER_SUCCESS,
RTE_ETH_EVENT_RECOVERY_SUCCESS
> + /** Port recovers failed from the error */
> + RTE_ETH_EVENT_RECOVER_FAILED,
RTE_ETH_EVENT_RECOVERY_FAILED
> RTE_ETH_EVENT_MAX /**< max value of this enum */
> };
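As an illustration of the application side, here is a minimal sketch of a handler that tracks port state across the three events. It is written against a local mock enum so that it compiles without DPDK; `enum mock_event`, `enum port_state` and `recovery_event_cb` are hypothetical names, and the enumerators mirror the names as posted in this version of the patch, which may change. The handler's parameter list follows the shape of the ethdev callback type, so in a real application a function like this would be registered once per event with rte_eth_dev_callback_register().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mock of the three proposed events; not the real rte_eth_event_type. */
enum mock_event {
	MOCK_EVENT_ERR_RECOVERING,
	MOCK_EVENT_RECOVER_SUCCESS,
	MOCK_EVENT_RECOVER_FAILED,
};

/* Hypothetical application-side view of a port. */
enum port_state {
	PORT_ACTIVE,  /* traffic may flow */
	PORT_PAUSED,  /* PMD is recovering: stop calling control path APIs */
	PORT_LOST,    /* recovery failed: migrate services elsewhere */
};

/*
 * Event handler sketch. cb_arg is assumed to point at the application's
 * port_state for this port; ret_param is unused here.
 */
static int recovery_event_cb(uint16_t port_id, enum mock_event event,
			     void *cb_arg, void *ret_param)
{
	enum port_state *state = cb_arg;

	(void)port_id;
	(void)ret_param;
	assert(state != NULL);

	switch (event) {
	case MOCK_EVENT_ERR_RECOVERING:
		*state = PORT_PAUSED;
		break;
	case MOCK_EVENT_RECOVER_SUCCESS:
		/* Per the patch, the PMD has already restored the prior config. */
		*state = PORT_ACTIVE;
		break;
	case MOCK_EVENT_RECOVER_FAILED:
		*state = PORT_LOST;
		break;
	}
	return 0;
}
```

The handler only records state; the decision on whether to migrate services is left to whatever code reads `port_state`, so the handler itself stays short.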
--
Regards, Ray K