[PATCH v2 1/7] ethdev: fix race-condition of proactive error handling mode

lihuisong (C) lihuisong at huawei.com
Wed Nov 1 04:39:07 CET 2023


lgtm,
Acked-by: Huisong Li <lihuisong at huawei.com>

在 2023/10/20 18:07, Chengwen Feng 写道:
> In the proactive error handling mode, the PMD will set the data path
> pointers to dummy functions and then try recovery, in this period the
> application may still invoking data path API. This will introduce a
> race-condition with data path which may lead to crash [1].
>
> Although the PMD added delay after setting data path pointers to cover
> the above race-condition, it reduces the probability, but it doesn't
> solve the problem.
>
> To solve the race-condition problem fundamentally, the following
> requirements are added:
> 1. The PMD should set the data path pointers to dummy functions after
>     report RTE_ETH_EVENT_ERR_RECOVERING event.
> 2. The application should stop data path API invocation when process
>     the RTE_ETH_EVENT_ERR_RECOVERING event.
> 3. The PMD should set the data path pointers to valid functions before
>     report RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> 4. The application should enable data path API invocation when process
>     the RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>
> Also, this patch introduce a driver internal function
> rte_eth_fp_ops_setup which used as an help function for PMD.
Agreed with adding this internal interface to use for PMD.
otherwise driver have to operate the global variable.
>
> [1] http://patchwork.dpdk.org/project/dpdk/patch/20230220060839.1267349-2-ashok.k.kaladi@intel.com/
>
> Fixes: eb0d471a8941 ("ethdev: add proactive error handling mode")
> Cc: stable at dpdk.org
>
> Signed-off-by: Chengwen Feng <fengchengwen at huawei.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev at huawei.com>
> ---
>   doc/guides/prog_guide/poll_mode_drv.rst | 20 +++++++---------
>   lib/ethdev/ethdev_driver.c              |  8 +++++++
>   lib/ethdev/ethdev_driver.h              | 10 ++++++++
>   lib/ethdev/rte_ethdev.h                 | 32 +++++++++++++++----------
>   lib/ethdev/version.map                  |  1 +
>   5 files changed, 46 insertions(+), 25 deletions(-)
>
> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index c145a9066c..e380ff135a 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -638,14 +638,9 @@ different from the application invokes recovery in PASSIVE mode,
>   the PMD automatically recovers from error in PROACTIVE mode,
>   and only a small amount of work is required for the application.
>   
> -During error detection and automatic recovery,
> -the PMD sets the data path pointers to dummy functions
> -(which will prevent the crash),
> -and also make sure the control path operations fail with a return code ``-EBUSY``.
> -
> -Because the PMD recovers automatically,
> -the application can only sense that the data flow is disconnected for a while
> -and the control API returns an error in this period.
> +During error detection and automatic recovery, the PMD sets the data path
> +pointers to dummy functions and also make sure the control path operations
> +failed with a return code ``-EBUSY``.
>   
>   In order to sense the error happening/recovering,
>   as well as to restore some additional configuration,
> @@ -653,9 +648,9 @@ three events are available:
>   
>   ``RTE_ETH_EVENT_ERR_RECOVERING``
>      Notify the application that an error is detected
> -   and the recovery is being started.
> +   and the recovery is about to start.
>      Upon receiving the event, the application should not invoke
> -   any control path function until receiving
> +   any control and data path API until receiving
>      ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` or ``RTE_ETH_EVENT_RECOVERY_FAILED`` event.
>   
>   .. note::
> @@ -666,8 +661,9 @@ three events are available:
>   
>   ``RTE_ETH_EVENT_RECOVERY_SUCCESS``
>      Notify the application that the recovery from error is successful,
> -   the PMD already re-configures the port,
> -   and the effect is the same as a restart operation.
> +   the PMD already re-configures the port.
> +   The application should restore some additional configuration, and then
> +   enable data path API invocation.
>   
>   ``RTE_ETH_EVENT_RECOVERY_FAILED``
>      Notify the application that the recovery from error failed,
> diff --git a/lib/ethdev/ethdev_driver.c b/lib/ethdev/ethdev_driver.c
> index fff4b7b4cd..65ead7b910 100644
> --- a/lib/ethdev/ethdev_driver.c
> +++ b/lib/ethdev/ethdev_driver.c
> @@ -537,6 +537,14 @@ rte_eth_dma_zone_free(const struct rte_eth_dev *dev, const char *ring_name,
>   	return rc;
>   }
>   
> +void
> +rte_eth_fp_ops_setup(struct rte_eth_dev *dev)
> +{
> +	if (dev == NULL)
> +		return;
> +	eth_dev_fp_ops_setup(rte_eth_fp_ops + dev->data->port_id, dev);
> +}
> +
>   const struct rte_memzone *
>   rte_eth_dma_zone_reserve(const struct rte_eth_dev *dev, const char *ring_name,
>   			 uint16_t queue_id, size_t size, unsigned int align,
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index deb23ada18..8567b96f53 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -1636,6 +1636,16 @@ int
>   rte_eth_dma_zone_free(const struct rte_eth_dev *eth_dev, const char *name,
>   		 uint16_t queue_id);
>   
> +/**
> + * @internal
> + * Setup eth fast-path API to ethdev values.
> + *
> + * @param dev
> + *  Pointer to struct rte_eth_dev.
> + */
> +__rte_internal
> +void rte_eth_fp_ops_setup(struct rte_eth_dev *dev);
> +
>   /**
>    * @internal
>    * Atomically set the link status for the specific device.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 85b9af7a02..dbe2d9c745 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -3994,25 +3994,28 @@ enum rte_eth_event_type {
>   	 */
>   	RTE_ETH_EVENT_RX_AVAIL_THRESH,
>   	/** Port recovering from a hardware or firmware error.
> -	 * If PMD supports proactive error recovery,
> -	 * it should trigger this event to notify application
> -	 * that it detected an error and the recovery is being started.
> -	 * Upon receiving the event, the application should not invoke any control path API
> -	 * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
> -	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
> -	 * The PMD will set the data path pointers to dummy functions,
> -	 * and re-set the data path pointers to non-dummy functions
> -	 * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> -	 * It means that the application cannot send or receive any packets
> -	 * during this period.
> +	 *
> +	 * If PMD supports proactive error recovery, it should trigger this
> +	 * event to notify application that it detected an error and the
> +	 * recovery is about to start.
> +	 *
> +	 * Upon receiving the event, the application should not invoke any
> +	 * control and data path API until receiving
> +	 * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
> +	 * event.
> +	 *
> +	 * Once this event is reported, the PMD will set the data path pointers
> +	 * to dummy functions, and re-set the data path pointers to valid
> +	 * functions before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> +	 *
>   	 * @note Before the PMD reports the recovery result,
>   	 * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
>   	 * because a larger error may occur during the recovery.
>   	 */
>   	RTE_ETH_EVENT_ERR_RECOVERING,
>   	/** Port recovers successfully from the error.
> -	 * The PMD already re-configured the port,
> -	 * and the effect is the same as a restart operation.
> +	 *
> +	 * The PMD already re-configured the port:
>   	 * a) The following operation will be retained: (alphabetically)
>   	 *    - DCB configuration
>   	 *    - FEC configuration
> @@ -4039,6 +4042,9 @@ enum rte_eth_event_type {
>   	 *      (@see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP)
>   	 * c) Any other configuration will not be stored
>   	 *    and will need to be re-configured.
> +	 *
> +	 * The application should restore some additional configuration
> +	 * (see above case b/c), and then enable data path API invocation.
>   	 */
>   	RTE_ETH_EVENT_RECOVERY_SUCCESS,
>   	/** Port recovery failed.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 919ba5b8e6..1e6ee0a6f1 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -338,6 +338,7 @@ INTERNAL {
>   	rte_eth_devices;
>   	rte_eth_dma_zone_free;
>   	rte_eth_dma_zone_reserve;
> +	rte_eth_fp_ops_setup;
>   	rte_eth_hairpin_queue_peer_bind;
>   	rte_eth_hairpin_queue_peer_unbind;
>   	rte_eth_hairpin_queue_peer_update;


More information about the dev mailing list