[dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback

Ananyev, Konstantin konstantin.ananyev at intel.com
Wed Jan 13 13:58:56 CET 2021



> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov at intel.com>
> Sent: Tuesday, January 12, 2021 5:37 PM
> To: dev at dpdk.org
> Cc: Ma, Liang J <liang.j.ma at intel.com>; Hunt, David <david.hunt at intel.com>; Ray Kinsella <mdr at ashroe.eu>; Neil Horman
> <nhorman at tuxdriver.com>; thomas at monjalon.net; Ananyev, Konstantin <konstantin.ananyev at intel.com>; McDaniel, Timothy
> <timothy.mcdaniel at intel.com>; Richardson, Bruce <bruce.richardson at intel.com>; Macnamara, Chris <chris.macnamara at intel.com>
> Subject: [PATCH v16 07/11] power: add PMD power management API and callback
> 
> From: Liang Ma <liang.j.ma at intel.com>
> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when packets arrive.
> 
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
> 
> This design is using PMD RX callbacks.
> 
> 1. UMWAIT/UMONITOR:
> 
>    When a certain threshold of empty polls is reached, the core will go
>    into a power optimized sleep while waiting on an address of next RX
>    descriptor to be written to.
> 
> 2. TPAUSE/Pause instruction
> 
>    This method uses the pause (or TPAUSE, if available) instruction to
>    avoid busy polling.
> 
> 3. Frequency scaling
>    Reuse existing DPDK power library to scale up/down core frequency
>    depending on traffic volume.
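
For readers of this thread, the empty-poll counting described above can be
sketched in plain C roughly as follows. This is an illustration only, not the
code from the patch: struct poll_state, EMPTY_POLL_THRESHOLD and on_rx_burst()
are hypothetical names, the threshold value is arbitrary, and the sleep itself
is reduced to a return value so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical threshold, chosen only for illustration */
#define EMPTY_POLL_THRESHOLD 512

/* hypothetical per-queue state */
struct poll_state {
	uint64_t empty_polls;
};

/*
 * Called after every RX burst with the number of packets received.
 * Returns true when the caller should enter the power-optimized state.
 */
static bool
on_rx_burst(struct poll_state *st, uint16_t nb_rx)
{
	assert(st != NULL);

	if (nb_rx != 0) {
		/* traffic arrived, start counting from scratch */
		st->empty_polls = 0;
		return false;
	}

	st->empty_polls++;
	/* sleep only once the threshold has been exceeded */
	return st->empty_polls > EMPTY_POLL_THRESHOLD;
}
```

Any non-empty burst resets the counter, so a queue under steady load never
reaches the threshold.
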
> 
> Signed-off-by: Liang Ma <liang.j.ma at intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
> ---
> 
> Notes:
>     v15:
>     - Fix check in UMWAIT callback
> 
>     v13:
>     - Rework the synchronization mechanism to not require locking
>     - Add more parameter checking
>     - Rework n_rx_queues access to not go through internal PMD structures and use
>       public API instead
> 
>  doc/guides/prog_guide/power_man.rst    |  44 +++
>  doc/guides/rel_notes/release_21_02.rst |  10 +
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
>  lib/librte_power/version.map           |   5 +
>  6 files changed, 511 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 

...

> +
> +static uint16_t
> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> +		void *addr __rte_unused)
> +{
> +
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			struct rte_power_monitor_cond pmc;
> +			uint16_t ret;
> +
> +			/*
> +			 * we might get a cancellation request while being
> +			 * inside the callback, in which case the wakeup
> +			 * wouldn't work because it would've arrived too early.
> +			 *
> +			 * to get around this, we notify the other thread that
> +			 * we're sleeping, so that it can spin until we're done.
> +			 * unsolicited wakeups are perfectly safe.
> +			 */
> +			q_conf->umwait_in_progress = true;

This write to umwait_in_progress and the subsequent read of pwr_mgmt_state
can be reordered by the CPU.
I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and
in the disable() code path below.

> +
> +			/* check if we need to cancel sleep */
> +			if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +				/* use monitoring condition to sleep */
> +				ret = rte_eth_get_monitor_addr(port_id, qidx,
> +						&pmc);
> +				if (ret == 0)
> +					rte_power_monitor(&pmc, -1ULL);
> +			}
> +			q_conf->umwait_in_progress = false;
> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +

...

> +
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id)
> +{
> +	struct pmd_queue_cfg *queue_cfg;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +
> +	if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
> +		return -EINVAL;
> +
> +	/* no need to check queue id as wrong queue id would not be enabled */
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
> +		return -EINVAL;
> +
> +	/* let the callback know we're shutting down */
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;

Same as above: the write to pwr_mgmt_state and the read of umwait_in_progress
could be reordered by the CPU.
You need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them.
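
To illustrate the ordering on both sides, here is a minimal sketch. It uses
C11 <stdatomic.h> rather than the DPDK wrappers so that it compiles on its
own; atomic_thread_fence(memory_order_seq_cst) stands in for
rte_atomic_thread_fence(__ATOMIC_SEQ_CST). struct queue_cfg and the helper
names are made up for illustration, and the actual sleep and wakeup calls are
left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum mgmt_state { MGMT_DISABLED = 0, MGMT_BUSY, MGMT_ENABLED };

/* hypothetical stand-in for the per-queue config in the patch */
struct queue_cfg {
	atomic_int state;              /* plays the role of pwr_mgmt_state */
	atomic_bool sleep_in_progress; /* plays the role of umwait_in_progress */
};

/* RX callback side: returns true if it is still OK to go to sleep */
static bool
callback_may_sleep(struct queue_cfg *q)
{
	assert(q != NULL);

	atomic_store_explicit(&q->sleep_in_progress, true,
			memory_order_relaxed);
	/* full fence: the store above must not pass the load below */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&q->state,
			memory_order_relaxed) == MGMT_ENABLED;
}

/* RX callback side: called once the sleep is over (or was skipped) */
static void
callback_done(struct queue_cfg *q)
{
	assert(q != NULL);

	atomic_store_explicit(&q->sleep_in_progress, false,
			memory_order_release);
}

/* control side: returns true if the sleeper still needs a wakeup */
static bool
disable_needs_wakeup(struct queue_cfg *q)
{
	assert(q != NULL);

	atomic_store_explicit(&q->state, MGMT_BUSY, memory_order_relaxed);
	/* full fence: the store above must not pass the load below */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&q->sleep_in_progress,
			memory_order_relaxed);
}
```

With both fences in place at least one side is guaranteed to observe the
other's store: either the callback sees the state change and skips the sleep,
or the disable path sees the flag and keeps sending wakeups.
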

BTW, out of curiosity - why do you need this intermediate
state (PMD_MGMT_BUSY) at all?
Why not directly:
queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
?

> +
> +	switch (queue_cfg->cb_mode) {
> +	case RTE_POWER_MGMT_TYPE_MONITOR:
> +	{
> +		bool exit = false;
> +		do {
> +			/*
> +			 * we may request cancellation while the other thread
> +			 * has just entered the callback but hasn't started
> +			 * sleeping yet, so keep waking it up until we know it's
> +			 * done sleeping.
> +			 */
> +			if (queue_cfg->umwait_in_progress)
> +				rte_power_monitor_wakeup(lcore_id);
> +			else
> +				exit = true;
> +		} while (!exit);
> +	}
> +	/* fall-through */
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		rte_power_freq_max(lcore_id);
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		rte_power_exit(lcore_id);
> +		break;
> +	}
> +	/*
> +	 * we don't free the RX callback here because it is unsafe to do so
> +	 * unless we know for a fact that all data plane threads have stopped.
> +	 */
> +	queue_cfg->cur_cb = NULL;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
> new file mode 100644
> index 0000000000..0bfbc6ba69
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.h
> @@ -0,0 +1,90 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_PMD_MGMT_H
> +#define _RTE_POWER_PMD_MGMT_H
> +
> +/**
> + * @file
> + * RTE PMD Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_power.h>
> +#include <rte_atomic.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * PMD Power Management Type
> + */
> +enum rte_power_pmd_mgmt_type {
> +	/** Use power-optimized monitoring to wait for incoming traffic */
> +	RTE_POWER_MGMT_TYPE_MONITOR = 1,
> +	/** Use power-optimized sleep to avoid busy polling */
> +	RTE_POWER_MGMT_TYPE_PAUSE,
> +	/** Use frequency scaling when traffic is low */
> +	RTE_POWER_MGMT_TYPE_SCALE,
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable power management on a specified RX queue and lcore.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function type.
> +
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id,
> +		enum rte_power_pmd_mgmt_type mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Disable power management on a specified RX queue and lcore.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id);
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
> index 69ca9af616..61996b4d11 100644
> --- a/lib/librte_power/version.map
> +++ b/lib/librte_power/version.map
> @@ -34,4 +34,9 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +
> +	# added in 21.02
> +	rte_power_pmd_mgmt_queue_enable;
> +	rte_power_pmd_mgmt_queue_disable;
> +
>  };
> --
> 2.25.1

