[dpdk-dev] [PATCH v10 5/9] power: add PMD power management API and callback
Liang Ma
liang.j.ma at intel.com
Tue Oct 27 15:59:05 CET 2020
Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will either wait
until a TSC timestamp expires, or until packets arrive.
This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they must be polled on different
cores).
This design uses PMD RX callbacks.
1. UMWAIT/UMONITOR:
When a certain threshold of empty polls is reached, the core will go
into a power-optimized sleep while waiting for the address of the next
RX descriptor to be written to.
2. Pause instruction
Instead of moving the core into a deeper C-state, this method uses the
pause instruction to avoid busy polling.
3. Frequency scaling
Reuse existing DPDK power library to scale up/down core frequency
depending on traffic volume.
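
The empty-poll accounting shared by all three schemes can be modelled in plain C, with no DPDK dependencies. The names below (`queue_state`, `on_poll`) are illustrative stand-ins for the patch's `pmd_queue_cfg` and RX callbacks, not part of the actual API:

```c
#include <stdint.h>

#define EMPTYPOLL_MAX 512  /* threshold used by the patch */

/* Hypothetical per-queue state mirroring struct pmd_queue_cfg. */
struct queue_state {
	uint64_t empty_poll_stats;
	int slept; /* stand-in for entering the power-optimized state */
};

/* Model of the RX callback logic: count consecutive empty polls and
 * trigger power saving once the count exceeds the threshold. */
static void on_poll(struct queue_state *q, uint16_t nb_rx)
{
	if (nb_rx == 0) {
		q->empty_poll_stats++;
		if (q->empty_poll_stats > EMPTYPOLL_MAX)
			q->slept = 1; /* would umwait/pause/scale here */
	} else {
		q->empty_poll_stats = 0; /* traffic resumed: reset counter */
	}
}
```

Note that any non-empty poll resets the counter, so power saving only kicks in after a sustained idle period rather than on a single empty burst.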
Signed-off-by: Liang Ma <liang.j.ma at intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
Acked-by: David Hunt <david.hunt at intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev at intel.com>
---
Notes:
v10:
- Updated power library document
v8:
- Rename version map file name
v7:
- Fixed race condition (Konstantin)
- Slight rework of the structure of monitor code
- Added missing inline for wakeup
v6:
- Added wakeup mechanism for UMWAIT
- Removed memory allocation (everything is now allocated statically)
- Fixed various typos and comments
- Check for invalid queue ID
- Moved release notes to this patch
v5:
- Make error checking more robust
- Prevent initializing scaling if ACPI or PSTATE env wasn't set
- Prevent initializing UMWAIT path if PMD doesn't support
get_wake_addr
- Add some debug logging
- Replace x86-specific code path to generic path using the
intrinsic check
---
doc/guides/prog_guide/power_man.rst | 48 ++++
doc/guides/rel_notes/release_20_11.rst | 11 +
lib/librte_power/meson.build | 5 +-
lib/librte_power/rte_power_pmd_mgmt.c | 320 +++++++++++++++++++++++++
lib/librte_power/rte_power_pmd_mgmt.h | 92 +++++++
lib/librte_power/version.map | 4 +
6 files changed, 478 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index 0a3755a901..1b1064c749 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -192,6 +192,51 @@ User Cases
----------
The mechanism can applied to any device which is based on polling. e.g. NIC, FPGA.
+PMD Power Management API
+------------------------
+
+Abstract
+~~~~~~~~
+
+Existing power management mechanisms require developers to change their
+application design or modify code to make use of them. The PMD power management
+API provides a convenient alternative by utilizing Ethernet PMD RX callbacks and
+triggering power saving whenever the empty poll count reaches a certain threshold.
+
+There are multiple power saving schemes available for the developer to choose from.
+Although each queue can be configured with a different scheme, it is strongly
+recommended to configure all queues within the same port with the same scheme.
+
+ * UMWAIT/UMONITOR
+
+ This power saving scheme will put the CPU into an optimized power state and
+ use the UMWAIT/UMONITOR instructions to monitor the Ethernet PMD RX
+ descriptor address, waking the CPU up whenever there's new traffic.
+
+ * Pause
+
+ This power saving scheme will use the `rte_pause` function to avoid busy
+ polling.
+
+ * Frequency scaling
+
+ This power saving scheme will use existing power library functionality to
+ scale the core frequency up/down depending on traffic volume.
+
+
+.. note::
+
+ Currently, this power management API mandates a mapping of 1 queue to
+ 1 core (multiple queues are supported, but each must be polled from a
+ different core).
+
+API Overview for PMD Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Queue Enable**: Enable a specific power scheme for a given queue/port/core
+
+* **Queue Disable**: Disable the power scheme for a given queue/port/core
+
References
----------
@@ -200,3 +245,6 @@ References
* The :doc:`../sample_app_ug/vm_power_management`
chapter in the :doc:`../sample_app_ug/index` section.
+
+* The :doc:`../sample_app_ug/rxtx_callbacks`
+ chapter in the :doc:`../sample_app_ug/index` section.
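
The enable/disable semantics described above (a queue already in the requested state yields ``-EINVAL``) can be modelled with a minimal state machine. This is an illustrative sketch of the transition logic only; the function names are hypothetical, not the DPDK API:

```c
#include <errno.h>

/* State values mirroring enum pmd_mgmt_state in the patch. */
enum pmd_mgmt_state { PMD_MGMT_DISABLED = 0, PMD_MGMT_ENABLED };

/* Single-queue model; the real code keeps one state per port/queue. */
static enum pmd_mgmt_state q_state = PMD_MGMT_DISABLED;

/* Model of rte_power_pmd_mgmt_queue_enable(): reject double enable. */
static int queue_enable_model(void)
{
	if (q_state == PMD_MGMT_ENABLED)
		return -EINVAL; /* already enabled */
	q_state = PMD_MGMT_ENABLED;
	return 0;
}

/* Model of rte_power_pmd_mgmt_queue_disable(): reject double disable. */
static int queue_disable_model(void)
{
	if (q_state == PMD_MGMT_DISABLED)
		return -EINVAL; /* nothing to disable */
	q_state = PMD_MGMT_DISABLED;
	return 0;
}
```

As in the patch, neither function is thread-safe: the control plane is expected to serialize enable/disable calls itself.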
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 2bdc8f9948..5fd8c16025 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -349,6 +349,17 @@ New Features
* Replaced ``--scalar`` command-line option with ``--alg=<value>``, to allow
the user to select the desired classify method.
+* **Added PMD power management mechanism.**
+
+ Three new Ethernet PMD power management mechanisms are added through the
+ existing RX callback infrastructure.
+
+ * Add power saving scheme based on UMWAIT instruction (x86 only)
+ * Add power saving scheme based on ``rte_pause()``
+ * Add power saving scheme based on frequency scaling through the power library
+ * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+ * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
Removed Items
-------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
'power_kvm_vm.c', 'guest_channel.c',
'rte_power_empty_poll.c',
'power_pstate_cpufreq.c',
+ 'rte_power_pmd_mgmt.c',
'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer', 'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..0dcaddc3bd
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX 512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+ /** Device power management is disabled. */
+ PMD_MGMT_DISABLED = 0,
+ /** Device power management is enabled. */
+ PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+ enum pmd_mgmt_state pwr_mgmt_state;
+ /**< State of power management for this queue */
+ enum rte_power_pmd_mgmt_type cb_mode;
+ /**< Callback mode for this queue */
+ const struct rte_eth_rxtx_callback *cur_cb;
+ /**< Callback instance */
+ rte_spinlock_t umwait_lock;
+ /**< Per-queue status lock - used only for UMWAIT mode */
+ volatile void *wait_addr;
+ /**< UMWAIT wakeup address */
+ uint64_t empty_poll_stats;
+ /**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static inline void
+umwait_wakeup(volatile void *addr)
+{
+ uint64_t val;
+
+ val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+ __atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+ __ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static inline void
+umwait_sleep(struct pmd_queue_cfg *q_conf, uint16_t port_id, uint16_t qidx)
+{
+ volatile void *target_addr;
+ uint64_t expected, mask;
+ uint8_t data_sz;
+ uint16_t ret;
+
+ /*
+ * get wake up address for this RX queue, as well as expected value,
+ * comparison mask, and data size.
+ */
+ ret = rte_eth_get_wake_addr(port_id, qidx, &target_addr,
+ &expected, &mask, &data_sz);
+
+ /* this should always succeed as all checks have been done already */
+ if (unlikely(ret != 0))
+ return;
+
+ /*
+ * take out a spinlock to prevent control plane from concurrently
+ * modifying the wakeup data.
+ */
+ rte_spinlock_lock(&q_conf->umwait_lock);
+
+ /* proceed only if we haven't been disabled by the control plane */
+ if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+ /* we're good to go */
+
+ /*
+ * store the wakeup address so that control plane can trigger a
+ * write to this address and wake us up.
+ */
+ q_conf->wait_addr = target_addr;
+ /* -1ULL is maximum value for TSC */
+ rte_power_monitor_sync(target_addr, expected, mask, -1ULL,
+ data_sz, &q_conf->umwait_lock);
+ /* erase the address */
+ q_conf->wait_addr = NULL;
+ }
+ rte_spinlock_unlock(&q_conf->umwait_lock);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+ struct pmd_queue_cfg *q_conf;
+
+ q_conf = &port_cfg[port_id][qidx];
+
+ if (unlikely(nb_rx == 0)) {
+ q_conf->empty_poll_stats++;
+ if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+ umwait_sleep(q_conf, port_id, qidx);
+ } else
+ q_conf->empty_poll_stats = 0;
+
+ return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+ struct pmd_queue_cfg *q_conf;
+
+ q_conf = &port_cfg[port_id][qidx];
+
+ if (unlikely(nb_rx == 0)) {
+ q_conf->empty_poll_stats++;
+ /* sleep for 1 microsecond */
+ if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+ rte_delay_us(1);
+ } else
+ q_conf->empty_poll_stats = 0;
+
+ return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+ struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+ uint16_t max_pkts __rte_unused, void *_ __rte_unused)
+{
+ struct pmd_queue_cfg *q_conf;
+
+ q_conf = &port_cfg[port_id][qidx];
+
+ if (unlikely(nb_rx == 0)) {
+ q_conf->empty_poll_stats++;
+ if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+ /* scale down freq */
+ rte_power_freq_min(rte_lcore_id());
+ } else {
+ q_conf->empty_poll_stats = 0;
+ /* scale up freq */
+ rte_power_freq_max(rte_lcore_id());
+ }
+
+ return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+ uint16_t port_id, uint16_t queue_id,
+ enum rte_power_pmd_mgmt_type mode)
+{
+ struct rte_eth_dev *dev;
+ struct pmd_queue_cfg *queue_cfg;
+ int ret;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+ dev = &rte_eth_devices[port_id];
+
+ /* check if queue id is valid */
+ if (queue_id >= dev->data->nb_rx_queues ||
+ queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ return -EINVAL;
+ }
+
+ queue_cfg = &port_cfg[port_id][queue_id];
+
+ if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+ ret = -EINVAL;
+ goto end;
+ }
+
+ switch (mode) {
+ case RTE_POWER_MGMT_TYPE_WAIT:
+ {
+ /* check if rte_power_monitor is supported */
+ uint64_t dummy_expected, dummy_mask;
+ struct rte_cpu_intrinsics i;
+ volatile void *dummy_addr;
+ uint8_t dummy_sz;
+
+ rte_cpu_get_intrinsics_support(&i);
+
+ if (!i.power_monitor) {
+ RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+ ret = -ENOTSUP;
+ goto end;
+ }
+
+ /* check if the device supports the necessary PMD API */
+ if (rte_eth_get_wake_addr(port_id, queue_id,
+ &dummy_addr, &dummy_expected,
+ &dummy_mask, &dummy_sz) == -ENOTSUP) {
+ RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr\n");
+ ret = -ENOTSUP;
+ goto end;
+ }
+ /* initialize UMWAIT spinlock */
+ rte_spinlock_init(&queue_cfg->umwait_lock);
+
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb_umwait, NULL);
+ break;
+ }
+ case RTE_POWER_MGMT_TYPE_SCALE:
+ {
+ enum power_management_env env;
+ /* only PSTATE and ACPI modes are supported */
+ if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+ !rte_power_check_env_supported(
+ PM_ENV_PSTATE_CPUFREQ)) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+ ret = -ENOTSUP;
+ goto end;
+ }
+ /* ensure we could initialize the power library */
+ if (rte_power_init(lcore_id)) {
+ ret = -EINVAL;
+ goto end;
+ }
+ /* ensure we initialized the correct env */
+ env = rte_power_get_env();
+ if (env != PM_ENV_ACPI_CPUFREQ &&
+ env != PM_ENV_PSTATE_CPUFREQ) {
+ RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+ ret = -ENOTSUP;
+ goto end;
+ }
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+ queue_id, clb_scale_freq, NULL);
+ break;
+ }
+ case RTE_POWER_MGMT_TYPE_PAUSE:
+ /* initialize data before enabling the callback */
+ queue_cfg->empty_poll_stats = 0;
+ queue_cfg->cb_mode = mode;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+ queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+ clb_pause, NULL);
+ break;
+ }
+ ret = 0;
+
+end:
+ return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+ uint16_t port_id, uint16_t queue_id)
+{
+ struct pmd_queue_cfg *queue_cfg;
+ int ret;
+
+ queue_cfg = &port_cfg[port_id][queue_id];
+
+ if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+ ret = -EINVAL;
+ goto end;
+ }
+
+ switch (queue_cfg->cb_mode) {
+ case RTE_POWER_MGMT_TYPE_WAIT:
+ rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+ /* wake up the core from UMWAIT sleep, if any */
+ if (queue_cfg->wait_addr != NULL)
+ umwait_wakeup(queue_cfg->wait_addr);
+ /*
+ * we need to disable early as there might be callback currently
+ * spinning on a lock.
+ */
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+ rte_spinlock_unlock(&queue_cfg->umwait_lock);
+ /* fall-through */
+ case RTE_POWER_MGMT_TYPE_PAUSE:
+ rte_eth_remove_rx_callback(port_id, queue_id,
+ queue_cfg->cur_cb);
+ break;
+ case RTE_POWER_MGMT_TYPE_SCALE:
+ rte_power_freq_max(lcore_id);
+ rte_eth_remove_rx_callback(port_id, queue_id,
+ queue_cfg->cur_cb);
+ rte_power_exit(lcore_id);
+ break;
+ }
+ /*
+ * we don't free the RX callback here because it is unsafe to do so
+ * unless we know for a fact that all data plane threads have stopped.
+ */
+ queue_cfg->cur_cb = NULL;
+ queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+ ret = 0;
+end:
+ return ret;
+}
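
The ``umwait_wakeup()`` helper above relies on a compare-and-exchange that stores back the value it just read: memory is left unchanged, but the store still hits the monitored cache line and wakes a core parked in UMWAIT. The same trick can be reproduced as a standalone sketch with standard C11 atomics (this is not the DPDK implementation itself):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Trigger a write to a monitored address without changing its contents:
 * load the current value, then CAS it back onto itself. The store side
 * of the CAS is what the hardware monitor observes. */
static void wakeup_write(_Atomic uint64_t *addr)
{
	uint64_t val = atomic_load_explicit(addr, memory_order_relaxed);
	atomic_compare_exchange_strong_explicit(addr, &val, val,
		memory_order_relaxed, memory_order_relaxed);
}
```

Using a CAS rather than a plain store matters: a blind store could race with the NIC writing a new descriptor, whereas CAS-ing the observed value back can never clobber a concurrent update.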
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+ /** WAIT callback mode. */
+ RTE_POWER_MGMT_TYPE_WAIT = 1,
+ /** PAUSE callback mode. */
+ RTE_POWER_MGMT_TYPE_PAUSE,
+ /** Freq Scaling callback mode. */
+ RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue will be polled from.
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The queue identifier of the Ethernet device.
+ * @param mode
+ * The power management callback function type.
+ *
+ * @return
+ * 0 on success
+ * <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+ uint16_t port_id,
+ uint16_t queue_id,
+ enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   The lcore the RX queue is polled from.
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The queue identifier of the Ethernet device.
+ * @return
+ * 0 on success
+ * <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+ uint16_t port_id,
+ uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/version.map b/lib/librte_power/version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/version.map
+++ b/lib/librte_power/version.map
@@ -34,4 +34,8 @@ EXPERIMENTAL {
rte_power_guest_channel_receive_msg;
rte_power_poll_stat_fetch;
rte_power_poll_stat_update;
+ # added in 20.11
+ rte_power_pmd_mgmt_queue_enable;
+ rte_power_pmd_mgmt_queue_disable;
+
};
--
2.17.1