[dpdk-dev] [PATCH v16 07/11] power: add PMD power management API and callback

Burakov, Anatoly anatoly.burakov at intel.com
Thu Jan 14 14:00:07 CET 2021


On 13-Jan-21 5:29 PM, Burakov, Anatoly wrote:
> On 13-Jan-21 12:58 PM, Ananyev, Konstantin wrote:
>>
>>
>>> -----Original Message-----
>>> From: Burakov, Anatoly <anatoly.burakov at intel.com>
>>> Sent: Tuesday, January 12, 2021 5:37 PM
>>> To: dev at dpdk.org
>>> Cc: Ma, Liang J <liang.j.ma at intel.com>; Hunt, David 
>>> <david.hunt at intel.com>; Ray Kinsella <mdr at ashroe.eu>; Neil Horman
>>> <nhorman at tuxdriver.com>; thomas at monjalon.net; Ananyev, Konstantin 
>>> <konstantin.ananyev at intel.com>; McDaniel, Timothy
>>> <timothy.mcdaniel at intel.com>; Richardson, Bruce 
>>> <bruce.richardson at intel.com>; Macnamara, Chris 
>>> <chris.macnamara at intel.com>
>>> Subject: [PATCH v16 07/11] power: add PMD power management API and 
>>> callback
>>>
>>> From: Liang Ma <liang.j.ma at intel.com>
>>>
>>> Add a simple on/off switch that will enable saving power when no
>>> packets are arriving. It is based on counting the number of empty
>>> polls and, when the number reaches a certain threshold, entering an
>>> architecture-defined optimized power state that will either wait
>>> until a TSC timestamp expires, or when packets arrive.
>>>
>>> This API mandates a core-to-single-queue mapping (that is, multiple
>>> queued per device are supported, but they have to be polled on different
>>> cores).
>>>
>>> This design is using PMD RX callbacks.
>>>
>>> 1. UMWAIT/UMONITOR:
>>>
>>>     When a certain threshold of empty polls is reached, the core will go
>>>     into a power optimized sleep while waiting on an address of next RX
>>>     descriptor to be written to.
>>>
>>> 2. TPAUSE/Pause instruction
>>>
>>>     This method uses the pause (or TPAUSE, if available) instruction to
>>>     avoid busy polling.
>>>
>>> 3. Frequency scaling
>>>     Reuse existing DPDK power library to scale up/down core frequency
>>>     depending on traffic volume.
>>>
>>> Signed-off-by: Liang Ma <liang.j.ma at intel.com>
>>> Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>>> ---
>>>
>>> Notes:
>>>      v15:
>>>      - Fix check in UMWAIT callback
>>>
>>>      v13:
>>>      - Rework the synchronization mechanism to not require locking
>>>      - Add more parameter checking
>>>      - Rework n_rx_queues access to not go through internal PMD 
>>> structures and use
>>>        public API instead
>>>
>>>      v13:
>>>      - Rework the synchronization mechanism to not require locking
>>>      - Add more parameter checking
>>>      - Rework n_rx_queues access to not go through internal PMD 
>>> structures and use
>>>        public API instead
>>>
>>>   doc/guides/prog_guide/power_man.rst    |  44 +++
>>>   doc/guides/rel_notes/release_21_02.rst |  10 +
>>>   lib/librte_power/meson.build           |   5 +-
>>>   lib/librte_power/rte_power_pmd_mgmt.c  | 359 +++++++++++++++++++++++++
>>>   lib/librte_power/rte_power_pmd_mgmt.h  |  90 +++++++
>>>   lib/librte_power/version.map           |   5 +
>>>   6 files changed, 511 insertions(+), 2 deletions(-)
>>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>>>   create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
>>>
>>
>> ...
>>
>>> +
>>> +static uint16_t
>>> +clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts 
>>> __rte_unused,
>>> +uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>>> +void *addr __rte_unused)
>>> +{
>>> +
>>> +struct pmd_queue_cfg *q_conf;
>>> +
>>> +q_conf = &port_cfg[port_id][qidx];
>>> +
>>> +if (unlikely(nb_rx == 0)) {
>>> +q_conf->empty_poll_stats++;
>>> +if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>>> +struct rte_power_monitor_cond pmc;
>>> +uint16_t ret;
>>> +
>>> +/*
>>> + * we might get a cancellation request while being
>>> + * inside the callback, in which case the wakeup
>>> + * wouldn't work because it would've arrived too early.
>>> + *
>>> + * to get around this, we notify the other thread that
>>> + * we're sleeping, so that it can spin until we're done.
>>> + * unsolicited wakeups are perfectly safe.
>>> + */
>>> +q_conf->umwait_in_progress = true;
>>
>> This write and subsequent read can be reordered by the cpu.
>> I think you need rte_atomic_thread_fence(__ATOMIC_SEQ_CST) here and
>> in disable() code-path below.
>>
>>> +
>>> +/* check if we need to cancel sleep */
>>> +if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) {
>>> +/* use monitoring condition to sleep */
>>> +ret = rte_eth_get_monitor_addr(port_id, qidx,
>>> +&pmc);
>>> +if (ret == 0)
>>> +rte_power_monitor(&pmc, -1ULL);
>>> +}
>>> +q_conf->umwait_in_progress = false;
>>> +}
>>> +} else
>>> +q_conf->empty_poll_stats = 0;
>>> +
>>> +return nb_rx;
>>> +}
>>> +
>>
>> ...
>>
>>> +
>>> +int
>>> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>>> +uint16_t port_id, uint16_t queue_id)
>>> +{
>>> +struct pmd_queue_cfg *queue_cfg;
>>> +
>>> +RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>>> +
>>> +if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT)
>>> +return -EINVAL;
>>> +
>>> +/* no need to check queue id as wrong queue id would not be enabled */
>>> +queue_cfg = &port_cfg[port_id][queue_id];
>>> +
>>> +if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>>> +return -EINVAL;
>>> +
>>> +/* let the callback know we're shutting down */
>>> +queue_cfg->pwr_mgmt_state = PMD_MGMT_BUSY;
>>
>> Same as above - write to pwr_mgmt_state and read from umwait_in_progress
>> could be reordered by cpu.
>> Need to insert rte_atomic_thread_fence(__ATOMIC_SEQ_CST) between them.
>>
>> BTW, out of curiosity - why do you need this intermediate
>> state (PMD_MGMT_BUSY) at all?
>> Why not directly:
>> queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> ?
>>
> 
> Thanks for suggestions, i'll add those.
> 
> The goal for the "intermediate" step is to prevent Rx callback from 
> sleeping in the first place. We can't "wake up" earlier than it goes to 
> sleep, but we may get a request to disable power management while we're 
> at the beginning of the callback and haven't yet entered the 
> rte_power_monitor code.
> 
> In this case, setting it to "BUSY" will prevent the callback from ever 
> sleeping in the first place (see rte_power_pmd_mgmt:108 check), and will 
> unset the "umwait in progress" if there was any.
> 
> So, we have three situations to handle:
> 
> 1) wake up during umwait
> 2) "wake up" during callback after we've set the "umwait in progress" 
> flag but before actual umwait happens - we don't wait to exit before 
> we're sure there's nothing sleeping there
> 3) "wake up" during callback before we set the "umwait in progress" flag
> 
> 1) is handled by the rte_power_monitor_wakeup() call, so that's taken 
> care of. 2) is handled by the other thread waiting on "umwait in 
> progress" becoming false. 3) is handled by having this BUSY check in the 
> umwait thread.
> 
> Hope that made sense!
> 

On further thoughts, the "BUSY" thing relies on a hidden assumption that 
enable/disable power management per queue is supposed to be thread safe. 
If we let go of this assumption, we can get by with just enable/disable, 
so i think i'll just document the thread safety and leave out the "BUSY" 
part.

-- 
Thanks,
Anatoly


More information about the dev mailing list