[dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control

Kevin Traynor ktraynor at redhat.com
Wed Jun 27 19:33:04 CEST 2018


On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> From: Liang Ma <liang.j.ma at intel.com>
> 
> 1. Abstract
> 
> For packet processing workloads such as DPDK, polling is continuous.
> This means CPU cores always show 100% busy, independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is, because without that:
> 
>    * There is no indication of overload conditions
> 
>    * Users do not know how much real load is on a system, which results
>      in wasted energy as no power management is utilized
> 
> Tried and failed schemes include calculating the cycles required from
> the load on the core, in other words the busyness: for example,
> measuring how many cycles it costs to handle each packet and determining
> the frequency cost per core. Due to the varying nature of traffic, types
> of frames and cost in cycles to process, this mechanism becomes complex
> quickly, whereas a simple scheme is required to solve the problems.
> 
> 2. Proposed solution
> 
> For all polling mechanisms, the proposed solution focuses on how many
> times an empty poll is executed, instead of calculating how many cycles
> it costs to handle each packet. A low empty poll count means the core is
> busy processing its workload, therefore a higher frequency is needed. A
> high empty poll count indicates the core has lots of spare time,
> therefore the frequency can be lowered.
> 

Hi Liang/Radu,

I can see the benefit of providing an API for the application to provide
the num rx from each poll, and then have the library step down/up the
freq based on that. However, not sure I follow why you are adding the
complexity of defining power states and training modes.

> 2.1 Power state definition:
> 
> 	LOW:  the frequency is used for purge mode.
> 
> 	MED:  the frequency is used to process modest traffic workload.
> 
> 	HIGH: the frequency is used to process busy traffic workload.
> 

Why do there need to be user-defined freq levels? Why not just keep
stepping down the freq until some user-defined threshold of zero polls
is reached? e.g. keep stepping down until 10% of polls are zero polls,
and have a tail of some time (perhaps user defined) for the step down.
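
As a rough sketch of the threshold scheme suggested above (not code from the patch): the struct, function names and frequency index convention below are all hypothetical, and stepping back up when the empty-poll share falls below the threshold is an added assumption, since only the step-down is described. A real implementation would presumably call into librte_power where this sketch just changes an index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-core state; none of these names come from the patch. */
struct poll_window {
	uint32_t total_polls;  /* polls seen in the current window */
	uint32_t empty_polls;  /* polls that returned zero packets */
	uint32_t freq_idx;     /* assumed: 0 = highest freq, larger = lower */
	uint32_t max_freq_idx; /* index of the lowest available freq */
};

/* Record the result of one poll. */
static void
window_record(struct poll_window *w, uint16_t nb_rx)
{
	w->total_polls++;
	if (nb_rx == 0)
		w->empty_polls++;
}

/*
 * End-of-window decision. Steps down one level while the share of empty
 * polls is above the threshold, steps up one level when it is below
 * (assumption), then resets the window.
 * Returns -1 for a step down, +1 for a step up, 0 for no change.
 */
static int
window_decide(struct poll_window *w, uint32_t empty_pct_threshold)
{
	int step = 0;

	assert(empty_pct_threshold <= 100);

	if (w->total_polls > 0) {
		/* Compare empty/total with threshold/100 without dividing. */
		uint64_t lhs = (uint64_t)w->empty_polls * 100;
		uint64_t rhs = (uint64_t)w->total_polls * empty_pct_threshold;

		if (lhs > rhs && w->freq_idx < w->max_freq_idx) {
			w->freq_idx++;   /* too idle: lower the frequency */
			step = -1;
		} else if (lhs < rhs && w->freq_idx > 0) {
			w->freq_idx--;   /* too busy: raise the frequency */
			step = 1;
		}
	}

	w->total_polls = 0;
	w->empty_polls = 0;
	return step;
}
```

The application would call window_record() after every poll and window_decide() from a periodic timer; the "tail" mentioned above would correspond to the length of that timer period.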

> 2.2 There are two phases to establish the power management system:
> 
> 	a.Initialization/Training phase. No traffic passes through; the
> 	  system measures the average empty poll count in each of the
> 	  LOW/MED/HIGH power states. Those averages are the baseline
> 	  for the normal phase. The system collects every core's counters
> 	  every 100ms. The training phase takes 5 seconds.
> 

This is requiring an application to sit for 5 secs in order to train and
align poll numbers with states? That doesn't seem realistic to me.

> 	b.Normal phase. When real traffic passes through, the system
> 	  compares the run-time empty poll moving average with the baseline,
> 	  then decides whether to move to the HIGH or the MED power
> 	  state. The system collects every core's counters every 10ms.
> 
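
For readers following along, the comparison described in 2.2 b could look roughly like the sketch below. This is not taken from the patch: the window size, the names, and the decision rule (move to HIGH when the moving average drops below a percentage of the MED baseline) are assumptions, and the samples are assumed to be normalized to the same interval as the baseline.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW 4 /* hypothetical number of samples in the moving average */

enum power_state { STATE_LOW, STATE_MED, STATE_HIGH };

/* Ring buffer holding the most recent empty-poll samples for one core. */
struct empty_poll_avg {
	uint64_t samples[WINDOW];
	uint64_t sum;
	unsigned int next;  /* slot the next sample is written to */
	unsigned int count; /* number of valid samples, at most WINDOW */
};

/* Add one per-interval sample and return the updated moving average. */
static uint64_t
avg_push(struct empty_poll_avg *a, uint64_t empty_polls)
{
	if (a->count == WINDOW)
		a->sum -= a->samples[a->next]; /* drop the oldest sample */
	else
		a->count++;

	a->samples[a->next] = empty_polls;
	a->sum += empty_polls;
	a->next = (a->next + 1) % WINDOW;
	return a->sum / a->count;
}

/*
 * Assumed rule: if the moving average is below busy_pct percent of the
 * MED baseline from training, the core is busy and should run at HIGH;
 * otherwise MED is enough.
 */
static enum power_state
pick_state(uint64_t avg, uint64_t med_baseline, unsigned int busy_pct)
{
	assert(busy_pct <= 100);

	if (avg * 100 < med_baseline * busy_pct)
		return STATE_HIGH;
	return STATE_MED;
}
```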

I only reviewed this commit msg and API usage, so maybe I didn't fully
get the use case or details, but it seems quite awkward from an
application perspective IMHO.

> 3. Proposed  API
> 
> 1.  rte_power_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>  
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resources held by the power management system.
>  
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update a specific core's empty poll counter; not
> thread safe.
>  
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update a specific core's valid poll counter; not
> thread safe.
>  

I think 4 could be dropped and 3 used instead. It could be a simple API
that takes in the core and nb_pkts from a poll. Seems clearer than
making a separate API for a special value of nb_pkts (i.e. 0) and the
application having to check to know which API should be called.
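
A minimal sketch of what such a combined call might look like follows. The function name, the stats layout and MAX_LCORES are hypothetical, and it assumes the counters count polls rather than packets. nb_pkts is uint16_t here because that is what rte_eth_rx_burst() returns, whereas the patch uses uint8_t.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LCORES 128 /* hypothetical; a DPDK build would use RTE_MAX_LCORE */

/* Hypothetical per-core counters; not the layout used by the patch. */
struct poll_stats {
	uint64_t empty_polls; /* polls that returned no packets */
	uint64_t valid_polls; /* polls that returned at least one packet */
};

static struct poll_stats stats[MAX_LCORES];

/*
 * Single update entry point: the application passes the result of every
 * poll and the library decides which counter to bump. Like the calls in
 * the patch, this is not thread safe.
 */
static void
poll_stat_update(unsigned int lcore_id, uint16_t nb_pkts)
{
	assert(lcore_id < MAX_LCORES);

	if (nb_pkts == 0)
		stats[lcore_id].empty_polls++;
	else
		stats[lcore_id].valid_polls++;
}
```

With this shape the rx loop just forwards the burst return value on every iteration and no longer needs a branch to choose between two calls.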

> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get a specific core's empty poll counter.
>  
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get a specific core's valid poll counter.
> 
> 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allows the user to customize the frequency of a power state.
> 
> 8.  rte_power_empty_poll_setup_timer(void);
> which is used to set up the timer/callback to process all the above
> counters.
> 

The new API should be marked experimental.

> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: updated makefile and symbol list
> 
> Signed-off-by: Liang Ma <liang.j.ma at intel.com>
> Signed-off-by: Radu Nicolau <radu.nicolau at intel.com>
> ---
>  lib/librte_power/Makefile               |   5 +-
>  lib/librte_power/meson.build            |   5 +-
>  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
>  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
>  lib/librte_power/rte_power_version.map  |  14 +-
>  5 files changed, 742 insertions(+), 5 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_empty_poll.c
>  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> 

Is there any in-tree documentation planned?

Kevin.

