[dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control

Liang, Ma liang.j.ma at intel.com
Thu Jul 5 16:45:34 CEST 2018


On 27 Jun 18:33, Kevin Traynor wrote:
> On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> > From: Liang Ma <liang.j.ma at intel.com>
> > 
> > 1. Abstract
> > 
> > For packet processing workloads such as DPDK, polling is continuous.
> > This means CPU cores always show 100% busy, independent of how much work
> > those cores are doing. It is critical to accurately determine how busy
> > a core is, for the following reasons:
> > 
> >    * No indication of overload conditions
> > 
> >    * Users do not know how much real load is on a system, resulting in
> >      wasted energy as no power management is utilized
> > 
> > Tried and failed schemes include calculating the cycles required from
> > the load on the core, in other words the busyness. For example,
> > how many cycles it costs to handle each packet and determining the
> > frequency cost per core. Due to the varying nature of traffic, types of
> > frames and cost in cycles to process, this mechanism becomes complex
> > quickly where a simple scheme is required to solve the problems.
> > 
> > 2. Proposed solution
> > 
> > For all polling mechanisms, the proposed solution focuses on how many
> > empty polls are executed instead of calculating how many cycles it costs
> > to handle each packet. A low empty poll count means the current core is
> > busy processing its workload, therefore a higher frequency is needed. A
> > high empty poll count indicates the current core has a lot of spare time,
> > therefore we can lower the frequency.
> > 
> 
> Hi Liang/Radu,
> 
> I can see the benefit of providing an API for the application to provide
> the num rx from each poll, and then have the library step down/up the
> freq based on that. However, not sure I follow why you are adding the
> complexity of defining power states and training modes.
> 
> > 2.1 Power state definition:
> > 
> > 	LOW:  the frequency is used for purge mode.
> > 
> > 	MED:  the frequency is used to process modest traffic workload.
> > 
> > 	HIGH: the frequency is used to process busy traffic workload.
> > 
> 
> Why does there need to be user defined freq levels? Why not just keep
> stepping down the freq until there is some user-defined threshold of
> zero polls reached. e.g. keep stepping down until 10% of polls are zero
> poll and have a tail of some time (perhaps user defined) for the step down.
Transferring from one P-state to another requires an MSR update, which is
expensive, and swapping states too many times will disturb worker core
performance.
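
To illustrate the point about limiting P-state transitions, here is a minimal
sketch (not the code in the patch). All names are hypothetical, and the 30/70
thresholds and the "hold" count are assumptions chosen only for the example.
The state is changed only after several consecutive samples agree, so a single
noisy sample does not trigger an MSR write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-level power state, mirroring LOW/MED/HIGH from 2.1. */
enum sketch_state { SKETCH_LOW, SKETCH_MED, SKETCH_HIGH };

struct sketch_ctl {
	enum sketch_state cur;   /* state currently applied */
	enum sketch_state want;  /* candidate state seen in recent samples */
	unsigned int streak;     /* consecutive samples agreeing on 'want' */
	unsigned int hold;       /* samples required before switching */
	unsigned int switches;   /* number of (expensive) transitions made */
};

/* Map an empty-poll percentage (0..100) to a target state.
 * The 30/70 thresholds are illustrative assumptions only. */
static enum sketch_state sketch_target(unsigned int empty_pct)
{
	assert(empty_pct <= 100);
	if (empty_pct < 30)
		return SKETCH_HIGH;
	if (empty_pct < 70)
		return SKETCH_MED;
	return SKETCH_LOW;
}

/* Feed one sample; switch only after 'hold' consecutive samples agree. */
static enum sketch_state sketch_step(struct sketch_ctl *c,
				     unsigned int empty_pct)
{
	enum sketch_state t = sketch_target(empty_pct);

	if (t == c->cur) {
		/* Sample agrees with the applied state: drop any candidate. */
		c->streak = 0;
		c->want = c->cur;
		return c->cur;
	}
	if (t != c->want) {
		c->want = t;
		c->streak = 1;
	} else {
		c->streak++;
	}
	if (c->streak >= c->hold) {
		/* A real implementation would change the P-state here. */
		c->cur = t;
		c->streak = 0;
		c->switches++;
	}
	return c->cur;
}
```

With hold set to 3, three low-empty-poll samples in a row move MED to HIGH,
while an isolated sample pointing at another state is ignored.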
> 
> > 2.2 There are two phases to establish the power management system:
> > 
> > 	a. Initialization/Training phase. There is no traffic passing through;
> > 	  the system will test average empty poll numbers with the
> > 	  LOW/MED/HIGH power states. Those average empty poll numbers
> > 	  will be the baseline for the normal phase. The system will
> > 	  collect all cores' counters every 100ms. The Training phase
> > 	  will take 5 seconds.
> > 
> 
> This is requiring an application to sit for 5 secs in order to train and
> align poll numbers with states? That doesn't seem realistic to me.
Because each CPU SKU has a different configuration, micro-architecture, cache
size, number of power states, etc., it has to be tested in the Training phase
to find the baseline. A simple app can block RX for the first 5 secs.
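
As a rough sketch of the baseline calculation (hypothetical names, not the
patch code): with a 5 second training phase sampled every 100ms there are 50
samples, and the baseline is taken here as their plain average. One such
average would be computed for each of LOW/MED/HIGH; using a plain arithmetic
mean is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TRAIN_MS   5000u  /* training length from the description */
#define SKETCH_SAMPLE_MS  100u   /* sampling interval from the description */
#define SKETCH_NB_SAMPLES (SKETCH_TRAIN_MS / SKETCH_SAMPLE_MS)  /* 50 */

/* Average of n per-interval empty-poll counts (integer division). */
static uint64_t sketch_baseline(const uint64_t *samples, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	assert(samples != NULL && n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	return sum / n;
}
```

The result depends on the CPU SKU the training runs on, which is why it is
measured at start-up rather than hard-coded.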
> 
> > 	b. Normal phase. When real traffic passes through, the system will
> > 	  compare the run-time empty poll moving average value with the
> > 	  baseline, then decide whether to move to the HIGH power state or
> > 	  the MED power state. The system will collect all cores' counters
> > 	  every 10ms.
> > 
> 
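
For reference, the normal-phase comparison could be sketched as below. This is
only an illustration under assumptions: the names are hypothetical, the window
length of 4 samples is made up, and the "percent of baseline" test is one
possible way to compare the moving average with the baseline, not necessarily
what the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WIN 4  /* window length: an assumption, not from the patch */

struct sketch_mavg {
	uint64_t ring[SKETCH_WIN];  /* last SKETCH_WIN samples */
	uint64_t sum;               /* running sum of the samples in 'ring' */
	unsigned int idx;           /* next slot to overwrite */
	unsigned int count;         /* number of valid samples, <= SKETCH_WIN */
};

/* Push one 10ms empty-poll sample and return the current moving average. */
static uint64_t sketch_mavg_push(struct sketch_mavg *m, uint64_t v)
{
	if (m->count == SKETCH_WIN)
		m->sum -= m->ring[m->idx];  /* evict the oldest sample */
	else
		m->count++;
	m->ring[m->idx] = v;
	m->sum += v;
	m->idx = (m->idx + 1) % SKETCH_WIN;
	return m->sum / m->count;
}

/* Nonzero when empty polls dropped below 'pct' percent of the baseline,
 * i.e. the core looks busy and a higher frequency may be needed. */
static int sketch_is_busy(uint64_t mavg, uint64_t baseline, unsigned int pct)
{
	assert(pct <= 100);
	return mavg * 100 < baseline * pct;
}
```

The structure must be zero-initialized before the first push.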
> I only reviewed this commit msg and API usage, so maybe I didn't fully
> get the use case or details, but it seems quite awkward from an
> application perspective IMHO.
> 
> > 3. Proposed  API
> > 
> > 1.  rte_power_empty_poll_stat_init(void);
> > which is used to initialize the power management system.
> >  
> > 2.  rte_power_empty_poll_stat_free(void);
> > which is used to free the resources held by the power management system.
> >  
> > 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > which is used to update a specific core's empty poll counter; not thread safe
> >  
> > 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > which is used to update a specific core's valid poll counter; not thread safe
> >  
> 
> I think 4 could be dropped and 3 used instead. It could be a simple API
> that takes in the core and nb_pkts from a poll. Seems clearer than
> making a separate API for a special value of nb_pkts (i.e. 0) and the
> application having to check to know which API should be called.
Agree.
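
A merged update call could look roughly like the sketch below. The function
name, the stats layout and SKETCH_MAX_LCORE (a stand-in for RTE_MAX_LCORE) are
hypothetical and only show the idea of one entry point that takes nb_pkts
straight from the poll. uint16_t is used because that is what
rte_eth_rx_burst() returns; the v4 patch uses uint8_t.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 128  /* stand-in for RTE_MAX_LCORE */

struct sketch_poll_stats {
	uint64_t empty_polls;  /* polls that returned 0 packets */
	uint64_t valid_polls;  /* polls that returned at least 1 packet */
};

static struct sketch_poll_stats sketch_stats[SKETCH_MAX_LCORE];

/* Single entry point: the caller passes the rx burst result as is.
 * Not thread safe, same as the APIs in the patch; call it only from
 * the lcore that owns the counters. */
static void sketch_poll_stat_update(unsigned int lcore_id, uint16_t nb_pkts)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	if (nb_pkts == 0)
		sketch_stats[lcore_id].empty_polls++;
	else
		sketch_stats[lcore_id].valid_polls++;
}
```

The application then calls it once per poll and no longer has to check
nb_pkts itself to pick the right API.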
> 
> > 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get a specific core's empty poll counter.
> >  
> > 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get a specific core's valid poll counter.
> > 
> > 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > which allows the user to customize the frequency of a power state.
> > 
> > 8.  rte_power_empty_poll_setup_timer(void);
> > which is used to set up the timer/callback to process all of the above counters.
> > 
> 
> The new API should be experimental
> 
> > ChangeLog:
> > v2: fix some coding style issues
> > v3: rename the filename, API name.
> > v4: updated makefile and symbol list
> > 
> > Signed-off-by: Liang Ma <liang.j.ma at intel.com>
> > Signed-off-by: Radu Nicolau <radu.nicolau at intel.com>
> > ---
> >  lib/librte_power/Makefile               |   5 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  14 +-
> >  5 files changed, 742 insertions(+), 5 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> > 
> 
> Is there any in-tree documentation planned?
> 
> Kevin.
