[dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control

Thomas Monjalon thomas at monjalon.net
Thu Jul 12 19:30:55 CEST 2018


05/07/2018 16:45, Liang, Ma:
> On 27 Jun 18:33, Kevin Traynor wrote:
> > On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> > > From: Liang Ma <liang.j.ma at intel.com>
> > > 
> > > 1. Abstract
> > > 
> > > For packet processing workloads such as DPDK, polling is continuous.
> > > This means CPU cores always show 100% busy, independent of how much work
> > > those cores are doing. Accurately determining how busy a core is
> > > matters for the following reasons:
> > > 
> > >    * No indication of overload conditions
> > > 
> > >    * Users do not know how much real load is on a system, resulting in
> > >      wasted energy as no power management is utilized
> > > 
> > > Tried and failed schemes include calculating the cycles required from
> > > the load on the core, in other words the busyness. For example,
> > > how many cycles it costs to handle each packet and determining the
> > > frequency cost per core. Due to the varying nature of traffic, types of
> > > frames and cost in cycles to process, this mechanism becomes complex
> > > quickly where a simple scheme is required to solve the problems.
> > > 
> > > 2. Proposed solution
> > > 
> > > For all polling mechanisms, the proposed solution focuses on how many
> > > empty polls are executed instead of calculating how many cycles it costs
> > > to handle each packet. A low empty poll count means the current core is
> > > busy processing its workload, so a higher frequency is needed. A high
> > > empty poll count indicates the current core has lots of spare time,
> > > so we can lower the frequency.
> > > 
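A minimal sketch of the idea described above, for illustration only: it
maps the share of empty polls in a sampling window to a frequency hint.
The names freq_hint_from_polls, low_pct and high_pct are hypothetical and
are not part of the patch or of librte_power; the thresholds are assumed
to be supplied by the caller.

```c
#include <stdint.h>

/* Hypothetical frequency hint; not an enum from librte_power. */
enum freq_hint { FREQ_LOWER, FREQ_KEEP, FREQ_RAISE };

/*
 * Derive a hint from the poll counters of one sampling window.
 * empty_polls: polls that returned no packets
 * total_polls: all polls in the window
 * low_pct/high_pct: assumed thresholds, in percent of empty polls
 */
static enum freq_hint
freq_hint_from_polls(uint64_t empty_polls, uint64_t total_polls,
		unsigned int low_pct, unsigned int high_pct)
{
	uint64_t empty_pct;

	if (total_polls == 0)
		return FREQ_KEEP; /* no data in this window */

	empty_pct = empty_polls * 100 / total_polls;

	if (empty_pct < low_pct)
		return FREQ_RAISE; /* few empty polls: the core is busy */
	if (empty_pct > high_pct)
		return FREQ_LOWER; /* mostly empty polls: spare time */
	return FREQ_KEEP;
}
```

With thresholds of 10 and 90, a window with 95 empty polls out of 100
yields FREQ_LOWER, and a window with 5 empty polls out of 100 yields
FREQ_RAISE. The counters are assumed to be reset every window so the
multiplication cannot overflow.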
> > 
> > Hi Liang/Radu,
> > 
> > I can see the benefit of providing an API for the application to provide
> > the num rx from each poll, and then have the library step down/up the
> > freq based on that. However, not sure I follow why you are adding the
> > complexity of defining power states and training modes.
> > 
> > > 2.1 Power state definition:
> > > 
> > > 	LOW:  the frequency is used for purge mode.
> > > 
> > > 	MED:  the frequency is used to process modest traffic workload.
> > > 
> > > 	HIGH: the frequency is used to process busy traffic workload.
> > > 
> > 
> > Why does there need to be user defined freq levels? Why not just keep
> > stepping down the freq until there is some user-defined threshold of
> > zero polls reached. e.g. keep stepping down until 10% of polls are zero
> > poll and have a tail of some time (perhaps user defined) for the step down.
> Transferring from one P-state to another requires an MSR update, which is
> expensive, and swapping states too many times will disturb the worker
> core's performance.
> > 
> > > 2.2 There are two phases to establish the power management system:
> > > 
> > > 	a. Initialization/Training phase. No traffic passes through;
> > > 	   the system measures the average empty poll count in each of
> > > 	   the LOW/MED/HIGH power states. Those averages are the
> > > 	   baseline for the normal phase. The system collects every
> > > 	   core's counters every 100ms. The training phase takes 5 seconds.
> > > 
> > 
> > This is requiring an application to sit for 5 secs in order to train and
> > align poll numbers with states? That doesn't seem realistic to me.
> Because each CPU SKU has a different configuration, micro-architecture,
> cache size, number of power states, etc., it has to be tested in the
> training phase to find the baseline. A simple app can block RX for the
> first 5 secs.
> > 
> > > 	b. Normal phase. When real traffic passes through, the system
> > > 	   compares the run-time empty poll moving average with the
> > > 	   baseline, then decides whether to move to the HIGH power state
> > > 	   or the MED power state. The system collects every core's
> > > 	   counters every 10ms.
> > > 
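The normal-phase comparison could look roughly like the sketch below. It
is an illustration under stated assumptions, not the code in the patch:
the window length WINDOW, the names ep_avg, ep_avg_push and
ep_choose_state, and the busy_pct threshold are all hypothetical. It
keeps a simple moving average of the empty poll count per 10ms sample
and picks HIGH when that average falls well below the baseline recorded
during training.

```c
#include <stdint.h>

#define WINDOW 8 /* samples averaged; an assumption, not from the patch */

/* Hypothetical moving-average state for one core. */
struct ep_avg {
	uint64_t samples[WINDOW];
	uint64_t sum;       /* sum of the valid samples */
	unsigned int next;  /* slot that the next sample overwrites */
	unsigned int count; /* number of valid samples, at most WINDOW */
};

/* Add one sample (empty polls in the last interval), return the average. */
static uint64_t
ep_avg_push(struct ep_avg *a, uint64_t empty_polls)
{
	if (a->count == WINDOW)
		a->sum -= a->samples[a->next]; /* drop the oldest sample */
	else
		a->count++;

	a->samples[a->next] = empty_polls;
	a->sum += empty_polls;
	a->next = (a->next + 1) % WINDOW;

	return a->sum / a->count;
}

enum pstate { STATE_MED, STATE_HIGH };

/*
 * Choose HIGH when the average drops below busy_pct percent of the
 * empty poll baseline measured at MED during training.
 */
static enum pstate
ep_choose_state(uint64_t avg, uint64_t med_baseline, unsigned int busy_pct)
{
	return (avg * 100 < med_baseline * busy_pct) ? STATE_HIGH : STATE_MED;
}
```

The state must be zero-initialized before the first ep_avg_push() call.
Averaging over several samples is one way to avoid changing P-state on
every short burst, which matters given the cost of MSR updates mentioned
earlier in the thread.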
> > 
> > I only reviewed this commit msg and API usage, so maybe I didn't fully
> > get the use case or details, but it seems quite awkward from an
> > application perspective IMHO.
> > 
> > > 3. Proposed  API
> > > 
> > > 1.  rte_power_empty_poll_stat_init(void);
> > > which is used to initialize the power management system.
> > >  
> > > 2.  rte_power_empty_poll_stat_free(void);
> > > which is used to free the resources held by the power management system.
> > >  
> > > 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > > which is used to update specific core empty poll counter, not thread safe
> > >  
> > > 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > > which is used to update specific core valid poll counter, not thread safe
> > >  
> > 
> > I think 4 could be dropped and 3 used instead. It could be a simple API
> > that takes in the core and nb_pkts from a poll. Seems clearer than
> > making a separate API for a special value of nb_pkts (i.e. 0) and the
> > application having to check to know which API should be called.
> Agree.
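For illustration, a single update call taking the packet count from a
poll could be sketched as below. The names poll_stat_update,
lcore_poll_stat and MAX_LCORES are hypothetical and do not come from the
patch; a real implementation would presumably size the array with DPDK's
RTE_MAX_LCORE. Like the calls in the proposal, this sketch is not thread
safe.

```c
#include <stdint.h>

#define MAX_LCORES 128 /* placeholder limit for this sketch */

/* Hypothetical per-lcore counters. */
struct lcore_poll_stat {
	uint64_t empty; /* polls that returned no packets */
	uint64_t valid; /* polls that returned at least one packet */
	uint64_t pkts;  /* packets received over all valid polls */
};

static struct lcore_poll_stat poll_stat[MAX_LCORES];

/*
 * Record the result of one poll on lcore_id.
 * Returns 0 on success, -1 if lcore_id is out of range.
 */
static int
poll_stat_update(unsigned int lcore_id, uint16_t nb_pkt)
{
	if (lcore_id >= MAX_LCORES)
		return -1;

	if (nb_pkt == 0) {
		poll_stat[lcore_id].empty++;
	} else {
		poll_stat[lcore_id].valid++;
		poll_stat[lcore_id].pkts += nb_pkt;
	}
	return 0;
}
```

The application then calls poll_stat_update(lcore_id, nb_rx) after every
RX burst, without checking for the zero case itself.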
> > 
> > > 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > > which is used to get specific core empty poll counter.
> > >  
> > > 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> > > which is used to get specific core valid poll counter.
> > > 
> > > 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > > which allows the user to customize the frequency of a power state.
> > > 
> > > 8.  rte_power_empty_poll_setup_timer(void);
> > > which is used to set up the timer/callback to process all the above counters.
> > > 
> > 
> > The new API should be experimental
> > 
> > > ChangeLog:
> > > v2: fix some coding style issues
> > > v3: rename the filename, API name.
> > > v4: updated makefile and symbol list
> > > 
> > > Signed-off-by: Liang Ma <liang.j.ma at intel.com>
> > > Signed-off-by: Radu Nicolau <radu.nicolau at intel.com>
> > > ---
> > >  lib/librte_power/Makefile               |   5 +-
> > >  lib/librte_power/meson.build            |   5 +-
> > >  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> > >  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> > >  lib/librte_power/rte_power_version.map  |  14 +-
> > >  5 files changed, 742 insertions(+), 5 deletions(-)
> > >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> > >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> > > 
> > 
> > Is there any in-tree documentation planned?

You did not reply to this question.
Usually, a new API must come with documentation in the programmer's guide.

It would be interesting to have an opinion about this complicated API
from outside of Intel.

This feature should wait for 18.11.



