[dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control

Liang, Ma liang.j.ma at intel.com
Tue Oct 2 12:13:19 CEST 2018


Hi Dave, 
    Please check comment below. 

On 28 Sep 11:47, Hunt, David wrote:
> Hi Liang,
> 
> 
> On 17/9/2018 2:30 PM, Liang Ma wrote:
> >1. Abstract
> >
> >For packet processing workloads such as DPDK polling is continuous.
> >This means CPU cores always show 100% busy independent of how much work
> >those cores are doing. It is critical to accurately determine how busy
> >a core is hugely important for the following reasons:
> >
> >    * No indication of overload conditions
> >
> >    * User do not know how much real load is on a system meaning resulted 
> >    in
> >      wasted energy as no power management is utilized
> >
> >Compared to the original l3fwd-power design, instead of going to sleep
> >after detecting an empty poll, the new mechanism just lowers the core
> >frequency. As a result, the application does not stop polling the device,
> >which leads to improved handling of bursts of traffic.
> >
> >When the system become busy, the empty poll mechanism can also increase the
> >core frequency (including turbo) to do best effort for intensive traffic.
> >This gives us more flexible and balanced traffic awareness over the
> >standard l3fwd-power application.
> >
> >2. Proposed solution
> >
> >The proposed solution focuses on how many times empty polls are executed.
> >The less the number of empty polls, means current core is busy with
> >processing workload, therefore, the higher frequency is needed. The high
> >empty poll number indicates the current core not doing any real work
> >therefore, we can lower the frequency to safe power.
> >
> >In the current implementation, each core has 1 empty-poll counter which
> >assume 1 core is dedicated to 1 queue. This will need to be expanded in the
> >future to support multiple queues per core.
> >
> >2.1 Power state definition:
> >
> >	LOW:  Not currently used, reserved for future use.
> >
> >	MED:  the frequency is used to process modest traffic workload.
> >
> >	HIGH: the frequency is used to process busy traffic workload.
> >
> >2.2 There are two phases to establish the power management system:
> >
> >	a.Initialization/Training phase. The training phase is necessary
> >	  in order to figure out the system polling baseline numbers from
> >	  idle to busy. The highest poll count will be during idle, where
> >	  all polls are empty. These poll counts will be different between
> >	  systems due to the many possible processor micro-arch, cache
> >	  and device configurations, hence the training phase.
> >   	  In the training phase, traffic is blocked so the training
> >   	  algorithm can average the empty-poll numbers for the LOW, MED and
> >  	  HIGH  power states in order to create a baseline.
> >   	  The core's counter are collected every 10ms, and the Training
> >  	  phase will take 2 seconds.
> >  	  Training is disabled as default configuration. the default
> >  	  parameter is applied. Simple App still can trigger training
> 
> Typo: "Simple" should be "Sample"
> 
> Suggest adding: Once the training phase has been executed once on a 
> system, the application
> can then be started with the relevant thresholds provided on the command 
> line, allowing the
> application to start passing start traffic immediately.
agree
> 
> >  	  if that's needed.
> >
> >	b.Normal phase. When the training phase is complete, traffic is
> >   	  started. The run-time poll counts are compared with the
> >	  baseline and the decision will be taken to move to MED power
> >   	  state or HIGH power state. The counters are calculated every 10ms.
> 
> Propose changing the first sentence:  Traffic starts immediately based 
> on the default
> thresholds, or based on the user supplied thresholds via the command 
> line parameters.
>
agree
> 
> 
> 
> >3. Proposed  API
> >
> >1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
> >		uint8_t *freq_tlb, struct ep_policy *policy);
> >which is used to initialize the power management system.
> >  
> >2.  rte_power_empty_poll_stat_free(void);
> >which is used to free the resource hold by power management system.
> >  
> >3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> >which is used to update specific core empty poll counter, not thread safe
> >  
> >4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> >which is used to update specific core valid poll counter, not thread safe
> >  
> >5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> >which is used to get specific core empty poll counter.
> >  
> >6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> >which is used to get specific core valid poll counter.
> >
> >7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> >which is used to detect empty poll state changes then take action.
> >
> >ChangeLog:
> >v2: fix some coding style issues.
> >v3: rename the filename, API name.
> >v4: no change.
> >v5: no change.
> >v6: re-work the code layout, update API.
> >v7: fix minor typo and lift node num limit.
> >v8: disable training as default option.
> >
> >Signed-off-by: Liang Ma <liang.j.ma at intel.com>
> >
> >Reviewed-by: Lei Yao <lei.a.yao at intel.com>
> >---
> >  lib/librte_power/Makefile               |   6 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 539 
> >  ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  13 +
> >  5 files changed, 778 insertions(+), 4 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> >
> >diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> >index 6f85e88..a8f1301 100644
> >--- a/lib/librte_power/Makefile
> >+++ b/lib/librte_power/Makefile
> >@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
> >  # library name
> >  LIB = librte_power.a
> >  
> >+CFLAGS += -DALLOW_EXPERIMENTAL_API
> >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> >-LDLIBS += -lrte_eal
> >+LDLIBS += -lrte_eal -lrte_timer
> >  
> >  EXPORT_MAP := rte_power_version.map
> >  
> >@@ -16,8 +17,9 @@ LIBABIVER := 1
> >  # all source are stored in SRCS-y
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> >+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
> >  
> >  # install this header file
> >-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> >+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  
> >rte_power_empty_poll.h
> >  
> >  include $(RTE_SDK)/mk/rte.lib.mk
> >diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> >index 253173f..63957eb 100644
> >--- a/lib/librte_power/meson.build
> >+++ b/lib/librte_power/meson.build
> >@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
> >  	build = false
> >  endif
> >  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> >-		'power_kvm_vm.c', 'guest_channel.c')
> >-headers = files('rte_power.h')
> >+		'power_kvm_vm.c', 'guest_channel.c',
> >+		'rte_power_empty_poll.c')
> >+headers = files('rte_power.h','rte_power_empty_poll.h')
> >diff --git a/lib/librte_power/rte_power_empty_poll.c 
> >b/lib/librte_power/rte_power_empty_poll.c
> >new file mode 100644
> >index 0000000..587ce78
> >--- /dev/null
> >+++ b/lib/librte_power/rte_power_empty_poll.c
> >@@ -0,0 +1,539 @@
> >+/* SPDX-License-Identifier: BSD-3-Clause
> >+ * Copyright(c) 2010-2018 Intel Corporation
> >+ */
> >+
> >+#include <string.h>
> >+
> >+#include <rte_lcore.h>
> >+#include <rte_cycles.h>
> >+#include <rte_atomic.h>
> >+#include <rte_malloc.h>
> >+
> >+#include "rte_power.h"
> >+#include "rte_power_empty_poll.h"
> >+
> >+#define INTERVALS_PER_SECOND 100     /* (10ms) */
> >+#define SECONDS_TO_TRAIN_FOR 2
> >+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> >+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> >+#define DEFAULT_CYCLES_PER_PACKET 800
> >+
> >+static struct ep_params *ep_params;
> >+static uint32_t med_to_high_threshold = 
> >DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> >+static uint32_t high_to_med_threshold = 
> >DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> >+
> >+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> >+
> >+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> >+
> >+static uint32_t freq_index[NUM_FREQ];
> >+
> >+static uint32_t
> >+get_freq_index(enum freq_val index)
> >+{
> >+	return freq_index[index];
> >+}
> >+
> >+
> >+static int
> >+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> >+{
> >+	int err = 0;
> >+	uint32_t power_freq_index;
> >+	if (!specific_freq)
> >+		power_freq_index = get_freq_index(freq);
> >+	else
> >+		power_freq_index = freq;
> >+
> >+	err = rte_power_set_freq(lcore_id, power_freq_index);
> >+
> >+	return err;
> >+}
> >+
> >+
> >+static inline void __attribute__((always_inline))
> >+exit_training_state(struct priority_worker *poll_stats)
> >+{
> >+	RTE_SET_USED(poll_stats);
> >+}
> >+
> 
> Is this really needed? It does nothing, and is just local to this file.
> 
this is needed for debug purpose, I prefer keep it. 
> 
> >+static inline void __attribute__((always_inline))
> >+enter_training_state(struct priority_worker *poll_stats)
> >+{
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->cur_freq = LOW;
> >+	poll_stats->queue_state = TRAINING;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_normal_state(struct priority_worker *poll_stats)
> >+{
> >+	/* Clear the averages arrays and strs */
> >+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> >+	poll_stats->ec = 0;
> >+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> >+	poll_stats->pc = 0;
> >+
> >+	poll_stats->cur_freq = MED;
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->threshold_ctr = 0;
> >+	poll_stats->queue_state = MED_NORMAL;
> >+	RTE_LOG(INFO, POWER, "set the poewr freq to MED\n");
> 
> Typo, "poewr" should be "power", also suggest "Set" rather than "set"
>
agree
> 
> >+	set_power_freq(poll_stats->lcore_id, MED, false);
> >+
> >+	/* Try here */
> 
> Not sure about this comment?
will be removed
> 
> >+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> >+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_busy_state(struct priority_worker *poll_stats)
> >+{
> >+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> >+	poll_stats->ec = 0;
> >+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> >+	poll_stats->pc = 0;
> >+
> >+	poll_stats->cur_freq = HGH;
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->threshold_ctr = 0;
> >+	poll_stats->queue_state = HGH_BUSY;
> >+	set_power_freq(poll_stats->lcore_id, HGH, false);
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_purge_state(struct priority_worker *poll_stats)
> >+{
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->queue_state = LOW_PURGE;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+set_state(struct priority_worker *poll_stats,
> >+		enum queue_state new_state)
> >+{
> >+	enum queue_state old_state = poll_stats->queue_state;
> >+	if (old_state != new_state) {
> >+
> >+		/* Call any old state exit functions */
> >+		if (old_state == TRAINING)
> >+			exit_training_state(poll_stats);
> 
> Is this needed? exit_training_state() does nothing.
> 
original code is used for debug purpose. we can leave it.
> >+		/* Call any new state entry functions */
> >+		if (new_state == TRAINING)
> >+			enter_training_state(poll_stats);
> >+		if (new_state == MED_NORMAL)
> >+			enter_normal_state(poll_stats);
> >+		if (new_state == HGH_BUSY)
> >+			enter_busy_state(poll_stats);
> >+		if (new_state == LOW_PURGE)
> >+			enter_purge_state(poll_stats);
> >+	}
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+set_policy(struct priority_worker *poll_stats,
> >+		struct ep_policy *policy)
> >+{
> >+	set_state(poll_stats, policy->state);
> >+
> >+	if (policy->state == TRAINING)
> >+		return;
> >+
> >+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
> >+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
> >+
> >+	poll_stats->thresh[MED_NORMAL].trained = true;
> >+	poll_stats->thresh[HGH_BUSY].trained = true;
> >+
> >+}
> >+
> >+static void
> >+update_training_stats(struct priority_worker *poll_stats,
> >+		uint32_t freq,
> >+		bool specific_freq,
> >+		uint32_t max_train_iter)
> >+{
> >+	RTE_SET_USED(specific_freq);
> >+
> >+	char pfi_str[32];
> >+	uint64_t p0_empty_deq;
> >+
> >+	sprintf(pfi_str, "%02d", freq);
> >+
> >+	if (poll_stats->cur_freq == freq &&
> >+			poll_stats->thresh[freq].trained == false) {
> >+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> >+
> >+			set_power_freq(poll_stats->lcore_id,
> >+					freq, specific_freq);
> >+
> >+			poll_stats->empty_dequeues_prev =
> >+				poll_stats->empty_dequeues;
> >+
> >+			poll_stats->thresh[freq].cur_train_iter++;
> >+
> >+			return;
> >+		} else if (poll_stats->thresh[freq].cur_train_iter
> >+				<= max_train_iter) {
> >+
> >+			p0_empty_deq = poll_stats->empty_dequeues -
> >+				poll_stats->empty_dequeues_prev;
> >+
> >+			poll_stats->empty_dequeues_prev =
> >+				poll_stats->empty_dequeues;
> >+
> >+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> >+			poll_stats->thresh[freq].cur_train_iter++;
> >+
> >+		} else {
> >+			if (poll_stats->thresh[freq].trained == false) {
> >+				poll_stats->thresh[freq].base_edpi =
> >+					poll_stats->thresh[freq].base_edpi /
> >+					max_train_iter;
> >+
> >+				/* Add on a factor of 0.05%*/
> >+				/* this should remove any */
> >+				/* false negatives when the system is 0% 
> >busy */
> 
> Multi line comment should follow the usual standard /* \n * text \n text 
> \n */
> 
agree
> >+				poll_stats->thresh[freq].base_edpi +=
> >+					poll_stats->thresh[freq].base_edpi / 
> >2000;
> >+
> >+				poll_stats->thresh[freq].trained = true;
> >+				poll_stats->cur_freq++;
> >+
> >+			}
> >+		}
> >+	}
> >+}
> >+
> >+static inline uint32_t __attribute__((always_inline))
> >+update_stats(struct priority_worker *poll_stats)
> >+{
> >+	uint64_t tot_edpi = 0, tot_ppi = 0;
> >+	uint32_t j, percent;
> >+
> >+	struct priority_worker *s = poll_stats;
> >+
> >+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> >+
> >+	s->empty_dequeues_prev = s->empty_dequeues;
> >+
> >+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> >+
> >+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> >+
> >+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> >+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> >+				"cur edpi %ld "
> >+				"base epdi %ld\n",
> >+				cur_edpi,
> >+				s->thresh[s->cur_freq].base_edpi);
> 
> Suggest making this log message more meaningful to the user. I suspect 
> that "cur_edpi" will not mean much to the user.
> What does edpi mean?
>
agree
> >+		/* Value to make us fail need debug log*/
> >+		return 1000UL;
> >+	}
> >+
> >+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> >+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> >+
> >+	for (j = 0; j < BINS_AV; j++) {
> >+		tot_edpi += s->edpi_av[j];
> >+		tot_ppi += s->ppi_av[j];
> >+	}
> >+
> >+	tot_edpi = tot_edpi / BINS_AV;
> >+
> >+	percent = 100 - (uint32_t)(((float)tot_edpi /
> >+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> >+
> >+	return (uint32_t)percent;
> >+}
> >+
> >+
> >+static inline void  __attribute__((always_inline))
> >+update_stats_normal(struct priority_worker *poll_stats)
> >+{
> >+	uint32_t percent;
> >+
> >+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
> >+
> >+		RTE_LOG(DEBUG, POWER, "cure freq is %d, edpi is %lu\n",
> >+				poll_stats->cur_freq,
> >+			 poll_stats->thresh[poll_stats->cur_freq].base_edpi);
> 
> Again, a more meaningful explanation of edpi is needed here.
> 
agree
> >+		return;
> >+	}
> >+
> >+	percent = update_stats(poll_stats);
> >+
> >+	if (percent > 100) {
> >+		RTE_LOG(DEBUG, POWER, "Big than 100 abnormal\n");
> 
> Please change to something meaningful to the user. What is the 
> percentage returned from update_stats()?
> 
agree
> >+		return;
> >+	}
> >+
> >+	if (poll_stats->cur_freq == LOW)
> >+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> 
> Suggest adding "currently" - "Purge Mode is not currently supported\n"
> 
agree
> >+	else if (poll_stats->cur_freq == MED) {
> >+
> >+		if (percent >
> >+			poll_stats->thresh[MED].threshold_percent) {
> >+
> >+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> >+				poll_stats->threshold_ctr++;
> >+			else {
> >+				set_state(poll_stats, HGH_BUSY);
> >+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> >+			}
> >+
> >+		} else {
> >+			/* reset */
> >+			poll_stats->threshold_ctr = 0;
> >+		}
> >+
> >+	} else if (poll_stats->cur_freq == HGH) {
> >+
> >+		if (percent <
> >+				poll_stats->thresh[HGH].threshold_percent) {
> >+
> >+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> >+				poll_stats->threshold_ctr++;
> >+			else {
> >+				set_state(poll_stats, MED_NORMAL);
> >+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
> >+			}
> >+		} else {
> >+			/* reset */
> >+			poll_stats->threshold_ctr = 0;
> >+		}
> >+
> >+	}
> >+}
> >+
> >+static int
> >+empty_poll_training(struct priority_worker *poll_stats,
> >+		uint32_t max_train_iter)
> >+{
> >+
> >+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> >+		poll_stats->iter_counter++;
> >+		return 0;
> >+	}
> >+
> >+
> >+	update_training_stats(poll_stats,
> >+			LOW,
> >+			false,
> >+			max_train_iter);
> >+
> >+	update_training_stats(poll_stats,
> >+			MED,
> >+			false,
> >+			max_train_iter);
> >+
> >+	update_training_stats(poll_stats,
> >+			HGH,
> >+			false,
> >+			max_train_iter);
> >+
> >+
> >+	if (poll_stats->thresh[LOW].trained == true
> >+			&& poll_stats->thresh[MED].trained == true
> >+			&& poll_stats->thresh[HGH].trained == true) {
> >+
> >+		set_state(poll_stats, MED_NORMAL);
> >+
> >+		RTE_LOG(INFO, POWER, "Low threshold is %lu\n",
> >+				poll_stats->thresh[LOW].base_edpi);
> 
> Suggest "Low" change to "LOW" for consistency with other log messages below.
> 
agree
> >+
> >+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
> >+				poll_stats->thresh[MED].base_edpi);
> >+
> >+
> >+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
> >+				poll_stats->thresh[HGH].base_edpi);
> >+
> >+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> >+				poll_stats->lcore_id);
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> >+void
> >+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
> >+{
> >+
> >+	uint32_t i;
> >+
> >+	struct priority_worker *poll_stats;
> >+
> >+	RTE_SET_USED(tim);
> >+
> >+	RTE_SET_USED(arg);
> >+
> >+	for (i = 0; i < NUM_NODES; i++) {
> >+
> >+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> >+
> >+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> >+			continue;
> >+
> >+		switch (poll_stats->queue_state) {
> >+		case(TRAINING):
> >+			empty_poll_training(poll_stats,
> >+					ep_params->max_train_iter);
> >+			break;
> >+
> >+		case(HGH_BUSY):
> >+		case(MED_NORMAL):
> >+			update_stats_normal(poll_stats);
> >+			break;
> >+
> >+		case(LOW_PURGE):
> >+			break;
> >+		default:
> >+			break;
> >+
> >+		}
> >+
> >+	}
> >+
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> >+		struct ep_policy *policy)
> >+{
> >+	uint32_t i;
> >+	/* Allocate the ep_params structure */
> >+	ep_params = rte_zmalloc_socket(NULL,
> >+			sizeof(struct ep_params),
> >+			0,
> >+			rte_socket_id());
> >+
> >+	if (!ep_params)
> >+		rte_panic("Cannot allocate heap memory for ep_params "
> >+				"for socket %d\n", rte_socket_id());
> >+
> >+	if (freq_tlb == NULL) {
> >+		freq_index[LOW] = 14;
> >+		freq_index[MED] = 9;
> >+		freq_index[HGH] = 1;
> >+	} else {
> >+		freq_index[LOW] = freq_tlb[LOW];
> >+		freq_index[MED] = freq_tlb[MED];
> >+		freq_index[HGH] = freq_tlb[HGH];
> >+	}
> >+
> >+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> >+
> >+	/* 5 seconds worth of training */
> 
> This now looks to be 2 seconds from the #define above. Maybe: /* Train 
> for pre-defined period */
> 
agree
> >+	ep_params->max_train_iter = INTERVALS_PER_SECOND * 
> >SECONDS_TO_TRAIN_FOR;
> >+
> >+	struct stats_data *w = &ep_params->wrk_data;
> >+
> >+	*eptr = ep_params;
> >+
> >+	/* initialize all wrk_stats state */
> >+	for (i = 0; i < NUM_NODES; i++) {
> >+
> >+		if (rte_lcore_is_enabled(i) == 0)
> >+			continue;
> >+		/*init the freqs table */
> >+		total_avail_freqs[i] = rte_power_freqs(i,
> >+				avail_freqs[i],
> >+				NUM_FREQS);
> >+
> >+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
> >+				total_avail_freqs[i],
> >+				i);
> >+
> >+		if (get_freq_index(LOW) > total_avail_freqs[i])
> >+			return -1;
> >+
> >+		if (rte_get_master_lcore() != i) {
> >+			w->wrk_stats[i].lcore_id = i;
> >+			set_policy(&w->wrk_stats[i], policy);
> >+		}
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> >+void __rte_experimental
> >+rte_power_empty_poll_stat_free(void)
> >+{
> >+
> >+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> >+
> >+	if (ep_params != NULL)
> >+		rte_free(ep_params);
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_update(unsigned int lcore_id)
> >+{
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	poll_stats->empty_dequeues++;
> >+
> >+	return 0;
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> >+{
> >+
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	poll_stats->num_dequeue_pkts += nb_pkt;
> >+
> >+	return 0;
> >+}
> >+
> >+
> >+uint64_t __rte_experimental
> >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> >+{
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	return poll_stats->empty_dequeues;
> >+}
> >+
> >+uint64_t __rte_experimental
> >+rte_power_poll_stat_fetch(unsigned int lcore_id)
> >+{
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	return poll_stats->num_dequeue_pkts;
> >+}
> >diff --git a/lib/librte_power/rte_power_empty_poll.h 
> >b/lib/librte_power/rte_power_empty_poll.h
> >new file mode 100644
> >index 0000000..ae27f7d
> >--- /dev/null
> >+++ b/lib/librte_power/rte_power_empty_poll.h
> >@@ -0,0 +1,219 @@
> >+/* SPDX-License-Identifier: BSD-3-Clause
> >+ * Copyright(c) 2010-2018 Intel Corporation
> >+ */
> >+
> >+#ifndef _RTE_EMPTY_POLL_H
> >+#define _RTE_EMPTY_POLL_H
> >+
> >+/**
> >+ * @file
> >+ * RTE Power Management
> >+ */
> >+#include <stdint.h>
> >+#include <stdbool.h>
> >+
> >+#include <rte_common.h>
> >+#include <rte_byteorder.h>
> >+#include <rte_log.h>
> >+#include <rte_string_fns.h>
> >+#include <rte_power.h>
> >+#include <rte_timer.h>
> >+
> >+#ifdef __cplusplus
> >+extern "C" {
> >+#endif
> >+
> >+#define NUM_FREQS 20
> 
> I don't think this is enough. Suggest using RTE_MAX_LCORE_FREQS
> 
agree
> >+
> >+#define BINS_AV 4 /* Has to be ^2 */
> >+
> >+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> >+
> >+#define NUM_PRIORITIES          2
> >+
> >+#define NUM_NODES         256  /* Max core number*/
> >+
> >+/* Processor Power State */
> >+enum freq_val {
> >+	LOW,
> >+	MED,
> >+	HGH,
> >+	NUM_FREQ = NUM_FREQS
> >+};
> >+
> 
> Why is NUM_FREQ in this enum? 0,1,2,20 (or RTE_MAX_LCORE_FREQS) does not 
> seem right.
> If you are using NUM_FREQ in the code then why not just use NUM_FREQS.
> 
the reason is to have some spare space in case we need extend to more stage.

> >+
> >+/* Queue Polling State */
> >+enum queue_state {
> >+	TRAINING, /* NO TRAFFIC */
> >+	MED_NORMAL,   /* MED */
> >+	HGH_BUSY,     /* HIGH */
> >+	LOW_PURGE,    /* LOW */
> >+};
> >+
> >+/* Queue Stats */
> >+struct freq_threshold {
> >+
> >+	uint64_t base_edpi;
> >+	bool trained;
> >+	uint32_t threshold_percent;
> >+	uint32_t cur_train_iter;
> >+};
> >+
> >+/* Each Worder Thread Empty Poll Stats */
> >+struct priority_worker {
> >+
> >+	/* Current dequeue and throughput counts */
> >+	/* These 2 are written to by the worker threads */
> >+	/* So keep them on their own cache line */
> >+	uint64_t empty_dequeues;
> >+	uint64_t num_dequeue_pkts;
> >+
> >+	enum queue_state queue_state;
> >+
> >+	uint64_t empty_dequeues_prev;
> >+	uint64_t num_dequeue_pkts_prev;
> >+
> >+	/* Used for training only */
> >+	struct freq_threshold thresh[NUM_FREQ];
> >+	enum freq_val cur_freq;
> >+
> >+	/* bucket arrays to calculate the averages */
> >+	uint64_t edpi_av[BINS_AV];
> >+	uint32_t  ec;
> >+	uint64_t ppi_av[BINS_AV];
> >+	uint32_t  pc;
> >+
> >+	uint32_t lcore_id;
> >+	uint32_t iter_counter;
> >+	uint32_t threshold_ctr;
> >+	uint32_t display_ctr;
> >+	uint8_t  dev_id;
> >+
> >+} __rte_cache_aligned;
> >+
> 
> Suggest adding a comment on each of the variables above explaining what 
> the acronym means.
> E.g. edpi, ec, pc, ppi.
> 
agree
> 
> >+
> >+struct stats_data {
> >+
> >+	struct priority_worker wrk_stats[NUM_NODES];
> >+
> >+	/* flag to stop rx threads processing packets until training over */
> >+	bool start_rx;
> >+
> >+};
> >+
> >+/* Empty Poll Parameters */
> >+struct ep_params {
> >+
> >+	/* Timer related stuff */
> >+	uint64_t interval_ticks;
> >+	uint32_t max_train_iter;
> >+
> >+	struct rte_timer timer0;
> >+	struct stats_data wrk_data;
> >+};
> >+
> >+
> >+/* Sample App Init information */
> >+struct ep_policy {
> >+
> >+	uint64_t med_base_edpi;
> >+	uint64_t hgh_base_edpi;
> >+
> >+	enum queue_state state;
> >+};
> >+
> >+
> >+
> >+/**
> >+ * Initialize the power management system.
> >+ *
> >+ * @param eptr
> >+ *   the structure of empty poll configuration
> >+ * @freq_tlb
> >+ *   the power state/frequency  mapping table
> >+ * @policy
> >+ *   the initialization policy from sample app
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> >+		struct ep_policy *policy);
> >+
> >+/**
> >+ * Free the resource hold by power management system.
> >+ */
> >+void __rte_experimental
> >+rte_power_empty_poll_stat_free(void);
> >+
> >+/**
> >+ * Update specific core empty poll counter
> >+ * It's not thread safe.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_update(unsigned int lcore_id);
> >+
> >+/**
> >+ * Update specific core valid poll counter, not thread safe.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id.
> >+ * @param nb_pkt
> >+ *  The packet number of one valid poll.
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> >+
> >+/**
> >+ * Fetch specific core empty poll counter.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  Current lcore empty poll counter value.
> >+ */
> >+uint64_t __rte_experimental
> >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> >+
> >+/**
> >+ * Fetch specific core valid poll counter.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  Current lcore valid poll counter value.
> >+ */
> >+uint64_t __rte_experimental
> >+rte_power_poll_stat_fetch(unsigned int lcore_id);
> >+
> >+/**
> >+ * Empty poll  state change detection function
> >+ *
> >+ * @param  tim
> >+ *  The timer structure
> >+ * @param  arg
> >+ *  The customized parameter
> >+ */
> >+void  __rte_experimental
> >+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> >+
> >+#ifdef __cplusplus
> >+}
> >+#endif
> >+
> >+#endif
> >diff --git a/lib/librte_power/rte_power_version.map 
> >b/lib/librte_power/rte_power_version.map
> >index dd587df..11ffdfb 100644
> >--- a/lib/librte_power/rte_power_version.map
> >+++ b/lib/librte_power/rte_power_version.map
> >@@ -33,3 +33,16 @@ DPDK_18.08 {
> >  	rte_power_get_capabilities;
> >  
> >  } DPDK_17.11;
> >+
> >+EXPERIMENTAL {
> >+        global:
> >+
> >+        rte_power_empty_poll_stat_init;
> >+        rte_power_empty_poll_stat_free;
> >+        rte_power_empty_poll_stat_update;
> >+        rte_power_empty_poll_stat_fetch;
> >+        rte_power_poll_stat_fetch;
> >+        rte_power_poll_stat_update;
> >+        rte_empty_poll_detection;
> >+
> >+};
> 
> checkpatch has several warnings:
> 
> 
> 
> ### lib/librte_power: traffic pattern aware power control
> 
> WARNING:LONG_LINE: line over 80 characters
> #355: FILE: lib/librte_power/rte_power_empty_poll.c:199:
> + poll_stats->thresh[freq].base_edpi / 2000;
> 
> WARNING:LONG_LINE: line over 80 characters
> #417: FILE: lib/librte_power/rte_power_empty_poll.c:261:
> + poll_stats->thresh[poll_stats->cur_freq].base_edpi);
> 
> total: 0 errors, 2 warnings, 802 lines checked
> ERROR: symbol rte_empty_poll_detection is added in a section other than 
> the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_fetch is added in a section 
> other than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_free is added in a section other 
> than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_init is added in a section other 
> than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_update is added in a section 
> other than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_poll_stat_fetch is added in a section other than 
> the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_poll_stat_update is added in a section other 
> than the EXPERIMENTAL section of the version map
> Warning in /lib/librte_power/rte_power_empty_poll.c:
> are you sure you want to add the following:
> rte_panic\(
> 
version map checking script(awk script to parse section name) has some issue here. 
> 
> Rgds,
> Dave.
> 
> 
> 
> 
> 
> 


More information about the dev mailing list