[PATCH v1 1/2] eal: add lcore busyness telemetry
Bruce Richardson
bruce.richardson at intel.com
Fri Jul 15 16:11:46 CEST 2022
On Fri, Jul 15, 2022 at 07:16:17PM +0530, Jerin Jacob wrote:
> On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
> <anatoly.burakov at intel.com> wrote:
> >
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new EAL
> > API that will be able to passively track core busyness.
> >
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets. Empty polls can be counted as "idle", while
> > non-empty polls can be counted as busy. To measure lcore busyness, we
> > simply call the telemetry timestamping function with the number of polls
> > a particular code section has processed, and count the number of cycles
> > we've spent processing empty bursts. The more empty bursts we encounter,
> > the less cycles we spend in "busy" state, and the less core busyness
> > will be reported.
> >
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following parts
> > of DPDK are instrumented with lcore telemetry calls:
> >
> > - All major driver API's:
> > - ethdev
> > - cryptodev
> > - compressdev
> > - regexdev
> > - bbdev
> > - rawdev
> > - eventdev
> > - dmadev
> > - Some additional libraries:
> > - ring
> > - distributor
> >
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping function
> > is wrapped into a macro, so that whenever telemetry is disabled, it only
> > takes one additional branch and no function calls are performed. It is
> > also possible to disable it at compile time by commenting out
> > RTE_LCORE_BUSYNESS from build config.
> >
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> >
> > Signed-off-by: Kevin Laatz <kevin.laatz at intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh at intel.com>
> > Signed-off-by: David Hunt <david.hunt at intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>
>
> It is a good feature. Thanks for this work.
>
+1
Some follow-up comments inline below.
/Bruce
>
> > ---
> >
> > Notes:
> > We did a couple of quick smoke tests to see if this patch causes any performance
> > degradation, and it seemed to have none that we could measure. Telemetry can be
> > disabled at compile time via a config option, while at runtime it can be
> > disabled, seemingly at a cost of one additional branch.
> >
> > That said, our benchmarking efforts were admittedly not very rigorous, so
> > comments welcome!
>
> >
> > diff --git a/config/rte_config.h b/config/rte_config.h
> > index 46549cb062..583cb6f7a5 100644
> > --- a/config/rte_config.h
> > +++ b/config/rte_config.h
> > @@ -39,6 +39,8 @@
> > #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
> > #define RTE_BACKTRACE 1
> > #define RTE_MAX_VFIO_CONTAINERS 64
> > +#define RTE_LCORE_BUSYNESS 1
>
> Please don't enable debug features in fastpath as default.
>
I would disagree that this is a debug feature. The number of times I have
heard from DPDK users that they wish to get more visibility into what the
app is doing rather than seeing 100% cpu busy. Therefore, I'd see this as
enabled by default rather than disabled - unless it's shown to have a
performance regression.
That said, since this impacts multiple components and it's something that
end users might want to disable at build-time, I'd suggest moving it to
meson_options.txt file and have it as an official DPDK build-time option.
> > +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
> >
> > +
> > +#include <unistd.h>
> > +#include <limits.h>
> > +#include <string.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_cycles.h>
> > +#include <rte_errno.h>
> > +#include <rte_lcore.h>
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
>
> This clutter may not be required. Let it compile all cases.
>
> > +#include <rte_telemetry.h>
> > +#endif
> > +
> > +int __rte_lcore_telemetry_enabled;
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +
> > +struct lcore_telemetry {
> > + int busyness;
> > + /**< Calculated busyness (gets set/returned by the API) */
> > + int raw_busyness;
> > + /**< Calculated busyness times 100. */
> > + uint64_t interval_ts;
> > + /**< when previous telemetry interval started */
> > + uint64_t empty_cycles;
> > + /**< empty cycle count since last interval */
> > + uint64_t last_poll_ts;
> > + /**< last poll timestamp */
> > + bool last_empty;
> > + /**< if last poll was empty */
> > + unsigned int contig_poll_cnt;
> > + /**< contiguous (always empty/non empty) poll counter */
> > +} __rte_cache_aligned;
> > +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
>
> Allocate this from hugepage.
>
Yes, whether or not it's allocated from hugepages, dynamic allocation would
be better than having static vars.
> > +
> > +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> > +{
> > + const unsigned int lcore_id = rte_lcore_id();
> > + uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> > + struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
> > + const bool empty = nb_rx == 0;
> > + uint64_t diff_int, diff_last;
> > + bool last_empty;
> > +
> > + last_empty = tdata->last_empty;
> > +
> > + /* optimization: don't do anything if status hasn't changed */
> > + if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
> > + return;
> > + /* status changed or we're waiting for too long, reset counter */
> > + tdata->contig_poll_cnt = 0;
> > +
> > + cur_tsc = rte_rdtsc();
> > +
> > + interval_ts = tdata->interval_ts;
> > + empty_cycles = tdata->empty_cycles;
> > + last_poll_ts = tdata->last_poll_ts;
> > +
> > + diff_int = cur_tsc - interval_ts;
> > + diff_last = cur_tsc - last_poll_ts;
> > +
> > + /* is this the first time we're here? */
> > + if (interval_ts == 0) {
> > + tdata->busyness = LCORE_BUSYNESS_MIN;
> > + tdata->raw_busyness = 0;
> > + tdata->interval_ts = cur_tsc;
> > + tdata->empty_cycles = 0;
> > + tdata->contig_poll_cnt = 0;
> > + goto end;
> > + }
> > +
> > + /* update the empty counter if we got an empty poll earlier */
> > + if (last_empty)
> > + empty_cycles += diff_last;
> > +
> > + /* have we passed the interval? */
> > + if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
>
>
> I think, this function logic can be limited to just updating the
> timestamp in the ring buffer,
> and another control function of telemetry which runs in control core to do
> heavy lifting to reduce the performance impact on fast path,
>
> > + int raw_busyness;
> > +
> > + /* get updated busyness value */
> > + raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
> > +
> > + /* set a new interval, reset empty counter */
> > + tdata->interval_ts = cur_tsc;
> > + tdata->empty_cycles = 0;
> > + tdata->raw_busyness = raw_busyness;
> > + /* bring busyness back to 0..100 range, biased to round up */
> > + tdata->busyness = (raw_busyness + 50) / 100;
> > + } else
> > + /* we may have updated empty counter */
> > + tdata->empty_cycles = empty_cycles;
> > +
> > +end:
> > + /* update status for next poll */
> > + tdata->last_poll_ts = cur_tsc;
> > + tdata->last_empty = empty;
> > +}
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> > + do { \
> > + if (__rte_lcore_telemetry_enabled) \
>
> I think, rather than reading memory, Like Linux perf infrastructure,
> we can patch up the instruction steam as NOP vs Timestamp capture
> function.
>
Surely that requires much more complicated tooling? How would that work in
this situation?
>
> Also instead of changing all libraries, Maybe we can use
> "-finstrument-functions".
> and just mark the function with attribute.
>
> Just 2c.
>
> > + __rte_lcore_telemetry_timestamp(nb_rx); \
> > + } while (0)
> > +#else
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> > + while (0)
> > +#endif
> > +
More information about the dev
mailing list