[PATCH v3 3/7] eal: add lcore variable performance test
Jerin Jacob
jerinjacobk at gmail.com
Wed Sep 18 12:04:58 CEST 2024
On Mon, Sep 16, 2024 at 4:20 PM Mattias Rönnblom <hofors at lysator.liu.se> wrote:
>
> On 2024-09-13 13:23, Jerin Jacob wrote:
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors at lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 17:11, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors at lysator.liu.se> wrote:
> >>>>
> >>>> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>>>> <mattias.ronnblom at ericsson.com> wrote:
> >>>>>>
> >>>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>>>> that the overhead isn't significantly greater than alternative
> >>>>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>>>> (i.e., when plenty of cache is available compared to the working set
> >>>>>> size of the per-lcore data).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
> >>>>>> ---
> >>>>>> app/test/meson.build | 1 +
> >>>>>> app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>>> 2 files changed, 161 insertions(+)
> >>>>>> create mode 100644 app/test/test_lcore_var_perf.c
> >>>>>
> >>>>>
> >>>>>> +static double
> >>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>>>> +{
> >>>>>> + uint64_t i;
> >>>>>> + uint64_t start;
> >>>>>> + uint64_t end;
> >>>>>> + double latency;
> >>>>>> +
> >>>>>> + init_fun();
> >>>>>> +
> >>>>>> + start = rte_get_timer_cycles();
> >>>>>> +
> >>>>>> + for (i = 0; i < ITERATIONS; i++)
> >>>>>> + update_fun();
> >>>>>> +
> >>>>>> + end = rte_get_timer_cycles();
> >>>>>
> >>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>>>
> >>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >>>
> >>> I was thinking of it another way: with 1e7 iterations, the additional
> >>> barrier in the precise variant will be amortized, and we get more
> >>> _deterministic_ behavior, especially if we print cycles and need to
> >>> catch regressions.
> >>
> >> If you time a section of code which spends ~40000000 cycles, it doesn't
> >> matter if you add or remove a few cycles at the beginning and the end.
> >>
> >> The rte_rdtsc_precise() is both better (more precise in the sense of
> >> more serialization), and worse (because it's more costly, and thus more
> >> intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> What you are interested in is primarily the impact on (instruction)
> throughput, not the latency of the sequence of instructions that must be
> retired in order to load the lcore variable values, when you switch from
> (say) lcore id-indexed static arrays to lcore variables in your module.
>
> Usually, there is no reason to make a distinction between latency and
> throughput in this context, but as you zoom into very short snippets of
> code being executed, the difference becomes relevant. For example,
> adding a div instruction won't necessarily add 12 cc to your program's
> execution time on a Zen 4, even though that is its latency. Rather, the
> effects may, depending on data dependencies and what other instructions
> are executed in parallel, be much smaller.
>
> So, one could argue the ILP you get with the loop is a feature, not a bug.
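To make the latency versus throughput point above concrete, here is a minimal sketch in plain C. It is not taken from the patch; the function names are hypothetical. Both loops perform n divisions, but only the first one forms a dependency chain, so only there would one expect the full division latency to show up per iteration. The divisor is a runtime parameter so that the compiler is less likely to replace the division with a multiplication, though it remains free to transform the loops in other ways.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Dependent chain: every division consumes the previous result, so the
 * divisions cannot overlap and the loop is bound by division latency.
 */
static uint64_t
div_chain(uint64_t seed, uint64_t d, uint64_t n)
{
	uint64_t x = seed;
	uint64_t i;

	assert(d != 0);

	for (i = 0; i < n; i++)
		x = x / d + seed;

	return x;
}

/*
 * Independent divisions: no division needs the result of another, so an
 * out-of-order core may overlap them and the loop is bound by division
 * throughput instead.
 */
static uint64_t
div_independent(uint64_t seed, uint64_t d, uint64_t n)
{
	uint64_t acc = 0;
	uint64_t i;

	assert(d != 0);

	for (i = 0; i < n; i++)
		acc += (seed + i) / d;

	return acc;
}
```

Timing each function over the same n with the same timer should show whether the core overlaps the independent divisions; the actual ratio depends on the CPU and compiler flags.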
>
> With or without per-iteration latency measurements, these benchmarks are
> not-very-useful at best, and misleading at worst. I will rework them to
> include more than a single module/lcore variable, which I think would be
> somewhat of an improvement.
OK. A module parameter will remove the compiler optimization and be more
accurate. I was doing manual loop unrolling [1] in a trace test case (for
small inline functions).
Either way is fine. Thanks for the rework.
[1]
https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30
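For illustration, the manual unrolling idea can be sketched as below. This is not the code in test_trace_perf.c; update(), UNROLL_8, now_ns() and bench_unrolled() are hypothetical names, and clock_gettime() is used as the timer only so that the sketch builds without DPDK. The point is that eight calls are issued back to back per loop iteration, so the loop counter and branch are amortized over several calls of the small function.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the small inline function under test. */
static volatile uint64_t counter;

static inline void
update(void)
{
	counter = counter + 1;
}

/* Repeat a statement eight times without any loop control in between. */
#define UNROLL_8(stmt) do { \
	stmt; stmt; stmt; stmt; \
	stmt; stmt; stmt; stmt; \
} while (0)

/* Portable timer, used only to keep the sketch free of DPDK calls. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Time rounds * 8 calls of update(); return the average ns per call. */
static double
bench_unrolled(uint64_t rounds, uint64_t *calls)
{
	uint64_t start, end, i;

	assert(rounds > 0 && calls != NULL);

	start = now_ns();
	for (i = 0; i < rounds; i++)
		UNROLL_8(update());
	end = now_ns();

	*calls = rounds * 8;
	return (double)(end - start) / (double)*calls;
}
```

In a DPDK test the timer would presumably be rte_rdtsc() or rte_get_timer_cycles() rather than clock_gettime(), and the result would be reported in cycles.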
>
> Even better would have some real domain logic, instead of just a dummy
> multiplication.
>
> >>
> >> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> >> doesn't matter.
> >
> > Yes, in this setup. But it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
> >
> > Patch 1: Change nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> > end = rte_get_timer_cycles();
> >
> > - latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > + latency = ((end - start)) / ITERATIONS;
> >
> > return latency;
> > }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > - printf("Latencies [ns/update]\n");
> > + printf("Latencies [cycles/update]\n");
> > printf("Thread-local storage Static array Lcore variables\n");
> > - printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > - sarray_latency * 1e9, lvar_latency * 1e9);
> > + printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> > return TEST_SUCCESS;
> > }
> >
> >
> > Patch 2: Change to precise with calibration
> > -----------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..8142ecd56241 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -96,23 +96,28 @@ lvar_update(void)
> > static double
> > benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> > {
> > - uint64_t i;
> > + double tsc_latency;
> > + double latency;
> > uint64_t start;
> > uint64_t end;
> > - double latency;
> > + uint64_t i;
> >
> > - init_fun();
> > + /* calculate rte_rdtsc_precise overhead */
> > + start = rte_rdtsc_precise();
> > + end = rte_rdtsc_precise();
> > + tsc_latency = (end - start);
> >
> > - start = rte_get_timer_cycles();
> > + init_fun();
> >
> > - for (i = 0; i < ITERATIONS; i++)
> > + latency = 0;
> > + for (i = 0; i < ITERATIONS; i++) {
> > + start = rte_rdtsc_precise();
> > update_fun();
> > + end = rte_rdtsc_precise();
> > + latency += (end - start) - tsc_latency;
> > + }
> >
> > - end = rte_get_timer_cycles();
> > -
> > - latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > -
> > - return latency;
> > + return latency / (double)ITERATIONS;
> > }
> >
> > static int
> > @@ -135,10 +140,9 @@ test_lcore_var_access(void)
> > sarray_latency = benchmark_access_method(sarray_init, sarray_update);
> > lvar_latency = benchmark_access_method(lvar_init, lvar_update);
> >
> > - printf("Latencies [ns/update]\n");
> > + printf("Latencies [cycles/update]\n");
> > printf("Thread-local storage Static array Lcore variables\n");
> > - printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > - sarray_latency * 1e9, lvar_latency * 1e9);
> > + printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> > return TEST_SUCCESS;
> > }
> >
> > ARM N2 core with patch 1(aka current scheme)
> > -----------------------------------
> >
> > + ------------------------------------------------------- +
> > + Test Suite : lcore variable perf autotest
> > + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage Static array Lcore variables
> > 7.0 7.0 7.0
> >
> >
> > ARM N2 core with patch 2
> > -----------------------------------
> >
> > + ------------------------------------------------------- +
> > + Test Suite : lcore variable perf autotest
> > + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage Static array Lcore variables
> > 11.4 15.5 15.5
> >
> > x86 i9 core with patch 1(aka current scheme)
> > ------------------------------------------------------------
> >
> > + ------------------------------------------------------- +
> > + Test Suite : lcore variable perf autotest
> > + ------------------------------------------------------- +
> > Latencies [ns/update]
> > Thread-local storage Static array Lcore variables
> > 5.0 6.0 6.0
> >
> > x86 i9 core with patch 2
> > --------------------------------
> > + ------------------------------------------------------- +
> > + Test Suite : lcore variable perf autotest
> > + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage Static array Lcore variables
> > 5.3 10.6 11.7
> >
> >
> >
> >
> >
> >>
> >>> Furthermore, you may consider replacing rte_random() in fast path to
> >>> running number or so if it is not deterministic in cycle computation.
> >>
> >> rte_rand() is not used in the fast path. I don't understand what you
> >
> > I missed that. Ignore this comment.
> >
> >> mean by "running number".