[PATCH v3 3/7] eal: add lcore variable performance test
    Jerin Jacob 
    jerinjacobk at gmail.com
       
    Mon Sep 16 10:12:02 CEST 2024
    
    
  
On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <mb at smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk at gmail.com]
> > Sent: Friday, 13 September 2024 13.24
> >
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors at lysator.liu.se>
> > wrote:
> > >
> > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors at lysator.liu.se>
> > wrote:
> > > >>
> > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > >>> <mattias.ronnblom at ericsson.com> wrote:
> > > >>>> +static double
> > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > (*update_fun)(void))
> > > >>>> +{
> > > >>>> +       uint64_t i;
> > > >>>> +       uint64_t start;
> > > >>>> +       uint64_t end;
> > > >>>> +       double latency;
> > > >>>> +
> > > >>>> +       init_fun();
> > > >>>> +
> > > >>>> +       start = rte_get_timer_cycles();
> > > >>>> +
> > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > >>>> +               update_fun();
> > > >>>> +
> > > >>>> +       end = rte_get_timer_cycles();
> > > >>>
> > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > >>
> > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > >
> > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > regressions.
> > >
> > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > matter if you add or remove a few cycles at the beginning and the end.
> > >
> > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > more serialization), and worse (because it's more costly, and thus more
> > > intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> > >
> > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > doesn't matter.
> >
> > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
>
> No, Mattias is right. The time is sampled once before the loop, then the function is executed 10 million (ITERATIONS) times in the loop, and then the time is sampled once again.
No. I am not disagreeing. That why I said, “Yes. In this setup”.
All I am saying, there is a more accurate way of doing measurement for
this test along with “data” at
https://mails.dpdk.org/archives/dev/2024-September/301227.html
>
> So the overhead and accuracy of the timing function is amortized across the 10 million calls to the function being measured, and becomes insignificant.
>
> Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. the mempool_perf test.
>
> Another detail: The for loop itself may cost a few cycles, which may not be irrelevant when measuring a function using very few cycles. If the compiler doesn't unroll the loop, it should be done manually:
>
>         for (i = 0; i < ITERATIONS / 100; i++) {
>                 update_fun();
>                 update_fun();
>                 ... repeated 100 times
I have done a similar scheme for trace perf for inline function test
at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30
Either the above scheme or the below scheme needs to be used as
mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html
+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }
>         }
>
>
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >         end = rte_get_timer_cycles();
> >
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > +       latency = ((end - start)) / ITERATIONS;
>
> This calculation uses integer arithmetic, which will round down the resulting latency.
> Please use floating point arithmetic: latency = (end - start) / (double)ITERATIONS;
Yup. It is in patch 2
https://mails.dpdk.org/archives/dev/2024-September/301227.html
>
> >
> >         return latency;
> >  }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >         printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >         return TEST_SUCCESS;
> >  }
    
    
More information about the dev
mailing list