[PATCH v3 3/7] eal: add lcore variable performance test
    Morten Brørup 
    mb at smartsharesystems.com
       
    Fri Sep 13 16:40:12 CEST 2024
    
    
  
> From: Jerin Jacob [mailto:jerinjacobk at gmail.com]
> Sent: Friday, 13 September 2024 13.24
> 
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors at lysator.liu.se>
> wrote:
> >
> > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors at lysator.liu.se>
> wrote:
> > >>
> > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > >>> <mattias.ronnblom at ericsson.com> wrote:
> > >>>> +static double
> > >>>> +benchmark_access_method(void (*init_fun)(void), void
> (*update_fun)(void))
> > >>>> +{
> > >>>> +       uint64_t i;
> > >>>> +       uint64_t start;
> > >>>> +       uint64_t end;
> > >>>> +       double latency;
> > >>>> +
> > >>>> +       init_fun();
> > >>>> +
> > >>>> +       start = rte_get_timer_cycles();
> > >>>> +
> > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > >>>> +               update_fun();
> > >>>> +
> > >>>> +       end = rte_get_timer_cycles();
> > >>>
> > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > >>
> > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > >
> > > I was thinking in another way, with 1e7 iteration, the additional
> > > barrier on precise will be amortized, and we get more _deterministic_
> > > behavior e.s.p in case if we print cycles and if we need to catch
> > > regressions.
> >
> > If you time a section of code which spends ~40000000 cycles, it doesn't
> > matter if you add or remove a few cycles at the beginning and the end.
> >
> > The rte_rdtsc_precise() is both better (more precise in the sense of
> > more serialization), and worse (because it's more costly, and thus more
> > intrusive).
> 
> We can calibrate the overhead to remove the cost.
> 
> >
> > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > doesn't matter.
> 
> Yes. In this setup and it is pretty inaccurate PER iteration. Please
> refer to the below patch to see the difference.
No, Mattias is right. The time is sampled once before the loop, then the function is executed 10 million (ITERATIONS) times in the loop, and then the time is sampled once again.
So the overhead and accuracy of the timing function is amortized across the 10 million calls to the function being measured, and becomes insignificant.
Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. the mempool_perf test.
Another detail: The for loop itself may cost a few cycles, which may not be irrelevant when measuring a function using very few cycles. If the compiler doesn't unroll the loop, it should be done manually:
        for (i = 0; i < ITERATIONS / 100; i++) {
                update_fun();
                update_fun();
                ... repeated 100 times
        }
> 
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> void (*update_fun)(void))
> 
>         end = rte_get_timer_cycles();
> 
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +       latency = ((end - start)) / ITERATIONS;
This calculation uses integer arithmetic, which will round down the resulting latency.
Please use floating point arithmetic: latency = (end - start) / (double)ITERATIONS;
> 
>         return latency;
>  }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>         printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> lvar_latency);
> 
>         return TEST_SUCCESS;
>  }
    
    
More information about the dev
mailing list