[PATCH v4 3/7] eal: add lcore variable performance test

Mattias Rönnblom hofors at lysator.liu.se
Mon Sep 16 18:12:55 CEST 2024


On 2024-09-16 13:54, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors at lysator.liu.se]
>> Sent: Monday, 16 September 2024 13.13
>>
>> On 2024-09-16 12:52, Mattias Rönnblom wrote:
>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>> that the overhead isn't significantly greater than alternative
>>> approaches, in scenarios where the benefits aren't expected to show up
>>> (i.e., when plenty of cache is available compared to the working set
>>> size of the per-lcore data).
>>>
>>
>> Here are some test results for a Raptor Cove @ 3,2 GHz (GCC 11):
>>
>>    + ------------------------------------------------------- +
>>    + Test Suite : lcore variable perf autotest
>>    + ------------------------------------------------------- +
>> Latencies [TSC cycles/update]
>> Modules/Variables  Static array  Thread-local Storage  Lcore variables
>>                   1           3.9           5.5              3.7
>>                   2           3.8           5.5              3.8
>>                   4           4.9           5.5              3.7
>>                   8           3.8           5.5              3.8
>>                  16          11.3           5.5              3.7
>>                  32          20.9           5.5              3.7
>>                  64          23.5           5.5              3.7
>>                 128          23.2           5.5              3.7
>>                 256          23.5           5.5              3.7
>>                 512          24.1           5.5              3.7
>>                1024          25.3           5.5              3.9
>>    + TestCase [ 0] : test_lcore_var_access succeeded
>>    + ------------------------------------------------------- +
>>
>>
>> The reason for TLS being slower than lcore variables (which in turn
>> rely on TLS for lcore id lookup) is the lazy initialization
>> conditional that is imposed on the TLS variant. Could that be avoided
>> (which is module-dependent, I suppose), TLS beats lcore variables at
>> ~3.0 cycles/update.
> 
> I think you should not assume lazy initialization of TLS in your benchmark.
> Our application uses TLS, and when spinning up a new thread, we call a per-lcore init function of each module before calling the per-lcore run function. This design pattern is also described in Figure 1.4 [1] in the Programmer's Guide.
> 
> [1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
> 

Per-lcore init functions may or may not be an option, depending on
what API you need to adhere to. But maybe I should add a non-lazy TLS
variant as well.

I should probably add some information on lcore variables in the EAL 
programmer's guide as well.

Non-lazy TLS would be a more viable option if there were proper 
framework support for it. Now, I'm not sure there is a better way to do 
it in a DPDK library than how it's done for tracing, where there's an 
explicit call per thread created. Other DPDK-internal users of 
RTE_PER_LCORE seem to depend on lazy initialization.
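
To make the lazy versus non-lazy distinction concrete, here is a minimal
sketch in plain C11. It is not the benchmark code from the patch: it uses
_Thread_local directly instead of RTE_PER_LCORE, and struct mod_state, the
function names and the start value are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread module state; names are illustrative only. */
struct mod_state {
	uint64_t counter;
	bool initialized;
};

static _Thread_local struct mod_state lazy_state;
static _Thread_local struct mod_state eager_state;

/* Lazy variant: every update pays for the "initialized?" branch. */
static inline void
lazy_update(void)
{
	if (!lazy_state.initialized) {
		lazy_state.counter = 100; /* arbitrary start value */
		lazy_state.initialized = true;
	}
	lazy_state.counter++;
}

/* Non-lazy variant: must be called once per thread before any update. */
static void
eager_thread_init(void)
{
	eager_state.counter = 100; /* same arbitrary start value */
	eager_state.initialized = true;
}

/* With explicit per-thread init, the update path has no conditional. */
static inline void
eager_update(void)
{
	eager_state.counter++;
}

/* Accessors, so the two variants can be compared from a test. */
static uint64_t lazy_counter(void) { return lazy_state.counter; }
static uint64_t eager_counter(void) { return eager_state.counter; }
```

The non-lazy variant only works if something guarantees that
eager_thread_init() runs on every thread before the first update, which
is exactly the framework support in question.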

>>
>> I must say I'm surprised to see lcore variables doing this well, at
>> these very modest working set sizes. Probably, you can stay at near-zero
>> L1 misses with lcore variables (and TLS), but start missing the L1 with
>> static arrays.
> 
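
A rough way to reason about the working set is to count the cache lines
one lcore touches under each layout. The sketch below is only
back-of-the-envelope arithmetic under stated assumptions: a 64-byte
cache line, a static array whose per-lcore elements are each padded to
whole cache lines, and a packed layout that starts on a cache line
boundary. The function names are hypothetical, and the sketch does not
model associativity or prefetching, so it is not meant to explain the
exact figures in the table.

```c
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/*
 * Cache lines touched by one lcore when every module keeps its own
 * array with one cache-line-padded element per lcore.
 */
static size_t
lines_static_array(size_t n_modules, size_t var_size)
{
	size_t lines_per_elem = (var_size + CACHE_LINE - 1) / CACHE_LINE;

	return n_modules * lines_per_elem;
}

/*
 * Cache lines touched by one lcore when all modules' values for that
 * lcore are packed back to back (assumed to start cache-line aligned).
 */
static size_t
lines_packed(size_t n_modules, size_t var_size)
{
	return (n_modules * var_size + CACHE_LINE - 1) / CACHE_LINE;
}
```

With an assumed 8-byte variable per module, 64 modules come out as 64
cache lines for the padded static arrays and 8 cache lines for the
packed layout.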

