[PATCH v4 3/7] eal: add lcore variable performance test
Mattias Rönnblom
hofors at lysator.liu.se
Mon Sep 16 18:12:55 CEST 2024
On 2024-09-16 13:54, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors at lysator.liu.se]
>> Sent: Monday, 16 September 2024 13.13
>>
>> On 2024-09-16 12:52, Mattias Rönnblom wrote:
>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>> that the overhead isn't significantly greater than alternative
>>> approaches, in scenarios where the benefits aren't expected to show up
>>> (i.e., when plenty of cache is available compared to the working set
>>> size of the per-lcore data).
>>>
>>
>> Here are some test results for a Raptor Cove @ 3,2 GHz (GCC 11):
>>
>> + ------------------------------------------------------- +
>> + Test Suite : lcore variable perf autotest
>> + ------------------------------------------------------- +
>> Latencies [TSC cycles/update]
>> Modules/Variables  Static array  Thread-local Storage  Lcore variables
>>                 1           3.9                   5.5              3.7
>>                 2           3.8                   5.5              3.8
>>                 4           4.9                   5.5              3.7
>>                 8           3.8                   5.5              3.8
>>                16          11.3                   5.5              3.7
>>                32          20.9                   5.5              3.7
>>                64          23.5                   5.5              3.7
>>               128          23.2                   5.5              3.7
>>               256          23.5                   5.5              3.7
>>               512          24.1                   5.5              3.7
>>              1024          25.3                   5.5              3.9
>> + TestCase [ 0] : test_lcore_var_access succeeded
>> + ------------------------------------------------------- +
>>
>>
>> The reason for TLS being slower than lcore variables (which in turn
>> rely on TLS for lcore id lookup) is the lazy initialization
>> conditional that is imposed on the TLS variant. If that could be
>> avoided (which is module-dependent, I suppose), TLS beats lcore
>> variables at ~3.0 cycles/update.
>
> I think you should not assume lazy initialization of TLS in your benchmark.
> Our application uses TLS, and when spinning up a new thread, we call a per-lcore init function of each module before calling the per-lcore run function. This design pattern is also described in Figure 1.4 [1] in the Programmer's Guide.
>
> [1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
>
Per-lcore init functions may or may not be an option, depending on what
API you need to adhere to. But maybe I should add a non-lazy TLS
variant as well.
I should probably add some information on lcore variables in the EAL
programmer's guide as well.
Non-lazy TLS would be a more viable option if there were proper
framework support for it. As it stands, I'm not sure there is a better
way to do it in a DPDK library than how it's done for tracing, where
there's an explicit call per thread created. Other DPDK-internal users
of RTE_PER_LCORE seem to depend on lazy initialization.
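
To make the lazy versus non-lazy distinction concrete, here is a minimal
sketch in plain C11. It is not the benchmark code from the patch; the
struct and function names (mod_state, lazy_update, eager_init,
eager_update) are made up for illustration, and _Thread_local stands in
for RTE_PER_LCORE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread module state; all names are illustrative. */
struct mod_state {
	bool initialized;
	uint64_t counter;
};

static _Thread_local struct mod_state lazy_state;
static _Thread_local struct mod_state eager_state;

/* Lazy variant: every update pays for the "initialized?" conditional. */
static inline uint64_t
lazy_update(void)
{
	if (!lazy_state.initialized) {
		lazy_state.counter = 0;
		lazy_state.initialized = true;
	}
	return ++lazy_state.counter;
}

/* Non-lazy variant: each thread must call this once before any update. */
static inline void
eager_init(void)
{
	eager_state.counter = 0;
	eager_state.initialized = true;
}

/* No conditional on the fast path; correctness relies on eager_init(). */
static inline uint64_t
eager_update(void)
{
	return ++eager_state.counter;
}
```

The non-lazy fast path is shorter, but only works if something
guarantees that eager_init() runs on every thread that later calls
eager_update(), which is the framework support in question.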
>>
>> I must say I'm surprised to see lcore variables doing this well, at
>> these very modest working set sizes. Probably, you can stay at near-zero
>> L1 misses with lcore variables (and TLS), but start missing the L1 with
>> static arrays.
>
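
As a rough illustration of why the static array variant might put more
pressure on the L1, the sketch below counts the cache lines a single
lcore touches when updating one value per module. It assumes 64-byte
cache lines, one 8-byte value per module, and a cache-line-aligned
per-lcore slot in the static array case; none of these figures are taken
from the benchmark, and the helper names are hypothetical. It does not
model cache associativity or prefetching.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Static-array style: one cache-line-aligned slot per lcore, per module. */
struct padded_slot {
	_Alignas(CACHE_LINE) uint64_t value;
};

/* Cache lines one lcore touches when updating n_modules padded slots. */
static size_t
lines_static_array(size_t n_modules)
{
	return n_modules * (sizeof(struct padded_slot) / CACHE_LINE);
}

/* Cache lines touched if the same values are packed back to back. */
static size_t
lines_packed(size_t n_modules)
{
	size_t bytes = n_modules * sizeof(uint64_t);

	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}
```

Under these assumptions, 1024 modules come to 1024 cache lines per lcore
with padded slots, against 128 when the values are packed.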