[PATCH] net/af_packet: cache align Rx/Tx structs
Mattias Rönnblom
hofors at lysator.liu.se
Thu Apr 25 00:27:36 CEST 2024
On 2024-04-24 21:13, Stephen Hemminger wrote:
> On Wed, 24 Apr 2024 18:50:50 +0100
> Ferruh Yigit <ferruh.yigit at amd.com> wrote:
>
>>> I don't know how slow af_packet is, but if you care about performance,
>>> you don't want to use atomic add for statistics.
>>>
>>
>> There are a few soft drivers already using atomic adds for updating stats.
>> If we document expectations from 'rte_eth_stats_reset()', we can update
>> those usages.
>
> Using atomic add is lots of extra overhead. The statistics are not guaranteed
> to be perfect. If nothing else, the bytes and packets can be skewed.
>
The sad thing here is that if the counters are reset within the
load-modify-store cycle of an lcore counter update, the reset may end
up being a nop. So it's not that you miss a packet or two, or suffer
some transient inconsistency: the reset request is silently and
permanently ignored.
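To make the failure mode concrete, here is a minimal sketch of the
race (hypothetical names, not the actual PMD code):

#include <stdint.h>

/* Hypothetical per-queue counter; plain non-atomic update. */
struct rxq_stats {
	volatile uint64_t rx_pkts;
};

/* lcore, for every received packet */
static void
count_packet(struct rxq_stats *s)
{
	uint64_t tmp = s->rx_pkts;	/* load */
	tmp++;				/* modify */
	/*
	 * If a control thread does the reset (s->rx_pkts = 0) at
	 * this point, the store below writes back the stale
	 * pre-reset value, and the reset is lost.
	 */
	s->rx_pkts = tmp;		/* store */
}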
> The soft drivers af_xdp, af_packet, and tun performance is dominated by the
> overhead of the kernel system call and copies. Yes, alignment is good
> but won't be noticeable.
There aren't any syscalls in the RX path of the af_packet PMD; RX is
done against a memory-mapped PACKET_RX_RING shared with the kernel.
I added the same statistics updates as the af_packet PMD uses into a
benchmark app which consumes ~1000 cc in-between stats updates.

If the equivalent of the RX queue struct was cache aligned, the
statistics overhead was so small it was difficult to measure: less
than 3-4 cc per update. This was with volatile, but without atomics.
If the RX queue struct wasn't cache aligned, and was sized so that a
cache line was generally shared by two (neighboring) cores, the stats
incurred a cost of ~55 cc per update.
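The cure is cheap: pad/align the per-queue struct to a cache line, so
two neighboring queues' hot fields never share a line. Roughly (a
sketch with made-up fields, not the actual patch):

#include <stdint.h>
#include <rte_common.h>	/* RTE_CACHE_LINE_SIZE */

/*
 * aligned(RTE_CACHE_LINE_SIZE) both aligns the struct start and pads
 * sizeof() up to a cache line multiple, so an array of these never
 * has two elements straddling the same line. DPDK wraps this
 * attribute as __rte_cache_aligned.
 */
struct pkt_rx_queue {
	volatile uint64_t rx_pkts;
	volatile uint64_t rx_bytes;
	/* ... descriptor ring pointers, etc. ... */
} __attribute__((aligned(RTE_CACHE_LINE_SIZE)));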
Shaving off those 55 cc should translate to a performance increase of
a couple of hundred percent for an empty af_packet poll. If your lcore
has some other primary source of work than the af_packet RX queue, and
the RX queue is polled often, this may well be a noticeable gain.
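(Back-of-the-envelope, under my assumption that an empty poll itself
costs somewhere around 25 cc: 25 + 55 = 80 cc with the false sharing
versus ~25 cc without, i.e., roughly a 3x difference.)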
The benchmark was run on 16 Gracemont cores, which in my experience
seem to have somewhat shorter core-to-core latency than many other
systems, provided the remote core/cache line owner is located in the
same cluster.