AF_XDP performance

Stephen Hemminger stephen at networkplumber.org
Wed May 24 20:29:19 CEST 2023


On Wed, 24 May 2023 17:36:32 +0100
Bruce Richardson <bruce.richardson at intel.com> wrote:

> On Wed, May 24, 2023 at 01:32:17PM +0100, Alireza Sanaee wrote:
> > Hi everyone,
> > 
> > I was looking at this deck of slides
> > https://www.dpdk.org/wp-content/uploads/sites/35/2020/11/XDP_ZC_PMD-1.pdf
> > 
> > I tried to reproduce the results with the testpmd application. I am
> > working with BlueField 2 NIC and I could sustain ~10Mpps with testpmd
> > with AF_XDP, and about 20Mpps without AF_XDP on the RX drop experiment. I
> > was wondering why AF_XDP is so much slower compared to the PCIe
> > scenario, given that both cases are zero-copy. Is it because of the
> > frame size?
> >   
> While I can't claim to explain all the differences, in short I believe the
> AF_XDP version is just doing more work. With a native DPDK driver, the
> driver takes the packet descriptors directly from the NIC RX ring and uses
> the metadata to construct a packet mbuf, which is returned to the
> application.
> 
> With AF_XDP, however, the NIC descriptor ring is not directly accessible by
> the app. Therefore the processing is (AFAIK):
> * kernel reads NIC descriptor ring and processes descriptor
> * kernel calls BPF program for the received packets to determine what
>   action to take, e.g. forward to socket
> * kernel writes an AF_XDP descriptor to the AF_XDP socket RX ring
> * the DPDK AF_XDP PMD reads the AF_XDP ring entry written by the kernel
>   and creates a DPDK mbuf to return to the application.
> 
> There are also other considerations around potential cache locality of
> descriptors too that could affect things, but I would expect the extra
> descriptor processing work outlined above probably explains most of the
> difference.
> 
> Regards,
> /Bruce

There is also a context switch from the kernel polling thread to the DPDK polling
thread to consider, plus the overhead of running the BPF program. The context
switches mean that both the instruction and data caches are likely to take many
misses. Remember that on modern processors the limiting factor is usually memory
performance from caching, not the number of instructions.
