[RFC 4/4] net/af_packet: add VPP-style prefetching to receive path

Stephen Hemminger stephen at networkplumber.org
Thu Jan 29 02:06:12 CET 2026


On Wed, 28 Jan 2026 09:30:20 -0800
Stephen Hemminger <stephen at networkplumber.org> wrote:

> Implement the single/dual/quad loop design pattern from FD.IO VPP to
> improve cache efficiency in the af_packet PMD receive path.
> 
> The original implementation processes packets one at a time in a simple
> loop, which can result in cache misses when accessing frame headers and
> packet data. The new implementation:
> 
> - Processes packets in batches of 4 (quad), 2 (dual), and 1 (single)
> - Prefetches next batch of frame headers while processing current batch
> - Prefetches packet data before memcpy to hide memory latency
> - Reduces loop overhead through partial unrolling
> 
> Two helper functions are introduced:
> - af_packet_get_frame(): Returns frame pointer at index with wraparound
> - af_packet_rx_one(): Common per-packet processing (mbuf alloc, memcpy,
>   VLAN handling, timestamp offload)
> 
> The quad loop checks availability of all 4 frames before processing,
> falling through to dual/single loops when fewer frames are ready. Early
> exit paths (out_advance1/2/3) ensure correct frame index tracking when
> mbuf allocation fails mid-batch.
> 
> Prefetch strategy:
> - Frame headers: prefetch N+4..N+7 while processing N..N+3
> - Packet data: prefetch at tp_mac offset before memcpy
> 
> This pattern is well-established in high-performance packet processing
> and should improve throughput by better utilizing CPU cache hierarchy,
> particularly beneficial when processing bursts of packets.
> 
> Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
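
For readers who have not seen the pattern, the quad/dual/single structure
described in the quoted text can be sketched roughly as below. This is a
minimal, self-contained illustration and not the code from the patch:
struct frame, FRAME_READY, RING_SIZE, get_frame(), rx_one() and rx_burst()
are hypothetical stand-ins (for the frame header, a "frame owned by user
space" status bit, the ring size, af_packet_get_frame() and
af_packet_rx_one() respectively). Mbuf allocation, the memcpy, the packet
data prefetch and the out_advance paths are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE   8u  /* hypothetical ring size; must be a power of two */
#define FRAME_READY 1u  /* hypothetical "frame owned by user space" bit */

struct frame {          /* hypothetical stand-in for a ring frame header */
    uint32_t status;
    uint32_t len;
};

/* Return the frame at idx, wrapping around the ring. */
static inline struct frame *get_frame(struct frame *ring, unsigned idx)
{
    return &ring[idx & (RING_SIZE - 1)];
}

static inline void prefetch(const void *p)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#else
    (void)p;            /* no-op on compilers without the builtin */
#endif
}

/* Per-packet work: here only count bytes and hand the frame back. */
static inline void rx_one(struct frame *f, uint64_t *bytes)
{
    *bytes += f->len;
    f->status = 0;
}

/* Receive up to max frames starting at *head; returns the number consumed. */
static unsigned rx_burst(struct frame *ring, unsigned *head,
                         unsigned max, uint64_t *bytes)
{
    unsigned idx = *head, n = 0;

    assert((RING_SIZE & (RING_SIZE - 1)) == 0);

    /* Quad loop: only entered while all four frames are ready. */
    while (max - n >= 4) {
        struct frame *f0 = get_frame(ring, idx);
        struct frame *f1 = get_frame(ring, idx + 1);
        struct frame *f2 = get_frame(ring, idx + 2);
        struct frame *f3 = get_frame(ring, idx + 3);

        if (!(f0->status & f1->status & f2->status & f3->status & FRAME_READY))
            break;

        /* Prefetch headers N+4..N+7 while processing N..N+3. */
        for (unsigned k = 4; k < 8; k++)
            prefetch(get_frame(ring, idx + k));

        rx_one(f0, bytes);
        rx_one(f1, bytes);
        rx_one(f2, bytes);
        rx_one(f3, bytes);
        idx += 4;
        n += 4;
    }

    /* Dual loop: fewer than four ready, try pairs. */
    while (max - n >= 2) {
        struct frame *f0 = get_frame(ring, idx);
        struct frame *f1 = get_frame(ring, idx + 1);

        if (!(f0->status & f1->status & FRAME_READY))
            break;
        rx_one(f0, bytes);
        rx_one(f1, bytes);
        idx += 2;
        n += 2;
    }

    /* Single loop: drain whatever is left one frame at a time. */
    while (n < max) {
        struct frame *f0 = get_frame(ring, idx);

        if (!(f0->status & FRAME_READY))
            break;
        rx_one(f0, bytes);
        idx++;
        n++;
    }

    *head = idx & (RING_SIZE - 1);
    return n;
}
```

With seven ready frames in an eight-entry ring, this takes one pass through
the quad loop, one through the dual loop and one through the single loop.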


Neither this patch nor the previous prefetch proposal has any measurable
impact on performance. I rolled a simple perf test and all three versions
come out the same. The bottleneck is not here; it is probably in the
system calls and copies now.

         Original     Prefetch     Quad/Dual
TX       1.427 Mpps   1.426 Mpps   1.426 Mpps
RX       0.529 Mpps   0.530 Mpps   0.533 Mpps
 loss    87.93%       87.98%       88.0%


Will put the test in the next version of this series, and
drop this patch.
