[RFC 4/4] net/af_packet: add VPP-style prefetching to receive path
Stephen Hemminger
stephen at networkplumber.org
Thu Jan 29 02:06:12 CET 2026
On Wed, 28 Jan 2026 09:30:20 -0800
Stephen Hemminger <stephen at networkplumber.org> wrote:
> Implement the single/dual/quad loop design pattern from FD.IO VPP to
> improve cache efficiency in the af_packet PMD receive path.
>
> The original implementation processes packets one at a time in a simple
> loop, which can result in cache misses when accessing frame headers and
> packet data. The new implementation:
>
> - Processes packets in batches of 4 (quad), 2 (dual), and 1 (single)
> - Prefetches next batch of frame headers while processing current batch
> - Prefetches packet data before memcpy to hide memory latency
> - Reduces loop overhead through partial unrolling
>
> Two helper functions are introduced:
> - af_packet_get_frame(): Returns frame pointer at index with wraparound
> - af_packet_rx_one(): Common per-packet processing (mbuf alloc, memcpy,
> VLAN handling, timestamp offload)
>
> The quad loop checks availability of all 4 frames before processing,
> falling through to dual/single loops when fewer frames are ready. Early
> exit paths (out_advance1/2/3) ensure correct frame index tracking when
> mbuf allocation fails mid-batch.
>
> Prefetch strategy:
> - Frame headers: prefetch N+4..N+7 while processing N..N+3
> - Packet data: prefetch at tp_mac offset before memcpy
>
> This pattern is well-established in high-performance packet processing
> and should improve throughput by better utilizing CPU cache hierarchy,
> particularly beneficial when processing bursts of packets.
>
> Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
This and the previous prefetch proposal have no impact on performance.
I rolled a simple perf test and all three versions come out the same.
The bottleneck is not here; it is probably in the system calls and copies now.
        Original      Prefetch      Quad/Dual
TX      1.427 Mpps    1.426 Mpps    1.426 Mpps
RX      0.529 Mpps    0.530 Mpps    0.533 Mpps
loss    87.93%        87.98%        88.0%
Will put the test in the next version of this series, and
drop this patch.
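
For readers unfamiliar with the quad/dual/single loop pattern described in
the quoted commit message, a minimal sketch in C follows. It is not the code
from the patch: struct frame, get_frame(), rx_one(), rx_burst() and the
PREFETCH macro are hypothetical stand-ins, and the per-packet work (mbuf
allocation, memcpy, VLAN and timestamp handling) is reduced to summing
lengths. It only illustrates the control flow: check four frames, prefetch
headers N+4..N+7, then fall through to the dual and single loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring frame header. The real PMD would
 * inspect the tp_status field of the kernel's frame header instead. */
struct frame {
    uint32_t ready;   /* nonzero: frame is owned by user space */
    uint32_t len;     /* packet length in bytes */
};

/* Prefetch hint; degrades to a no-op without the GCC/Clang builtin. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))
#endif

/* Return the frame at idx, wrapping around the ring. */
static inline struct frame *
get_frame(struct frame *ring, unsigned int ring_size, unsigned int idx)
{
    return &ring[idx % ring_size];
}

/* Placeholder per-packet work: count bytes, hand the frame back. */
static inline void
rx_one(struct frame *f, uint64_t *bytes)
{
    *bytes += f->len;
    f->ready = 0;
}

/* Receive up to max_pkts frames starting at *head; returns the count. */
static unsigned int
rx_burst(struct frame *ring, unsigned int ring_size, unsigned int *head,
         unsigned int max_pkts, uint64_t *bytes)
{
    unsigned int idx = *head, n = 0, k;

    assert(ring_size > 0);
    if (max_pkts > ring_size)   /* never visit a frame twice per burst */
        max_pkts = ring_size;

    /* Quad loop: requires all four frames to be ready. */
    while (n + 4 <= max_pkts) {
        struct frame *f[4];
        for (k = 0; k < 4; k++)
            f[k] = get_frame(ring, ring_size, idx + k);
        if (!(f[0]->ready && f[1]->ready && f[2]->ready && f[3]->ready))
            break;
        /* Prefetch headers N+4..N+7 while processing N..N+3. */
        for (k = 4; k < 8; k++)
            PREFETCH(get_frame(ring, ring_size, idx + k));
        for (k = 0; k < 4; k++)
            rx_one(f[k], bytes);
        idx += 4;
        n += 4;
    }

    /* Dual loop: handles a remaining pair of ready frames. */
    while (n + 2 <= max_pkts) {
        struct frame *f0 = get_frame(ring, ring_size, idx);
        struct frame *f1 = get_frame(ring, ring_size, idx + 1);
        if (!(f0->ready && f1->ready))
            break;
        rx_one(f0, bytes);
        rx_one(f1, bytes);
        idx += 2;
        n += 2;
    }

    /* Single loop: handles whatever is left, one frame at a time. */
    while (n < max_pkts) {
        struct frame *f0 = get_frame(ring, ring_size, idx);
        if (!f0->ready)
            break;
        rx_one(f0, bytes);
        idx++;
        n++;
    }

    *head = idx % ring_size;
    return n;
}
```

With an 8-frame ring in which 7 frames are ready, the sketch handles one
quad, one dual and one single iteration. As the numbers above show, this
restructuring only pays off when the loop itself is the bottleneck.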