[PATCH] event/eth_tx: prefetch mbuf headers
    Mattias Rönnblom 
    hofors at lysator.liu.se
       
    Fri Jul 11 14:44:07 CEST 2025
    
    
  
On 2025-07-10 17:37, Stephen Hemminger wrote:
> On Fri, 28 Mar 2025 06:43:39 +0100
> Mattias Rönnblom <mattias.ronnblom at ericsson.com> wrote:
> 
>> Prefetch mbuf headers, resulting in ~10% throughput improvement when
>> the Ethernet RX and TX Adapters are hosted on the same core (likely
>> ~2x in case a dedicated TX core is used).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
>> Tested-by: Peter Nilsson <peter.j.nilsson at ericsson.com>
> 
> Prefetching all the mbufs can be counter productive on a big burst.
> 
For the non-vector case, the burst is no larger than 32. According to
publicly available information, Skylake has 72 load queue entries. What
the number is on newer microarchitecture generations, I don't know. So
32 is a lot of prefetches, but likely still fewer than the load queue
can hold.

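To make the non-vector case concrete, here is a minimal sketch of the
"prefetch every header in the burst, then process" approach. It is not
the code from the patch: struct pkt_hdr, MAX_BURST and burst_total_len()
are hypothetical names, and __builtin_prefetch() (a GCC/Clang builtin)
stands in for DPDK's rte_prefetch0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BURST 32 /* assumed non-vector burst limit */

/* Hypothetical stand-in for an mbuf header; not struct rte_mbuf. */
struct pkt_hdr {
	uint16_t port;
	uint16_t queue;
	uint32_t len;
};

/* Prefetch all headers in the burst up front, then read them. */
static uint64_t
burst_total_len(struct pkt_hdr *const *pkts, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(n <= MAX_BURST);

	/* At most MAX_BURST read prefetches are in flight at once. */
	for (i = 0; i < n; i++)
		__builtin_prefetch(pkts[i], 0, 3);

	/* By now, later headers have had time to arrive in cache. */
	for (i = 0; i < n; i++)
		total += pkts[i]->len;

	return total;
}
```
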
> VPP does something similar but more unrolled.
> See https://fd.io/docs/vpp/v2101/gettingstarted/developers/vnet.html#single-dual-loops
This pattern makes sense if the do_something_to() function has
non-trivial latency.

If it doesn't, which I suspect is the case for the TX adapter, you
will issue 4 prefetches, of which some or even all aren't resolved
before the core needs the data. Repeat.

Also - and I'm guessing now - the do_something_to() equivalent in the TX
adapter case is likely not allocating a lot of load buffer entries, so
there is little risk of the prefetches being discarded.

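For comparison, the look-ahead pattern discussed above can be sketched
roughly as follows. This is an illustration only, not VPP's actual
code: LOOKAHEAD, struct pkt_hdr and process_with_lookahead() are
hypothetical, do_something_to() is a trivial placeholder, and
__builtin_prefetch() is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOOKAHEAD 4 /* assumed prefetch distance, in packets */

/* Hypothetical stand-in for an mbuf header. */
struct pkt_hdr {
	uint32_t len;
};

/* Placeholder for the per-packet work; deliberately cheap. */
static uint32_t
do_something_to(const struct pkt_hdr *p)
{
	return p->len;
}

static uint64_t
process_with_lookahead(struct pkt_hdr *const *pkts, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(pkts != NULL || n == 0);

	/* Warm up: prefetch the first LOOKAHEAD headers. */
	for (i = 0; i < LOOKAHEAD && i < n; i++)
		__builtin_prefetch(pkts[i], 0, 3);

	for (i = 0; i < n; i++) {
		/* Prefetch LOOKAHEAD packets ahead of the current one. */
		if (i + LOOKAHEAD < n)
			__builtin_prefetch(pkts[i + LOOKAHEAD], 0, 3);

		total += do_something_to(pkts[i]);
	}

	return total;
}
```

With a per-packet step this cheap, a prefetch issued only LOOKAHEAD
iterations ahead may not have completed by the time the header is
read, which is the concern raised above.
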
That said, I'm sure you can tweak non-vector TXA prefetching to further
improve performance. For example, there may be little point in
prefetching the first few mbuf headers, since you will need that data
very soon indeed.

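A rough sketch of that tweak, assuming a hypothetical SKIP_FIRST
threshold that would have to be chosen by measurement (the function
name is made up as well, and __builtin_prefetch() is the GCC/Clang
builtin):

```c
#include <assert.h>
#include <stddef.h>

#define SKIP_FIRST 2 /* hypothetical; needs tuning on real hardware */

/*
 * Prefetch every header except the first SKIP_FIRST ones, which are
 * needed too soon for a prefetch to help. Returns the number of
 * prefetches issued.
 */
static size_t
prefetch_skipping_head(void *const *hdrs, size_t n)
{
	size_t issued = 0;
	size_t i;

	assert(hdrs != NULL || n == 0);

	for (i = SKIP_FIRST; i < n; i++) {
		__builtin_prefetch(hdrs[i], 0, 3);
		issued++;
	}

	return issued;
}
```
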
I no longer have the setup to further refine this patch. I suggest we
live with only a ~20% performance gain at this point.

For the vector case, I agree this loop may result in too many prefetches.
I can remove prefetching from the vector case, to maintain legacy 
performance. I could also cap the number of prefetches (e.g., to 32).
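
Capping could look roughly like the sketch below. MAX_PREFETCH and
prefetch_capped() are hypothetical names, the value 32 is just the
example figure from above, and __builtin_prefetch() is the GCC/Clang
builtin standing in for rte_prefetch0().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PREFETCH 32 /* example cap; not a measured optimum */

/*
 * Prefetch at most MAX_PREFETCH headers from the start of the
 * vector. Returns the number of prefetches issued.
 */
static size_t
prefetch_capped(void *const *hdrs, size_t n)
{
	size_t limit = n < MAX_PREFETCH ? n : MAX_PREFETCH;
	size_t i;

	assert(hdrs != NULL || n == 0);

	for (i = 0; i < limit; i++)
		__builtin_prefetch(hdrs[i], 0, 3);

	return limit;
}
```
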
    
    