[dpdk-dev] TX performance regression caused by the mbuf cachline split

Paul Emmerich emmericp at net.in.tum.de
Tue May 12 00:32:05 CEST 2015


Paul Emmerich:
> I naively tried to move the pool pointer into the first cache line in
> the v2.0.0 tag and the performance actually decreased, I'm not yet sure
> why this happens. There are probably assumptions about the cacheline
> locations and prefetching in the code that would need to be adjusted.

This happens because the next pointer in the mbuf is touched almost 
everywhere: even for mbufs with only one segment it is read to determine 
whether there is another segment (instead of relying on the nb_segs field).
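
To make this concrete, here is a rough sketch of the pattern (illustrative 
only, not actual DPDK driver code; the helper names are made up). With the 
split layout, next sits in the second cache line while nb_segs sits in the 
first:

#include <rte_mbuf.h>
#include <rte_branch_prediction.h>

/* Common pattern: walking the chain via 'next' always pulls in the
 * mbuf's second cache line, even for single-segment packets. */
static inline void
tx_free_chain_via_next(struct rte_mbuf *m)
{
        while (m != NULL) {
                struct rte_mbuf *next = m->next;  /* second cache line */
                rte_pktmbuf_free_seg(m);
                m = next;
        }
}

/* Alternative: only dereference 'next' when nb_segs (first cache line)
 * says there actually is a chain. */
static inline void
tx_free_chain_via_nb_segs(struct rte_mbuf *m)
{
        if (likely(m->nb_segs == 1)) {
                rte_pktmbuf_free_seg(m);
                return;
        }
        tx_free_chain_via_next(m);
}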

I guess a solution for me would be to use a custom mbuf layout that is 
optimized for tx: I can shrink ol_flags to 32 bits and move the seqn and 
hash fields to the second cache line. A quick-and-dirty test shows that 
this even gives me slightly higher performance than DPDK 1.7 in the 
full-featured tx path.
This will probably break the vector rx/tx path, but I can't use that 
anyway since I always need offloading features (timestamping and 
checksums).
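
Roughly, something like this (illustrative only, with a simplified field 
set and types; not a drop-in replacement for struct rte_mbuf, and the 
offsets assume a 64-byte cache line):

#include <stdint.h>
#include <rte_memory.h>
#include <rte_mempool.h>

struct tx_opt_mbuf {
        /* first cache line: everything the full-featured tx path touches */
        void *buf_addr;            /* virtual address of the data buffer   */
        uint64_t buf_physaddr;     /* physical address of the data buffer  */
        uint16_t buf_len;
        uint16_t data_off;
        uint16_t refcnt;
        uint8_t nb_segs;
        uint8_t port;
        uint32_t ol_flags;         /* shrunk from 64 to 32 bits            */
        uint32_t pkt_len;
        uint16_t data_len;
        uint16_t vlan_tci;
        uint64_t tx_offload;       /* l2/l3/l4 lengths for checksum offload */
        struct rte_mempool *pool;  /* kept here so freeing stays in line 0 */
        struct tx_opt_mbuf *next;  /* first cache line ends here (64 bytes
                                    * including alignment padding)         */

        /* second cache line: rx-only / rarely used fields */
        uint32_t packet_type;
        uint32_t rss_hash;         /* 'hash' moved out of the first line   */
        uint32_t seqn;             /* moved out of the first line          */
        uint64_t udata64;
} __rte_cache_aligned;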

I'll have to see how this affects the rx path. But I value tx 
performance over rx performance. My rx logic is usually very simple.

This solution is kind of ugly. I would prefer to be able to use an 
unmodified version of DPDK :/


By the way, I think there is something wrong with this assumption in 
commit f867492346bd271742dd34974e9cf8ac55ddb869:
> The general approach that we are looking to take is to focus the first
> cache line on fields that are updated on RX , so that receive only deals
> with one cache line.

I think this might be wrong because of the next pointer. I'll probably 
build a simple rx-only benchmark in a few weeks or so; I suspect it will 
also be significantly slower, but that should be fixable.
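
Something along these lines should do as a measurement loop (just a 
sketch; it assumes the port and queue are configured and started 
elsewhere, and rx_only_loop/BURST_SIZE are made-up names):

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_cycles.h>

#define BURST_SIZE 64

/* Receive and immediately free packets for a fixed number of TSC cycles;
 * returns the packet count. The interesting part is how many mbuf cache
 * lines get touched per packet on this path. */
static uint64_t
rx_only_loop(uint8_t port_id, uint16_t queue_id, uint64_t duration_tsc)
{
        struct rte_mbuf *bufs[BURST_SIZE];
        uint64_t rx_packets = 0;
        const uint64_t end = rte_rdtsc() + duration_tsc;

        while (rte_rdtsc() < end) {
                const uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                                        bufs, BURST_SIZE);
                rx_packets += nb_rx;
                for (uint16_t i = 0; i < nb_rx; i++)
                        /* free walks m->next, i.e. the second cache line */
                        rte_pktmbuf_free(bufs[i]);
        }
        return rx_packets;
}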


Paul

