[dpdk-dev] TX performance regression caused by the mbuf cachline split
Paul Emmerich
emmericp at net.in.tum.de
Tue May 12 00:32:05 CEST 2015
Paul Emmerich:
> I naively tried to move the pool pointer into the first cache line in
> the v2.0.0 tag and the performance actually decreased, I'm not yet sure
> why this happens. There are probably assumptions about the cacheline
> locations and prefetching in the code that would need to be adjusted.
This happens because the next pointer in the mbuf is touched almost
everywhere, even for mbufs with only a single segment, because it is used
to determine whether there is another segment (instead of using the
nb_segs field).
I guess a solution for me would be to use a custom layout that is
optimized for tx. I can shrink ol_flags to 32 bits and move the seqn and
hash fields to the second cache line. A quick-and-dirty test shows that
this even gives me a slightly higher performance than DPDK 1.7 in the
full-featured tx path.
This is probably going to break the vector rx/tx path, but I can't use
that anyway since I always need offloading features (timestamping and
checksums).
I'll have to see how this affects the rx path. But I value tx
performance over rx performance. My rx logic is usually very simple.
This solution is kind of ugly. I would prefer to be able to use an
unmodified version of DPDK :/
By the way, I think there is something wrong with this assumption in
commit f867492346bd271742dd34974e9cf8ac55ddb869:
> The general approach that we are looking to take is to focus the first
> cache line on fields that are updated on RX, so that receive only deals
> with one cache line.
I think this might be wrong due to the next pointer. I'll probably build
a simple rx-only benchmark in a few weeks or so. I suspect that it will
also be significantly slower. But that should be fixable.
Paul