[dpdk-dev] TX performance regression caused by the mbuf cachline split

Paul Emmerich emmericp at net.in.tum.de
Mon May 11 02:14:58 CEST 2015


Hi,

this is a follow-up to my post from 3 weeks ago [1]. I'm starting a new 
thread here since I now got a completely new test setup for improved 
reproducibility.

Background for anyone who didn't catch my last post:
I'm investigating a performance regression in my packet generator [2] 
that appeared when I tried to upgrade from DPDK 1.7.1 to 1.8 or 2.0; 
DPDK 1.7.1 is about 25% faster than 2.0 in my application.
I suspected that this is due to the new two-cacheline mbufs, which I 
have now confirmed with a bisect.

My old test setup was based on the l2fwd example, required an external 
packet generator, and was therefore kind of hard to reproduce.

I built a simple tx benchmark application that just sends nonsensical 
packets containing a sequence number as fast as possible on two ports 
from a single core. You can download the benchmark app at [3].
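
For reference, here is a rough sketch of the kind of tx loop the 
benchmark runs. This is not the actual code from [3]; it assumes the 
ports, tx queues and the mempool have already been set up elsewhere:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 64
#define PKT_SIZE   60

static void
tx_loop(struct rte_mempool *mp, uint8_t port, uint16_t queue)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    uint32_t seq = 0;

    for (;;) {
        uint16_t i, sent;

        for (i = 0; i < BURST_SIZE; i++) {
            struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
            if (m == NULL)
                break;
            m->data_len = PKT_SIZE;
            m->pkt_len  = PKT_SIZE;
            /* nonsensical payload: just a sequence number */
            *rte_pktmbuf_mtod(m, uint32_t *) = seq++;
            bufs[i] = m;
        }
        sent = rte_eth_tx_burst(port, queue, bufs, i);
        /* free whatever the driver did not accept */
        while (sent < i)
            rte_pktmbuf_free(bufs[sent++]);
    }
}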

Hardware setup:
CPU: E5-2620 v3 underclocked to 1.2 GHz
RAM: 4x 8 GB 1866 MHz DDR4 memory
NIC: X540-T2


Baseline test results:

DPDK   simple tx  full-featured tx
1.7.1  14.1 Mpps  10.7 Mpps
2.0.0  11.0 Mpps   9.3 Mpps

DPDK 1.7.1 is 28%/15% faster than 2.0 with simple/full-featured tx in 
this benchmark.


I then did a few runs of git bisect to identify commits that caused a 
significant drop in performance. You can find the script that I used to 
quickly test the performance of a version at [4].


Commit                                    simple  full-featured
7869536f3f8edace05043be6f322b835702b201c  13.9    10.4
mbuf: flatten struct vlan_macip

The commit log explains that there is a perf regression and that it 
cannot be avoided if the struct is to stay future-compatible. The log 
claims < 5%, which is consistent with my test results (the old code is 
4% faster). I guess that is okay and unavoidable.


Commit                                    simple  full-featured
08b563ffb19d8baf59dd84200f25bc85031d18a7  12.8    10.4
mbuf: replace data pointer by an offset

This one affects the simple tx path significantly.
The regression is probably just caused by the (temporarily) disabled 
vector tx code that is mentioned in the commit log, so I did not 
investigate it further.
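
For context, the change boils down to how the payload address is 
computed. Roughly (a sketch, not the exact macros from the tree):

/* DPDK <= 1.7: payload reached through an absolute pointer in the mbuf */
#define MTOD_OLD(m, t)  ((t)((m)->pkt.data))

/* DPDK >= 1.8: payload reached as an offset from the buffer start */
#define MTOD_NEW(m, t)  ((t)((char *)(m)->buf_addr + (m)->data_off))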



Commit                                    simple  full-featured
f867492346bd271742dd34974e9cf8ac55ddb869  10.7    9.1
mbuf: split mbuf across two cache lines.

This one is the real culprit.
The commit log does not mention any performance evaluation, and a quick 
scan of the mailing list doesn't reveal any either.

It looks like the main problem for tx is that the mempool pointer is in 
the second cacheline.
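
A quick way to see this is to print the field offsets against a 2.0.0 
tree; everything at offset >= 64 lives on the second cacheline. (The 
snippet below needs the usual DPDK include paths and config to build.)

#include <stdio.h>
#include <stddef.h>
#include <rte_mbuf.h>

int main(void)
{
    printf("sizeof(struct rte_mbuf) = %zu\n", sizeof(struct rte_mbuf));
    printf("ol_flags   at offset %zu\n", offsetof(struct rte_mbuf, ol_flags));
    printf("seqn       at offset %zu\n", offsetof(struct rte_mbuf, seqn));
    printf("pool       at offset %zu\n", offsetof(struct rte_mbuf, pool));
    printf("next       at offset %zu\n", offsetof(struct rte_mbuf, next));
    printf("tx_offload at offset %zu\n", offsetof(struct rte_mbuf, tx_offload));
    return 0;
}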

I think the new mbuf structure is too bloated: it forces you to pay for 
features that you don't need or don't want. I understand that it needs 
to support all possible filters and offload features, but it's kind of 
hard to justify a 25% performance difference for a framework that sets 
performance above everything (does it? I picked that up from the 
discussion in the "Beyond DPDK 2.0" thread).

I've counted 56 bytes in use in the first cacheline in v2.0.0.

Would it be possible to move the pool pointer and tx offload fields to 
the first cacheline?

We would just need to free up 8 bytes. One candidate is the seqn field: 
does it really have to be in the first cache line? Another is the size 
of the ol_flags field: do we really need 64 flags? Sharing bits between 
rx and tx worked fine before.
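
To make the arithmetic concrete, here is a purely hypothetical layout 
(not a patch; field sizes reconstructed from rte_mbuf.h in 2.0.0, with 
the hash union replaced by a plain uint64_t stand-in). Dropping seqn 
(4 bytes) and going back to 32 bits of ol_flags (another 4 bytes) frees 
the 8 bytes that, together with the 8 bytes currently unused, make room 
for the pool pointer and tx_offload:

#include <stdint.h>

struct rte_mempool;                  /* opaque here */

struct mbuf_hot_part_sketch {
    void     *buf_addr;              /*  8 */
    uint64_t  buf_physaddr;          /*  8 */
    uint16_t  buf_len;               /*  2 */
    uint16_t  data_off;              /*  2 */
    uint16_t  refcnt;                /*  2 */
    uint8_t   nb_segs;               /*  1 */
    uint8_t   port;                  /*  1 */
    uint32_t  ol_flags;              /*  4, was 8: shared rx/tx bits as in 1.7 */
    uint32_t  packet_type;           /*  4 */
    uint32_t  pkt_len;               /*  4 */
    uint16_t  data_len;              /*  2 */
    uint16_t  vlan_tci;              /*  2 */
    uint64_t  hash;                  /*  8, stand-in for the hash union */
    struct rte_mempool *pool;        /*  8, pulled up from the 2nd cacheline */
    uint64_t  tx_offload;            /*  8, pulled up from the 2nd cacheline */
};                                   /* = 64 bytes, seqn moved out */

_Static_assert(sizeof(struct mbuf_hot_part_sketch) == 64,
               "hot tx fields no longer fit into one cache line");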


I naively tried to move the pool pointer into the first cache line on 
the v2.0.0 tag and the performance actually decreased; I'm not yet sure 
why. There are probably assumptions about cacheline locations and 
prefetching in the code that would need to be adjusted as well.


Another possible solution would be a more dynamic approach to mbufs: the 
mbuf struct could be made configurable to fit the requirements of the 
application. This would probably require code generation or a lot of 
ugly preprocessor hacks and add a lot of complexity to the code.
The question is whether DPDK really values performance above everything 
else.
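
Just to make the preprocessor variant concrete, a very rough and purely 
hypothetical sketch (none of these names or switches exist in DPDK):

#include <stdint.h>

struct rte_mempool;

struct mbuf_configurable_sketch {
    void     *buf_addr;
    uint64_t  buf_physaddr;
    uint32_t  pkt_len;
    uint16_t  data_len;
    uint64_t  ol_flags;
#ifdef MBUF_WANT_SEQN
    uint32_t  seqn;                  /* only apps that reorder pay for this */
#endif
#ifdef MBUF_WANT_TX_OFFLOAD
    uint64_t  tx_offload;            /* only apps using checksum/TSO pay for this */
#endif
    struct rte_mempool *pool;
    struct mbuf_configurable_sketch *next;
};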


Paul


P.S.: I'm kind of disappointed by the lack of performance regression 
tests. I think such tests should be an integral part of a framework 
with the explicit goal of being fast. For example, the main page at 
dpdk.org claims a performance of "usually less than 80 cycles" for an 
rx or tx operation. This claim is no longer true :(
Touching the layout of a core data structure like the mbuf shouldn't be 
done without carefully evaluating the performance impact.
But this discussion probably belongs in the "Beyond DPDK 2.0" thread.


P.P.S.: Benchmarking an rx-only application (e.g. traffic analysis) 
would also be interesting, but that's not really on my todo list right 
now. Mixed rx/tx workloads like forwarding are also affected, as 
discussed in my last thread [1].

[1] http://dpdk.org/ml/archives/dev/2015-April/016921.html
[2] https://github.com/emmericp/MoonGen
[3] https://github.com/emmericp/dpdk-tx-performance
[4] https://gist.github.com/emmericp/02c5885908c3cb5ac5b7

