[dpdk-dev] LLC miss in librte_distributor

jigsaw jigsaw at gmail.com
Tue Nov 11 16:37:52 CET 2014

Hi Bruce,

I noticed that librte_distributor has a quite severe LLC miss problem when
running on 16 cores.
While on 8 cores, there's no such problem.
The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
logical cores on 2 sockets.

The test case is the distributor_perf_autotest, i.e.
in app/test/test_distributor_perf.c.
The test result is collected by command:

perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test -cff -n2 --no-huge

Note that the test results show that with or without hugepages, the LLC miss
rate remains the same. So I will just show the --no-huge config.

With 8 cores, the LLC miss rate is OK:

LLC-load-misses  26750
LLC-loads  93979233
LLC-store-misses  432263
LLC-stores  69954746

That is a 0.028% load miss rate and a 0.62% store miss rate.

With 16 cores, the LLC miss rate is very high:

LLC-load-misses  70263520
LLC-loads  143807657
LLC-store-misses  23115990
LLC-stores  63692854

That is a 48.9% load miss rate and a 36.3% store miss rate.
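
For reference, the percentages are just misses divided by total accesses,
taken from the perf counters above. A minimal sketch of that arithmetic
follows; the helper name miss_rate_pct is made up for illustration and is
not part of DPDK or perf.

```c
#include <stdint.h>

/*
 * Miss rate in percent: misses / accesses * 100.
 * miss_rate_pct is a hypothetical helper, not a DPDK or perf API.
 * Returns 0.0 when there were no accesses, to avoid dividing by zero.
 */
double miss_rate_pct(uint64_t misses, uint64_t accesses)
{
	if (accesses == 0)
		return 0.0;
	return 100.0 * (double)misses / (double)accesses;
}
```

Feeding it the 8-core load counters (26750 misses out of 93979233 loads)
gives about 0.028, and the 16-core load counters (70263520 out of
143807657) give about 48.9.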

Most of the load misses happen at the first line of rte_distributor_poll_pkt.
Most of the store misses happen at ... I don't know, because perf record on
LLC-store-misses brings down my machine.

It's not obvious to me how this could happen: 8 cores are fine, but
16 cores are very bad.
My guess is that 16 cores bring in more QPI transactions between sockets?
Or that 16 cores produce a different LLC access pattern?

So I tried to reduce the padding inside union rte_distributor_buffer from 3
cachelines to 1 cacheline.

-     char pad[CACHE_LINE_SIZE*3];
+    char pad[CACHE_LINE_SIZE];
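
To make the effect of the change on memory layout concrete, here is a
standalone sketch, not the real DPDK definition. The union names, the
demo_ptr field and DEMO_CACHE_LINE_SIZE are all made up for illustration,
a 64-byte cache line is assumed, and the aligned attribute is the
GCC/Clang spelling.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define DEMO_CACHE_LINE_SIZE 64

/* Hypothetical stand-in for the original layout: 3 cache lines per worker. */
union demo_buf_3cl {
	volatile int64_t demo_ptr;
	char pad[DEMO_CACHE_LINE_SIZE * 3];
} __attribute__((aligned(DEMO_CACHE_LINE_SIZE)));

/* Hypothetical stand-in for the patched layout: 1 cache line per worker. */
union demo_buf_1cl {
	volatile int64_t demo_ptr;
	char pad[DEMO_CACHE_LINE_SIZE];
} __attribute__((aligned(DEMO_CACHE_LINE_SIZE)));

/* Cache lines covered by an array of n_workers buffers of elem_size bytes. */
size_t demo_lines_used(size_t elem_size, size_t n_workers)
{
	size_t bytes = elem_size * n_workers;

	return (bytes + DEMO_CACHE_LINE_SIZE - 1) / DEMO_CACHE_LINE_SIZE;
}
```

With these definitions each worker still gets its own cache line in both
variants; for 16 workers the array shrinks from 48 lines to 16, so the
patch only changes the spacing between the per-worker lines, not whether
they are shared.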

And it does have an obvious effect:

LLC-load-misses  53159968
LLC-loads  167756282
LLC-store-misses  29012799
LLC-stores  63352541

Now it is a 31.69% load miss rate and a 45.80% store miss rate.

It lowers the load miss rate, but raises the store miss rate.
Both numbers are still very high, sadly.
But the bright side is that it decreases the time per burst and time per
packet.

The original version has:
=== Performance test of distributor ===
Time per burst:  8013
Time per packet: 250

And the patched version has:
=== Performance test of distributor ===
Time per burst:  6834
Time per packet: 213

I tried a couple of other tricks, such as adding more idle loops in
rte_distributor_get_pkt and making the rte_distributor_buffer thread_local
to each worker core. But none of these tricks had any noticeable effect.
These failures make me tend to believe that the high LLC miss rate is
related to QPI or NUMA. But my machine is not able to perf on uncore QPI
events, so this cannot be proved.
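
One cheap check that does not need uncore events is to see whether the
cores selected by the coremask really sit on both sockets. The sketch below
is only an illustration under some assumptions: it reads the Linux sysfs
topology files, it assumes that bit N of the coremask corresponds to Linux
CPU N, and the function names are made up (they are not DPDK APIs).

```c
#include <stdio.h>

/*
 * Socket (physical package) id of one logical CPU, read from sysfs.
 * sysfs_cpu_root is normally "/sys/devices/system/cpu".
 * Returns -1 if the file cannot be opened or parsed.
 */
int demo_cpu_package_id(const char *sysfs_cpu_root, int cpu)
{
	char path[256];
	int id = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/cpu%d/topology/physical_package_id",
		 sysfs_cpu_root, cpu);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &id) != 1)
		id = -1;
	fclose(f);
	return id;
}

/*
 * Number of CPUs selected by coremask that are on the given socket.
 * Assumption: bit N of the mask means Linux CPU N. Only non-negative
 * socket ids are counted; unreadable CPUs are skipped.
 */
int demo_cpus_on_socket(const char *sysfs_cpu_root,
			unsigned long long coremask, int socket)
{
	int cpu, count = 0;

	if (socket < 0)
		return 0;
	for (cpu = 0; cpu < 64; cpu++) {
		if (!((coremask >> cpu) & 1ULL))
			continue;
		if (demo_cpu_package_id(sysfs_cpu_root, cpu) == socket)
			count++;
	}
	return count;
}
```

Calling demo_cpus_on_socket("/sys/devices/system/cpu", 0xff, 0) and the
same with socket 1, then repeating with the 16-core mask, would show
whether the 8-core run stays on one socket while the 16-core run spans
both, which is what the QPI guess would predict.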

I cannot draw any conclusion or reveal the root cause after all. But I
suggest a further study on the performance bottleneck so as to find a good

thx &
