[dpdk-dev] LLC miss in librte_distributor

Bruce Richardson bruce.richardson at intel.com
Wed Nov 12 17:07:09 CET 2014


On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
> Hi,
> 
> OK, it is now very clear that it is due to memory transactions between
> different NUMA nodes.
> 
> The test program is here:
> https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b
> 
> The test machine topology is:
> 
> NUMA node0 CPU(s):     0-7,16-23
> NUMA node1 CPU(s):     8-15,24-31
> 
> Change the 3rd param from 0 to 1 at line 135, and the LLC load miss rate
> jumps from 0.09% to 33.45%.
> The LLC store miss rate jumps from 0.027% to 50.695%.
> 
> Clearly the root cause is transactions crossing the node boundary.
> 
> But then how to resolve this problem is another topic...
> 
> thx &
> rgds,
> -ql
> 
> 

Having traffic cross QPI is always a problem, and there could be a number of ways
to solve it. Probably the best solution is to have multiple NICs with some 
directly connected to each socket, with the packets from each NIC processed locally
on the socket that NIC is connected to.
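
As a sanity check for that setup, an app can compare each port's NUMA socket with
the socket of the lcore that polls it at init time. A minimal sketch, assuming the
rte_eth_dev_socket_id()/rte_lcore_to_socket_id() APIs; the port/lcore pairing and
the USER1 log type are illustrative only:

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_log.h>

/* Warn if an lcore is set up to poll a NIC that sits on the other socket;
 * such a pairing means every packet crosses QPI before it is even processed. */
static void
check_port_locality(uint8_t port_id, unsigned lcore_id)
{
	int port_socket = rte_eth_dev_socket_id(port_id);
	unsigned lcore_socket = rte_lcore_to_socket_id(lcore_id);

	if (port_socket >= 0 && (unsigned)port_socket != lcore_socket)
		RTE_LOG(WARNING, USER1,
			"port %u (socket %d) polled by lcore %u (socket %u): "
			"traffic will cross QPI\n",
			port_id, port_socket, lcore_id, lcore_socket);
}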

If that is not possible, then other solutions need to be looked at. For example, for
an app wanting to use a distributor, I would suggest investigating whether two
distributors could be used - one on each socket. Then use a ring to burst-transfer
large groups of packets from one socket to the other, and use the distributor
locally on each socket. This would involve far less QPI traffic than using a single
distributor with remote workers.
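
A minimal sketch of that two-distributor arrangement, assuming the DPDK 1.x ring
and distributor APIs; the names, ring size, burst size and the split into
rx_socket0()/rx_socket1() are illustrative, not taken from the thread:

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_distributor.h>

#define XFER_BURST 64

/* One distributor per socket, plus a ring used to hand bursts of packets
 * from socket 0 over to socket 1. */
static struct rte_distributor *dist[2];
static struct rte_ring *xsock_ring;

static void
setup(unsigned workers_per_socket)
{
	dist[0] = rte_distributor_create("dist_s0", 0, workers_per_socket);
	dist[1] = rte_distributor_create("dist_s1", 1, workers_per_socket);

	/* Ring memory on the receiving socket; single producer, single consumer. */
	xsock_ring = rte_ring_create("xsock", 4096, 1,
			RING_F_SP_ENQ | RING_F_SC_DEQ);
}

/* Rx core on socket 0: distribute local packets here, push the rest across
 * in large bursts.  The enqueue return value (packets actually queued) is
 * ignored in this sketch; a real app must handle a full ring. */
static void
rx_socket0(struct rte_mbuf **local, unsigned n_local,
		struct rte_mbuf **remote, unsigned n_remote)
{
	rte_distributor_process(dist[0], local, n_local);
	rte_ring_enqueue_burst(xsock_ring, (void **)remote, n_remote);
}

/* Rx core on socket 1: pull bursts off the ring and distribute them locally. */
static void
rx_socket1(void)
{
	struct rte_mbuf *burst[XFER_BURST];
	unsigned n = rte_ring_dequeue_burst(xsock_ring, (void **)burst, XFER_BURST);

	if (n > 0)
		rte_distributor_process(dist[1], burst, n);
}

Workers then call rte_distributor_get_pkt() only on the distributor of their own
socket, so the flag cacheline traffic that shows up in rte_distributor_poll_pkt
stays node-local; the only cross-socket traffic is the burst transfer through the
ring.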

Regards,
/Bruce

> 
> On Tue, Nov 11, 2014 at 5:37 PM, jigsaw <jigsaw at gmail.com> wrote:
> 
> > Hi Bruce,
> >
> > I noticed that librte_distributor has a quite severe LLC miss problem when
> > running on 16 cores, while on 8 cores there's no such problem.
> > The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
> > cores on 2 sockets.
> >
> > The test case is the distributor_perf_autotest, i.e.
> > in app/test/test_distributor_perf.c.
> > The test result is collected by command:
> >
> > perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test
> > -cff -n2 --no-huge
> >
> > Note that test results show that with or without hugepages, the LLC miss
> > rate remains the same, so I will just show the --no-huge config.
> >
> > With 8 cores, the LLC miss rate is OK:
> >
> > LLC-load-misses  26750
> > LLC-loads  93979233
> > LLC-store-misses  432263
> > LLC-stores  69954746
> >
> > That is a 0.028% load miss rate and a 0.62% store miss rate.
> >
> > With 16 cores, the LLC miss rate is very high:
> >
> > LLC-load-misses  70263520
> > LLC-loads  143807657
> > LLC-store-misses  23115990
> > LLC-stores  63692854
> >
> > That is a 48.9% load miss rate and a 36.3% store miss rate.
> >
> > Most of the load misses happen at the first line of rte_distributor_poll_pkt.
> > Where most of the store misses happen I don't know, because perf record on
> > LLC-store-misses brings down my machine.
> >
> > It's not obvious to me how this could happen: 8 cores are fine, but
> > 16 cores are very bad.
> > My guess is that 16 cores bring in more QPI transactions between sockets,
> > or that 16 cores produce a different LLC access pattern?
> >
> > So I tried to reduce the padding inside union rte_distributor_buffer from
> > 3 cachelines to 1 cacheline.
> >
> > -     char pad[CACHE_LINE_SIZE*3];
> > +    char pad[CACHE_LINE_SIZE];
> >
> > And it does have an obvious effect:
> >
> > LLC-load-misses  53159968
> > LLC-loads  167756282
> > LLC-store-misses  29012799
> > LLC-stores  63352541
> >
> > Now it is a 31.69% load miss rate and a 45.79% store miss rate.
> >
> > It lowers the load miss rate, but raises the store miss rate.
> > Both numbers are still very high, sadly.
> > But the bright side is that it decreases the time per burst and time per
> > packet.
> >
> > The original version has:
> > === Performance test of distributor ===
> > Time per burst:  8013
> > Time per packet: 250
> >
> > And the patched ver has:
> > === Performance test of distributor ===
> > Time per burst:  6834
> > Time per packet: 213
> >
> >
> > I tried a couple of other tricks, such as adding more idle loops
> > in rte_distributor_get_pkt,
> > and making the rte_distributor_buffer thread_local to each worker core.
> > But none of these tricks
> > had any noticeable effect. These failures make me tend to believe the
> > high LLC miss rate
> > is related to QPI or NUMA. But my machine is not able to perf the uncore
> > QPI events, so this
> > cannot be confirmed.
> >
> >
> > I cannot draw any conclusion or pin down the root cause after all. But I
> > suggest further study of the performance bottleneck so as to find a good
> > solution.
> >
> > thx &
> > rgds,
> > -qinglai
> >
> >

