[dpdk-dev] LLC miss in librte_distributor

jigsaw jigsaw at gmail.com
Wed Nov 12 18:11:08 CET 2014


Hi Bruce,

Thanks for your reply.

I agree that logically dividing the distributor functionality is the best
solution.

Meanwhile I tried some tricks and the result looks good: for the same number
of pkts (1M), the LLC stores and loads decrease by 90%, and the miss rates
for both decrease to 25%.
The L1 miss rate increases a bit, though.
The combined result is that the time spent decreases by 50%.
The main change I made is to use a FIFO to transfer the pkts from the
distributor to the worker, while the current buf is only used as a signalling
channel. This change clearly reduces LLC accesses.
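
A minimal sketch of the FIFO idea, not the actual patch: one
single-producer/single-consumer ring per worker, into which the distributor
enqueues bursts of pkt pointers and from which the worker dequeues them. The
names (pkt_fifo, fifo_enqueue_burst, fifo_dequeue_burst), the 64-entry
capacity and the 64-byte cache line are assumptions made for illustration;
the signalling through the existing buf is not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FIFO_SIZE  64u               /* hypothetical capacity, power of two */
#define FIFO_MASK  (FIFO_SIZE - 1u)
#define CACHE_LINE 64                /* assumed cache-line size in bytes */

static_assert((FIFO_SIZE & FIFO_MASK) == 0, "FIFO_SIZE must be a power of two");

/* One FIFO per worker. head and tail sit on separate cache lines so the
 * distributor (producer) and the worker (consumer) never write the same line. */
struct pkt_fifo {
    _Alignas(CACHE_LINE) _Atomic uint32_t head;   /* written by producer only */
    _Alignas(CACHE_LINE) _Atomic uint32_t tail;   /* written by consumer only */
    _Alignas(CACHE_LINE) void *slots[FIFO_SIZE];
};

static void fifo_init(struct pkt_fifo *f)
{
    atomic_init(&f->head, 0);
    atomic_init(&f->tail, 0);
}

/* Distributor side: enqueue up to n pkts, return how many were accepted. */
static unsigned fifo_enqueue_burst(struct pkt_fifo *f, void *const *pkts,
                                   unsigned n)
{
    uint32_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    uint32_t free_slots = FIFO_SIZE - (head - tail);

    if (n > free_slots)
        n = free_slots;
    for (unsigned i = 0; i < n; i++)
        f->slots[(head + i) & FIFO_MASK] = pkts[i];
    /* Publish the slots before the new head becomes visible to the worker. */
    atomic_store_explicit(&f->head, head + n, memory_order_release);
    return n;
}

/* Worker side: dequeue up to n pkts, return how many were fetched. */
static unsigned fifo_dequeue_burst(struct pkt_fifo *f, void **pkts, unsigned n)
{
    uint32_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    uint32_t avail = head - tail;

    if (n > avail)
        n = avail;
    for (unsigned i = 0; i < n; i++)
        pkts[i] = f->slots[(tail + i) & FIFO_MASK];
    /* Release the slots back to the distributor. */
    atomic_store_explicit(&f->tail, tail + n, memory_order_release);
    return n;
}
```

In this sketch each index has exactly one writer, and pkts move in bursts
rather than one at a time, which is the property that would be expected to
cut the number of LLC transactions per pkt.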

However, the test is based on a simple test program rather than on a DPDK
application. So I will try the same tricks on DPDK and see whether they have
the same effect.
Besides, I need more time to read a few more papers to get it right.

I will try to propose a patch if I manage to get a positive result. It will
take several days because I'm not fully dedicated to this issue.

I will come back with more details.

BTW, I have another user story: a worker can ask the distributor to schedule
a pkt.
It arises in this situation: after processing a pkt with tag value 1, the
worker changes its tag to 2, so the distributor has to be asked to deliver
the pkt with the new tag value to the proper worker.
I already have the patch ready, but I will hold it back until the previous
patch is committed.
I would also like your comments on this user story.
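
To make the user story concrete, here is a rough sketch under stated
assumptions; it is neither the patch mentioned above nor the
librte_distributor API. It assumes that the "proper worker" is the one
already processing the new tag, with a fallback to any idle worker, and that
tag 0 means "idle". All names (sched_state, pick_worker, reschedule) are
hypothetical.

```c
#include <stdint.h>

#define NUM_WORKERS 4u   /* hypothetical worker count */
#define NO_TAG      0u   /* assumption: tag 0 marks an idle worker */

/* Tag each worker is currently processing, as tracked by the distributor. */
struct sched_state {
    uint32_t in_flight_tag[NUM_WORKERS];
};

/* Choose the worker for a pkt carrying `tag` (assumed non-zero):
 * a worker already handling that tag wins, otherwise the first idle worker,
 * otherwise -1, meaning the caller has to keep the pkt in a backlog. */
static int pick_worker(const struct sched_state *s, uint32_t tag)
{
    int idle = -1;

    for (unsigned w = 0; w < NUM_WORKERS; w++) {
        if (s->in_flight_tag[w] == tag)
            return (int)w;
        if (idle < 0 && s->in_flight_tag[w] == NO_TAG)
            idle = (int)w;
    }
    return idle;
}

/* Worker `from` has finished a pkt and changed its tag to `new_tag`;
 * return the worker that should receive the pkt next, or -1 if none. */
static int reschedule(struct sched_state *s, unsigned from, uint32_t new_tag)
{
    int to;

    s->in_flight_tag[from] = NO_TAG;   /* the old tag is done on `from` */
    to = pick_worker(s, new_tag);
    if (to >= 0)
        s->in_flight_tag[to] = new_tag;
    return to;
}
```

For example, with worker 0 on tag 1 and worker 1 on tag 2, a request from
worker 0 to reschedule its pkt with tag 2 sends the pkt to worker 1.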

thx &
rgds,
-ql

On Wed, Nov 12, 2014 at 6:07 PM, Bruce Richardson <
bruce.richardson at intel.com> wrote:

> On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
> > Hi,
> >
> > OK it is now very clear it is due to memory transactions between
> > different nodes.
> >
> > The test program is here:
> > https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b
> >
> > The test machine topology is:
> >
> > NUMA node0 CPU(s):     0-7,16-23
> > NUMA node1 CPU(s):     8-15,24-31
> >
> > Change the 3rd param from 0 to 1 at line 135, and the LLC load miss rate
> > jumps from 0.09% to 33.45%.
> > The LLC store miss rate jumps from 0.027% to 50.695%.
> >
> > Clearly the root cause is transactions crossing the node boundary.
> >
> > But then how to resolve this problem is another topic...
> >
> > thx &
> > rgds,
> > -ql
> >
> >
>
> Having traffic cross QPI is always a problem, and there could be a number
> of ways to solve it. Probably the best solution is to have multiple NICs
> with some directly connected to each socket, with the packets from each NIC
> processed locally on the socket that NIC is connected to.
>
> If that is not possible, then other solutions need to be looked at. E.g.
> for an app wanting to use a distributor, I would suggest investigating if
> two distributors could be used - one on each socket. Then use a ring to
> burst-transfer large groups of packets from one socket to another and then
> use the distributor locally.
> This would involve far less QPI traffic than using a distributor with
> remote workers.
>
> Regards,
> /Bruce
>
> >
> > On Tue, Nov 11, 2014 at 5:37 PM, jigsaw <jigsaw at gmail.com> wrote:
> >
> > > Hi Bruce,
> > >
> > > I noticed that librte_distributor has quite a severe LLC miss problem
> > > when running on 16 cores.
> > > On 8 cores there is no such problem.
> > > The test runs on an Intel(R) Xeon(R) CPU E5-2670, a SandyBridge with 32
> > > logical cores on 2 sockets.
> > >
> > > The test case is the distributor_perf_autotest, i.e.
> > > in app/test/test_distributor_perf.c.
> > > The test results are collected with the command:
> > >
> > > perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test -cff -n2 --no-huge
> > >
> > > Note that the test results show that with or without hugepages, the LLC
> > > miss rate remains the same. So I will just show the --no-huge config.
> > >
> > > With 8 cores, the LLC miss rate is OK:
> > >
> > > LLC-load-misses  26750
> > > LLC-loads  93979233
> > > LLC-store-misses  432263
> > > LLC-stores  69954746
> > >
> > > That is a 0.028% load miss rate and a 0.62% store miss rate.
> > >
> > > With 16 cores, the LLC miss rate is very high:
> > >
> > > LLC-load-misses  70263520
> > > LLC-loads  143807657
> > > LLC-store-misses  23115990
> > > LLC-stores  63692854
> > >
> > > That is a 48.9% load miss rate and a 36.3% store miss rate.
> > >
> > > Most of the load misses happen at the first line of
> > > rte_distributor_poll_pkt.
> > > Most of the store misses happen at ... I don't know, because perf
> > > record on LLC-store-misses brings down my machine.
> > >
> > > It's not clear to me how this could happen: 8 cores are fine, but 16
> > > cores are very bad.
> > > My guess is that 16 cores bring in more QPI transactions between
> > > sockets?
> > > Or 16 cores bring a different LLC access pattern?
> > >
> > > So I tried to reduce the padding inside union rte_distributor_buffer
> > > from 3 cachelines to 1 cacheline.
> > >
> > > -     char pad[CACHE_LINE_SIZE*3];
> > > +    char pad[CACHE_LINE_SIZE];
> > >
> > > And it does have an obvious effect:
> > >
> > > LLC-load-misses  53159968
> > > LLC-loads  167756282
> > > LLC-store-misses  29012799
> > > LLC-stores  63352541
> > >
> > > Now it is a 31.69% load miss rate and a 45.80% store miss rate.
> > >
> > > It lowers the load miss rate, but raises the store miss rate.
> > > Both numbers are still very high, sadly.
> > > But the bright side is that it decreases the time per burst and the
> > > time per packet.
> > >
> > > The original version has:
> > > === Performance test of distributor ===
> > > Time per burst:  8013
> > > Time per packet: 250
> > >
> > > And the patched version has:
> > > === Performance test of distributor ===
> > > Time per burst:  6834
> > > Time per packet: 213
> > >
> > >
> > > I tried a couple of other tricks, such as adding more idle loops in
> > > rte_distributor_get_pkt, and making the rte_distributor_buffer
> > > thread_local to each worker core. But none of these tricks has any
> > > noticeable effect. These failures make me tend to believe the high LLC
> > > miss rate is related to QPI or NUMA. But my machine is not able to run
> > > perf on uncore QPI events, so this cannot be proven.
> > >
> > >
> > > I cannot draw any conclusion or reveal the root cause after all. But I
> > > suggest a further study on the performance bottleneck so as to find a
> > > good solution.
> > >
> > > thx &
> > > rgds,
> > > -qinglai
> > >
> > >
>

