[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
Arvind Narayanan
webguru2688 at gmail.com
Tue Sep 11 23:49:07 CEST 2018
Stephen and Pierre, thanks!
Pierre, all points noted.
As per Pierre's suggestions, I ran perf stat on the application. Here
are the results.
Using the pktgen default configuration, I send 100M packets on a 10G line.
This is the run where I use my_packet->tag1 for the lookup, where the
throughput drops to 8.4G/10G:
Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      47453.031698  task-clock (msec)         #    1.830 CPUs utilized
                77  context-switches          #    0.002 K/sec
                 6  cpu-migrations            #    0.000 K/sec
               868  page-faults               #    0.018 K/sec
   113,357,285,372  cycles                    #    2.389 GHz                     (49.95%)
    53,324,793,523  stalled-cycles-frontend   #   47.04% frontend cycles idle    (49.95%)
    27,161,539,189  stalled-cycles-backend    #   23.96% backend cycles idle     (49.96%)
   191,560,395,309  instructions              #    1.69  insn per cycle
                                              #    0.28  stalled cycles per insn (56.22%)
    36,872,293,868  branches                  #  777.027 M/sec                   (56.23%)
        13,801,124  branch-misses             #    0.04% of all branches         (56.24%)
    67,524,214,383  L1-dcache-loads           # 1422.969 M/sec                   (56.24%)
     1,015,922,260  L1-dcache-load-misses     #    1.50% of all L1-dcache hits   (56.26%)
       619,670,574  LLC-loads                 #   13.059 M/sec                   (56.29%)
            82,917  LLC-load-misses           #    0.01% of all LL-cache hits    (56.31%)
   <not supported>  L1-icache-loads
         2,059,915  L1-icache-load-misses                                        (56.30%)
    67,641,851,208  dTLB-loads                # 1425.448 M/sec                   (56.29%)
           151,760  dTLB-load-misses          #    0.00% of all dTLB cache hits  (50.01%)
               904  iTLB-loads                #    0.019 K/sec                   (50.01%)
            10,309  iTLB-load-misses          # 1140.38% of all iTLB cache hits  (50.00%)
   <not supported>  L1-dcache-prefetches
       528,633,571  L1-dcache-prefetch-misses #   11.140 M/sec                   (49.97%)

      25.929843368 seconds time elapsed
This is when I use a temp_key approach:
Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      42614.775381  task-clock (msec)         #    1.729 CPUs utilized
                71  context-switches          #    0.002 K/sec
                 6  cpu-migrations            #    0.000 K/sec
               869  page-faults               #    0.020 K/sec
    99,422,031,536  cycles                    #    2.333 GHz                     (49.89%)
    43,615,501,744  stalled-cycles-frontend   #   43.87% frontend cycles idle    (49.91%)
    21,325,495,955  stalled-cycles-backend    #   21.45% backend cycles idle     (49.95%)
   170,398,414,529  instructions              #    1.71  insn per cycle
                                              #    0.26  stalled cycles per insn (56.22%)
    32,543,342,205  branches                  #  763.663 M/sec                   (56.26%)
        52,276,245  branch-misses             #    0.16% of all branches         (56.30%)
    58,855,845,003  L1-dcache-loads           # 1381.114 M/sec                   (56.33%)
     1,046,059,603  L1-dcache-load-misses     #    1.78% of all L1-dcache hits   (56.34%)
       598,557,493  LLC-loads                 #   14.046 M/sec                   (56.35%)
            84,048  LLC-load-misses           #    0.01% of all LL-cache hits    (56.35%)
   <not supported>  L1-icache-loads
         2,150,306  L1-icache-load-misses                                        (56.33%)
    58,942,694,476  dTLB-loads                # 1383.152 M/sec                   (56.29%)
           147,013  dTLB-load-misses          #    0.00% of all dTLB cache hits  (49.97%)
            22,392  iTLB-loads                #    0.525 K/sec                   (49.93%)
             5,839  iTLB-load-misses          #   26.08% of all iTLB cache hits  (49.90%)
   <not supported>  L1-dcache-prefetches
       533,602,543  L1-dcache-prefetch-misses #   12.522 M/sec                   (49.89%)

      24.647230934 seconds time elapsed
Not sure if I am understanding it correctly, but there are a lot of
iTLB-load-misses in the lower-throughput perf stat output.
> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, 512 cache lines is
> not much ..... If the bursts you are processing happen to be more than
> approx 128 buffers, then you will be thrashing the cache when running your
> loop. I would notice that you use a pool of 8192 of your buffers, and if
> you use them round-robin, then you have a perfect recipe for cache
> thrashing. If so, then prefetch would help.
>
You raised a very good point here. I think DPDK's writing efficient code
page <https://doc.dpdk.org/guides/prog_guide/writing_efficient_code.html>
could have a section on this topic to help readers understand how this
sizing works; or perhaps I missed it if DPDK already documents how to
choose RX and TX ring sizes. Without knowing the compute load of each part
of the data path, people like me just assign arbitrary 2^n values (I blame
myself here, though).
My current sizes are:
- rte_mbuf pool size: 4096
- rx_ring and tx_ring sizes: 1024
- rings used to communicate between cores: 8192
- my_packet mempool: 8192
- MAX_BURST_SIZE for all the loops in the DPDK application: 32
> It is not clear from your descriptions if the core which reads the bursts
> from dpdk PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through perf output above)
>
I isolate CPUs 1,2,3,4,5 from the kernel, leaving core 0 for kernel
operations. Core 2 (which runs an infinite RX/TX loop) reads the packets
from the DPDK PMD and sets the tag1 values, while Core 4 looks up the
rte_hash table using tag1 as the key and proceeds further.
>
> It seems you have this type of processor (codename sandybridge, 6 cores,
> hyperthread is enabled)
>
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double check that your application run with the right core
> assignment ? Since hyperthreading is enabled, you should not use 0 (plenty
> functions for the linux kernel run on core 0) nor core 6 (which is the same
> hardware than core 0) and make sure the hyperthread corresponding to the
> core you are running is not used either. You can get the CPU<-->Core
> assignment with lscpu tool
>
I had HT disabled for all the experiments.
Here is the output of lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
Thanks,
Arvind
On Tue, Sep 11, 2018 at 2:36 PM Pierre Laurent <pierre at emutex.com> wrote:
>
> Can I suggest a few steps for investigating more ?
>
> First, verify that the L1 cache really is the suspect. This can be done
> simply with the perf utility and the counter L1-dcache-load-misses. The
> simplest tool is "perf", which is part of the linux-tools packages:
>
> $ apt-get install linux-tools-common linux-tools-generic
> linux-tools-`uname -r`
>
> $ sudo perf stat -d -d -d ./build/rxtx
> EAL: Detected 12 lcore(s)
> ....
>
> ^C
>
> Performance counter stats for './build/rxtx':
>
>       1413.787490  task-clock (msec)         #    0.923 CPUs utilized
>                18  context-switches          #    0.013 K/sec
>                 4  cpu-migrations            #    0.003 K/sec
>               238  page-faults               #    0.168 K/sec
>     4,436,904,124  cycles                    #    3.138 GHz                     (32.67%)
>     3,888,094,815  stalled-cycles-frontend   #   87.63% frontend cycles idle    (32.94%)
>       237,378,065  instructions              #    0.05  insn per cycle
>                                              #   16.38  stalled cycles per insn (39.73%)
>        76,863,834  branches                  #   54.367 M/sec                   (40.01%)
>           101,550  branch-misses             #    0.13% of all branches         (40.30%)
>        94,805,298  L1-dcache-loads           #   67.058 M/sec                   (39.77%)
>       263,530,291  L1-dcache-load-misses     #  277.97% of all L1-dcache hits   (13.77%)
>           425,934  LLC-loads                 #    0.301 M/sec                   (13.60%)
>           181,295  LLC-load-misses           #   42.56% of all LL-cache hits    (20.21%)
>   <not supported>  L1-icache-loads
>           775,365  L1-icache-load-misses                                        (26.71%)
>        70,580,827  dTLB-loads                #   49.923 M/sec                   (25.46%)
>             2,474  dTLB-load-misses          #    0.00% of all dTLB cache hits  (13.01%)
>               277  iTLB-loads                #    0.196 K/sec                   (13.01%)
>               994  iTLB-load-misses          #  358.84% of all iTLB cache hits  (19.52%)
>   <not supported>  L1-dcache-prefetches
>             7,204  L1-dcache-prefetch-misses #    0.005 M/sec                   (26.03%)
>
>       1.531809863 seconds time elapsed
>
>
> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, 512 cache lines is
> not much ..... If the bursts you are processing happen to be more than
> approx 128 buffers, then you will be thrashing the cache when running your
> loop. I would notice that you use a pool of 8192 of your buffers, and if
> you use them round-robin, then you have a perfect recipe for cache
> thrashing. If so, then prefetch would help.
>
> rte_hash_lookup looks into cache lines too (at least 3 per successful
> invocation). If you use the same key, then rte_hash_lookup will look into
> the same cache lines. If your keys are randomly distributed, then that is
> another recipe for cache thrashing.
>
>
> It is not clear from your descriptions if the core which reads the bursts
> from dpdk PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through perf output above)
>
>
> It seems you have this type of processor (codename sandybridge, 6 cores,
> hyperthread is enabled)
>
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0
> (plenty of functions for the linux kernel run on core 0) nor core 6 (which
> is the same hardware as core 0), and make sure the hyperthread
> corresponding to the core you are running on is not used either. You can
> get the CPU<-->Core assignment with the lscpu tool:
>
> $ lscpu -p
> # The following is the parsable format, which can be fed to other
> # programs. Each different item in every column has an unique ID
> # starting from zero.
> # CPU,Core,Socket,Node,,L1d,L1i,L2,L3
> 0,0,0,0,,0,0,0,0
> 1,1,0,0,,1,1,1,0
> 2,2,0,0,,2,2,2,0
> 3,3,0,0,,3,3,3,0
> 4,4,0,0,,4,4,4,0
> 5,5,0,0,,5,5,5,0
> 6,0,0,0,,0,0,0,0
> 7,1,0,0,,1,1,1,0
> 8,2,0,0,,2,2,2,0
> 9,3,0,0,,3,3,3,0
> 10,4,0,0,,4,4,4,0
> 11,5,0,0,,5,5,5,0
> If you do not need hyperthreading, and if L1 cache is your bottleneck, you
> might want to disable hyperthreading so each thread keeps the full L1 cache
> of its core instead of sharing it. If you really need hyperthreading, then
> use less cache in your code by better tuning the buffer pool sizes.
>
>
> SW prefetch is quite difficult to use efficiently. There are 4 different
> hardware prefetchers with different algorithms (adjacent cache lines,
> stride access ...) that can make the prefetch instruction unnecessary, and
> there is a hw limit of about 8 pending L1 data cache misses (sometimes
> documented as 5, sometimes documented as 10 ..). This creates a serious
> software-complexity burden to abide by the hw rules.
>
>
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
> Just verify that the hardware prefetchers are all enabled through MSR
> 0x1A4. Some BIOSes might have created a different setup.
>
>
>
> On 11/09/18 19:07, Stephen Hemminger wrote:
>
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688 at gmail.com> wrote:
>
>
> If I don't do any processing, I easily get 10G. It is only when I access
> the tag when the throughput drops.
> What confuses me is if I use the following snippet, it works at line rate.
>
> ```
> int temp_key = 1; // declared outside of the for loop
>
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>     }
> }
> ```
>
> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
> fall in throughput (which in a way confirms the issue is due to cache
> misses).
>
>
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is
> done too early it won't help. And if prefetch is done just before the data
> is used, then there aren't enough cycles to get it from memory to the cache.
>