[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Pierre Laurent pierre at emutex.com
Tue Sep 11 21:36:42 CEST 2018


Can I suggest a few steps to investigate further?

First, verify that the L1 cache is really the suspect. This can be 
done simply with the counter L1-dcache-load-misses. The simplest tool 
for reading it is "perf", which is part of the linux-tools packages:

$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

$ sudo perf stat -d -d -d ./build/rxtx
EAL: Detected 12 lcore(s)
....

^C

  Performance counter stats for './build/rxtx':

        1413.787490      task-clock (msec)         # 0.923 CPUs utilized
                 18      context-switches          # 0.013 K/sec
                  4      cpu-migrations            # 0.003 K/sec
                238      page-faults               # 0.168 K/sec
      4,436,904,124      cycles                    # 3.138 GHz                      (32.67%)
      3,888,094,815      stalled-cycles-frontend   # 87.63% frontend cycles idle     (32.94%)
        237,378,065      instructions              # 0.05  insn per cycle
                                                   # 16.38  stalled cycles per insn  (39.73%)
         76,863,834      branches                  # 54.367 M/sec                    (40.01%)
            101,550      branch-misses             # 0.13% of all branches          (40.30%)
         94,805,298      L1-dcache-loads           # 67.058 M/sec                    (39.77%)
        263,530,291      L1-dcache-load-misses     # 277.97% of all L1-dcache hits    (13.77%)
            425,934      LLC-loads                 # 0.301 M/sec                    (13.60%)
            181,295      LLC-load-misses           # 42.56% of all LL-cache hits     (20.21%)
    <not supported>      L1-icache-loads
            775,365      L1-icache-load-misses                                       (26.71%)
         70,580,827      dTLB-loads                # 49.923 M/sec                    (25.46%)
              2,474      dTLB-load-misses          # 0.00% of all dTLB cache hits   (13.01%)
                277      iTLB-loads                # 0.196 K/sec                    (13.01%)
                994      iTLB-load-misses          # 358.84% of all iTLB cache hits   (19.52%)
    <not supported>      L1-dcache-prefetches
              7,204      L1-dcache-prefetch-misses # 0.005 M/sec                    (26.03%)

        1.531809863 seconds time elapsed

One of the common mistakes is to have excessively large tx and rx 
queues, which in turn helps trigger excessively large bursts. Your L1 
data cache is 32K, that is, 512 cache lines of 64 bytes. The L1 cache 
is not elastic, and 512 cache lines is not much. If the bursts you are 
processing happen to be more than approx 128 buffers (which corresponds 
to about 4 cache lines touched per buffer), then you will be thrashing 
the cache when running your loop. I notice that you use a pool of 8192 
of your buffers, and if you use them round-robin, then you have a 
perfect recipe for cache thrashing. If so, then prefetch would help.

rte_hash_lookup touches cache lines too (at least 3 per successful 
call). If you always use the same key, then rte_hash_lookup will touch 
the same cache lines. If your keys are randomly distributed, then that 
is another recipe for cache thrashing.

It is not clear from your description whether the core which reads the 
bursts from the DPDK PMD is the same as the core which does the 
processing. If one core touches your buffers (e.g. tag1) and you then 
pass the buffers to another core, you get LLC coherency overheads, 
which would also trigger LLC-load-misses (which you can detect in perf 
output like the one above).

It seems you have this type of processor (codename Sandy Bridge, 6 
cores, hyperthreading enabled):

https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI

Can you double check that your application runs with the right core 
assignment? Since hyperthreading is enabled, you should use neither 
CPU 0 (plenty of Linux kernel functions run on CPU 0) nor CPU 6 (which 
is the same physical core as CPU 0), and make sure the hyperthread 
sibling of the CPU you are running on is not used either. You can get 
the CPU<-->Core assignment with the lscpu tool:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,0,0,0,,0,0,0,0
7,1,0,0,,1,1,1,0
8,2,0,0,,2,2,2,0
9,3,0,0,,3,3,3,0
10,4,0,0,,4,4,4,0
11,5,0,0,,5,5,5,0

If you do not need hyperthreading, and if L1 cache is your bottleneck, 
you might need to disable hyperthreading so that a single thread gets 
the whole L1 cache of the core (64K bytes: 32K data plus 32K 
instruction) instead of sharing it with its sibling. If you really need 
hyperthreading, then use less cache in your code by better tuning the 
buffer pool sizes.

SW prefetch is quite difficult to use efficiently. There are 4 different 
hardware prefetchers with different algorithms (adjacent cache lines, 
stride access ...), and where they already cover the access pattern the 
prefetch instruction is unnecessary. There is also a hw limit of about 
8 pending L1 data cache misses (sometimes documented as 5, sometimes 
documented as 10 ..). This creates a serious burden of software 
complexity to abide by the hw rules.

The hardware prefetcher controls are described here:

https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

Just verify that the hardware prefetchers are all enabled through MSR 
0x1A4. Some BIOSes might have set up a different configuration.

On 11/09/18 19:07, Stephen Hemminger wrote:
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688 at gmail.com> wrote:
>
>> If I don't do any processing, I easily get 10G. It is only when I access
>> the tag when the throughput drops.
>> What confuses me is if I use the following snippet, it works at line rate.
>>
>> ```
>> int temp_key = 1; // declared outside of the for loop
>>
>> for (i = 0; i < pkt_count; i++) {
>>      if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>>      }
>> }
>> ```
>>
>> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
>> fall in throughput (which in a way confirms the issue is due to cache
>> misses).
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is done
> before data is available it won't help. And if prefetch is done just before
> data is used then there isn't enough cycles to get it from memory to the cache.
>
>
