[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Arvind Narayanan webguru2688 at gmail.com
Tue Sep 11 23:49:07 CEST 2018


Stephen and Pierre, thanks!
Pierre, all points noted.

As per Pierre's suggestions, I ran perf stat on the application. Here
are the results:

Using pktgen's default configuration, I send 100M packets over a 10G line.

This is when I use my_packet->tag1 for the lookup, where the throughput drops to
8.4G/10G:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      47453.031698      task-clock (msec)         #    1.830 CPUs utilized
                77      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               868      page-faults               #    0.018 K/sec
   113,357,285,372      cycles                    #    2.389 GHz                      (49.95%)
    53,324,793,523      stalled-cycles-frontend   #   47.04% frontend cycles idle     (49.95%)
    27,161,539,189      stalled-cycles-backend    #   23.96% backend cycles idle      (49.96%)
   191,560,395,309      instructions              #    1.69  insn per cycle
                                                  #    0.28  stalled cycles per insn  (56.22%)
    36,872,293,868      branches                  #  777.027 M/sec                    (56.23%)
        13,801,124      branch-misses             #    0.04% of all branches          (56.24%)
    67,524,214,383      L1-dcache-loads           # 1422.969 M/sec                    (56.24%)
     1,015,922,260      L1-dcache-load-misses     #    1.50% of all L1-dcache hits    (56.26%)
       619,670,574      LLC-loads                 #   13.059 M/sec                    (56.29%)
            82,917      LLC-load-misses           #    0.01% of all LL-cache hits     (56.31%)
   <not supported>      L1-icache-loads
         2,059,915      L1-icache-load-misses                                         (56.30%)
    67,641,851,208      dTLB-loads                # 1425.448 M/sec                    (56.29%)
           151,760      dTLB-load-misses          #    0.00% of all dTLB cache hits   (50.01%)
               904      iTLB-loads                #    0.019 K/sec                    (50.01%)
            10,309      iTLB-load-misses          # 1140.38% of all iTLB cache hits   (50.00%)
   <not supported>      L1-dcache-prefetches
       528,633,571      L1-dcache-prefetch-misses #   11.140 M/sec                    (49.97%)

      25.929843368 seconds time elapsed




This is when I use a temp_key approach:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      42614.775381      task-clock (msec)         #    1.729 CPUs utilized
                71      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               869      page-faults               #    0.020 K/sec
    99,422,031,536      cycles                    #    2.333 GHz                      (49.89%)
    43,615,501,744      stalled-cycles-frontend   #   43.87% frontend cycles idle     (49.91%)
    21,325,495,955      stalled-cycles-backend    #   21.45% backend cycles idle      (49.95%)
   170,398,414,529      instructions              #    1.71  insn per cycle
                                                  #    0.26  stalled cycles per insn  (56.22%)
    32,543,342,205      branches                  #  763.663 M/sec                    (56.26%)
        52,276,245      branch-misses             #    0.16% of all branches          (56.30%)
    58,855,845,003      L1-dcache-loads           # 1381.114 M/sec                    (56.33%)
     1,046,059,603      L1-dcache-load-misses     #    1.78% of all L1-dcache hits    (56.34%)
       598,557,493      LLC-loads                 #   14.046 M/sec                    (56.35%)
            84,048      LLC-load-misses           #    0.01% of all LL-cache hits     (56.35%)
   <not supported>      L1-icache-loads
         2,150,306      L1-icache-load-misses                                         (56.33%)
    58,942,694,476      dTLB-loads                # 1383.152 M/sec                    (56.29%)
           147,013      dTLB-load-misses          #    0.00% of all dTLB cache hits   (49.97%)
            22,392      iTLB-loads                #    0.525 K/sec                    (49.93%)
             5,839      iTLB-load-misses          #   26.08% of all iTLB cache hits   (49.90%)
   <not supported>      L1-dcache-prefetches
       533,602,543      L1-dcache-prefetch-misses #   12.522 M/sec                    (49.89%)

      24.647230934 seconds time elapsed


I am not sure I am reading it correctly, but the iTLB-load-miss ratio is much
higher in the lower-throughput perf stat output.

> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, 512 cache lines is
> not much ..... If the bursts you are processing happen to be more than
> approx 128 buffers, then you will be thrashing the cache when running your
> loop. I notice that you use a pool of 8192 of your buffers, and if
> you use them round-robin, then you have a perfect recipe for cache
> thrashing. If so, then prefetch would help.
>

You raised a very good point here. I think DPDK's writing efficient code
page <https://doc.dpdk.org/guides/prog_guide/writing_efficient_code.html>
could maybe add a section on this topic to help readers understand how this
sizing matters, or maybe I simply missed it if DPDK already documents how to
choose RX and TX ring sizes. Without knowing the compute load of each stage
on the data path, people like me just pick arbitrary 2^n values (I blame
myself here, though).

rte_mbuf pool size is 4096
rx_ring and tx_ring sizes are 1024
rings used to communicate between cores are 8192
my_packet mempool is 8192
MAX_BURST_SIZE for all the loops in the DPDK application is set to 32
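
For context, here is a minimal sketch of how those sizes are wired up (the
pool/ring names, the my_packet layout, and the missing error checks are
simplifications for illustration, not my exact setup code):

```
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ring.h>

#define NB_MBUF        4096   /* rte_mbuf pool size */
#define RX_RING_SIZE   1024   /* RX descriptor ring */
#define TX_RING_SIZE   1024   /* TX descriptor ring */
#define CORE_RING_SIZE 8192   /* ring between cores */
#define NB_MY_PACKET   8192   /* my_packet mempool */
#define MAX_BURST_SIZE 32

/* simplified; the real struct carries more metadata */
struct my_packet {
    struct rte_mbuf *mbuf;
    uint32_t tag1;
};

static void
setup_port(uint16_t port, int socket)
{
    /* mbuf pool handed to the PMD for RX */
    struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
        "mbuf_pool", NB_MBUF, 256 /* per-lcore cache */, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, socket);

    /* one RX and one TX queue with the descriptor counts above */
    rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE, socket, NULL, mbuf_pool);
    rte_eth_tx_queue_setup(port, 0, TX_RING_SIZE, socket, NULL);

    /* pool of my_packet metadata objects */
    struct rte_mempool *my_packet_pool = rte_mempool_create(
        "my_packet_pool", NB_MY_PACKET, sizeof(struct my_packet),
        256, 0, NULL, NULL, NULL, NULL, socket, 0);

    /* single-producer/single-consumer ring from the RX core to the worker core */
    struct rte_ring *work_ring = rte_ring_create(
        "work_ring", CORE_RING_SIZE, socket, RING_F_SP_ENQ | RING_F_SC_DEQ);

    (void)my_packet_pool;
    (void)work_ring;
}
```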

> It is not clear from your descriptions if the core which reads the bursts
> from the DPDK PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through the perf output above).
>

I isolate CPUs 1-5 from the kernel, leaving core 0 for kernel operations.
Core 2 (which runs an infinite RX/TX loop) reads the packets from the DPDK PMD
and sets the tag1 values, while Core 4 looks up the rte_hash table using tag1
as the key and proceeds further.
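
To make that split concrete, here is a simplified sketch of the two loops (the
classify() helper and the my_packet layout are placeholders for my real code,
and the look-ahead rte_prefetch0() is only the kind of prefetching I have been
experimenting with, not something I claim is correct yet):

```
#include <rte_ethdev.h>
#include <rte_hash.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_prefetch.h>
#include <rte_ring.h>

#define MAX_BURST_SIZE 32

struct my_packet {            /* same illustrative layout as above */
    struct rte_mbuf *mbuf;
    uint32_t tag1;
};

static inline uint32_t
classify(struct rte_mbuf *m)  /* placeholder for my tagging logic */
{
    (void)m;
    return 1;
}

/* Core 2: infinite RX/TX loop -- pull a burst, tag it, pass it to the worker */
static void
rx_loop(uint16_t port, struct rte_mempool *my_packet_pool, struct rte_ring *work_ring)
{
    struct rte_mbuf *bufs[MAX_BURST_SIZE];
    struct my_packet *pkts[MAX_BURST_SIZE];

    for (;;) {
        uint16_t i, nb = rte_eth_rx_burst(port, 0, bufs, MAX_BURST_SIZE);
        if (nb == 0)
            continue;
        if (rte_mempool_get_bulk(my_packet_pool, (void **)pkts, nb) < 0) {
            for (i = 0; i < nb; i++)      /* out of metadata objects: drop */
                rte_pktmbuf_free(bufs[i]);
            continue;
        }
        for (i = 0; i < nb; i++) {
            pkts[i]->mbuf = bufs[i];
            pkts[i]->tag1 = classify(bufs[i]);
        }
        /* ring-full handling omitted in this sketch */
        rte_ring_enqueue_burst(work_ring, (void **)pkts, nb, NULL);
    }
}

/* Core 4: worker loop -- look up tag1 in the rte_hash table and process */
static void
worker_loop(struct rte_ring *work_ring, struct rte_hash *rx_table)
{
    struct my_packet *pkts[MAX_BURST_SIZE];
    void *val;

    for (;;) {
        unsigned int i, nb = rte_ring_dequeue_burst(work_ring, (void **)pkts,
                                                    MAX_BURST_SIZE, NULL);
        for (i = 0; i < nb; i++) {
            if (i + 1 < nb)
                rte_prefetch0(pkts[i + 1]);   /* look ahead one my_packet */
            if (rte_hash_lookup_data(rx_table, &pkts[i]->tag1, &val) < 0)
                continue;                     /* miss: drop or take default path */
            /* ... process using val, then TX or free ... */
        }
    }
}
```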

>
> It seems you have this type of processor (codename Sandy Bridge, 6 cores,
> hyperthreading enabled):
>
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double-check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0 (plenty
> of functions of the Linux kernel run on core 0) nor core 6 (which is the same
> hardware as core 0), and make sure the hyperthread corresponding to the
> core you are running on is not used either. You can get the CPU<-->Core
> assignment with the lscpu tool.
>
I had HT disabled for all the experiments.

Here is the output of lscpu -p:

# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0

Thanks,
Arvind

On Tue, Sep 11, 2018 at 2:36 PM Pierre Laurent <pierre at emutex.com> wrote:

>
> Can I suggest a few steps for investigating further?
>
> First, verify that the L1 cache really is the suspect. This can be
> done simply with the perf utility and the counter L1-dcache-load-misses. The
> simplest tool is "perf", which is part of the linux-tools packages:
>
> $ apt-get install linux-tools-common linux-tools-generic
> linux-tools-`uname -r`
>
> $ sudo perf stat -d -d -d ./build/rxtx
> EAL: Detected 12 lcore(s)
> ....
>
> ^C
>
>  Performance counter stats for './build/rxtx':
>
>        1413.787490      task-clock (msec)         #    0.923 CPUs utilized
>                 18      context-switches          #    0.013 K/sec
>                  4      cpu-migrations            #    0.003 K/sec
>                238      page-faults               #    0.168 K/sec
>      4,436,904,124      cycles                    #    3.138 GHz                      (32.67%)
>      3,888,094,815      stalled-cycles-frontend   #   87.63% frontend cycles idle     (32.94%)
>        237,378,065      instructions              #    0.05  insn per cycle
>                                                   #   16.38  stalled cycles per insn  (39.73%)
>         76,863,834      branches                  #   54.367 M/sec                    (40.01%)
>            101,550      branch-misses             #    0.13% of all branches          (40.30%)
>         94,805,298      L1-dcache-loads           #   67.058 M/sec                    (39.77%)
>        263,530,291      L1-dcache-load-misses     #  277.97% of all L1-dcache hits    (13.77%)
>            425,934      LLC-loads                 #    0.301 M/sec                    (13.60%)
>            181,295      LLC-load-misses           #   42.56% of all LL-cache hits     (20.21%)
>    <not supported>      L1-icache-loads
>            775,365      L1-icache-load-misses                                         (26.71%)
>         70,580,827      dTLB-loads                #   49.923 M/sec                    (25.46%)
>              2,474      dTLB-load-misses          #    0.00% of all dTLB cache hits   (13.01%)
>                277      iTLB-loads                #    0.196 K/sec                    (13.01%)
>                994      iTLB-load-misses          #  358.84% of all iTLB cache hits   (19.52%)
>    <not supported>      L1-dcache-prefetches
>              7,204      L1-dcache-prefetch-misses #    0.005 M/sec                    (26.03%)
>
>        1.531809863 seconds time elapsed
>
>
> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, 512 cache lines is
> not much ..... If the bursts you are processing happen to be more than
> approx 128 buffers, then you will be thrashing the cache when running your
> loop. I notice that you use a pool of 8192 of your buffers, and if
> you use them round-robin, then you have a perfect recipe for cache
> thrashing. If so, then prefetch would help.
>
> rte_hash_lookup looks into cache lines too (at least 3 per successful
> invocation). If you use the same key, then rte_hash_lookup will look into the
> same cache lines. If your keys are randomly distributed, then it is another
> recipe for cache thrashing.
>
>
> It is not clear from your descriptions if the core which reads the bursts
> from the DPDK PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through the perf output above).
>
>
> It seems you have this type of processor (codename Sandy Bridge, 6 cores,
> hyperthreading enabled):
>
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double-check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0 (plenty
> of functions of the Linux kernel run on core 0) nor core 6 (which is the same
> hardware as core 0), and make sure the hyperthread corresponding to the
> core you are running on is not used either. You can get the CPU<-->Core
> assignment with the lscpu tool.
>
> $ lscpu -p
> # The following is the parsable format, which can be fed to other
> # programs. Each different item in every column has an unique ID
> # starting from zero.
> # CPU,Core,Socket,Node,,L1d,L1i,L2,L3
> 0,0,0,0,,0,0,0,0
> 1,1,0,0,,1,1,1,0
> 2,2,0,0,,2,2,2,0
> 3,3,0,0,,3,3,3,0
> 4,4,0,0,,4,4,4,0
> 5,5,0,0,,5,5,5,0
> 6,0,0,0,,0,0,0,0
> 7,1,0,0,,1,1,1,0
> 8,2,0,0,,2,2,2,0
> 9,3,0,0,,3,3,3,0
> 10,4,0,0,,4,4,4,0
> 11,5,0,0,,5,5,5,0
>
> If you do not need hyperthreading, and if L1 cache is your bottleneck, you
> might need to disable hyperthreading and get 64K bytes L1 cache per core.
> If you really need hyperthreading, then use less cache in your code by
> better tuning the buffer pool sizes.
>
>
> SW prefetch is quite difficult to use efficiently. There are 4 different
> hardware prefetchers with different algorithms (adjacent cache lines, stride
> access ...) where the use of the prefetch instruction is unnecessary, and there
> is a hw limit of about 8 pending L1 data cache misses (sometimes documented
> as 5, sometimes documented as 10 ..). This creates a serious burden of
> software complexity to abide by the hw rules.
>
>
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
> . Just verify that the hardware prefetchers are all enabled through MSR 0x1A4. Some
> BIOSes might have created a different setup.
>
>
>
> On 11/09/18 19:07, Stephen Hemminger wrote:
>
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688 at gmail.com> wrote:
>
>
> If I don't do any processing, I easily get 10G. It is only when I access
> the tag that the throughput drops.
> What confuses me is that if I use the following snippet, it works at line rate.
>
> ```
> int temp_key = 1; // declared outside of the for loop
>
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>     }
> }
> ```
>
> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience a
> fall in throughput (which in a way confirms the issue is due to cache
> misses).
>
>
> Your packet data is not in cache.
> Doing prefetch can help, but it is very timing sensitive. If the prefetch is done
> before the data is available, it won't help. And if the prefetch is done just before
> the data is used, then there aren't enough cycles to get it from memory to the cache.
>
>
>

