[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Arvind Narayanan webguru2688 at gmail.com
Tue Sep 11 17:42:53 CEST 2018

Keith, thanks!

My structure's size is 24 bytes, and in that particular for-loop I do not
dereference the rte_mbuf pointer, so my understanding is that it shouldn't
need to load 4 cache lines, correct?
I am only looking at the tags to make a decision and then simply moving
ahead on the fast path.

I tried the method suggested in the ip_fragmentation example. I tried
several values of PREFETCH_OFFSET, from 3 to 16, but none helped boost
throughput.

Here is my CPU info:

Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Architecture:          x86_64
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
flexpriority ept vpid xsaveopt dtherm ida arat pln pts

Just to provide some more context: I isolate the CPU core used for the
fast path from the kernel, so this core is fully dedicated to the fast path.

The only time when the performance bumps from 7.7G to ~8.4G (still not
close to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or


---------- Forwarded message ---------
From: Wiles, Keith <keith.wiles at intel.com>
Date: Tue, Sep 11, 2018 at 9:20 AM
Subject: Re: [dpdk-users] How to use software prefetching for custom
structures to increase throughput on the fast path
To: Arvind Narayanan <webguru2688 at gmail.com>
Cc: users at dpdk.org <users at dpdk.org>

> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688 at gmail.com>
> wrote:
> Hi,
> I am trying to write a DPDK application and finding it difficult to achieve
> line rate on a 10G NIC. I feel this has something to do with CPU-cache
> related optimizations, and would be grateful if someone can point me in the
> right direction.
> I wrap every rte_mbuf into my own structure say, my_packet. Here is
> my_packet's structure declaration:
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is that accessing the my_packet structure
pulls in another cache line. The mbuf is highly optimized to limit the
number of cache lines that must be pulled into cache for an mbuf. The mbuf
structure is split between RX and TX: when doing TX you touch one of the two
cache lines the mbuf is contained in, and when doing RX you touch the other.
At least, that is the reason for the order of the members in the mbuf.
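
To make the cache-line accounting concrete, here is a minimal sketch that
counts how many cache lines an object touches given its address and size. It
assumes 64-byte cache lines (typical for x86_64 Xeons, but an assumption
here), and cache_lines_spanned() is a hypothetical helper, not a DPDK API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* assumption: 64-byte cache lines */

/* Hypothetical helper: number of cache lines covered by the byte range
 * [addr, addr + size). */
static size_t cache_lines_spanned(uintptr_t addr, size_t size)
{
    assert(size > 0);

    uintptr_t first = addr / CACHE_LINE_SIZE;              /* line of first byte */
    uintptr_t last = (addr + size - 1) / CACHE_LINE_SIZE;  /* line of last byte */

    return (size_t)(last - first + 1);
}
```

For example, a 24-byte object that starts at offset 0 of a line stays within
one line, while the same object starting at offset 48 straddles two. Each
separately allocated wrapper object therefore costs at least one cache line
on top of whatever the mbuf itself needs.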

For the most part, accessing a packet of data requires loading about 2-3
cache lines. Getting the prefetches far enough in advance to get the
cache lines into top level cache is hard to do. In one case, when I removed
the prefetches, the performance increased, not decreased. :-(

Sounds like you are hitting this problem of now loading 4 cache lines, and
this causes the CPU to stall. One method is to prefetch the packets in a
list, then prefetch a number of cache lines in advance, then start
processing the first packet of data. In some cases I have seen that
prefetching 3 packets' worth of cache lines helps. YMMV.

You did not list the processor you are using, but Intel Xeon processors have
a limit on the number of outstanding prefetches you can have at a time; I
think 8 is the number. Also, VPP at fd.io uses this method too, in order
to prefetch the data and not allow the CPU to stall.

Look at examples/ip_fragmentation/main.c, in particular the code that
prefetches mbufs and data structures. I hope that helps.
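
The prefetch-ahead pattern described above can be sketched in plain C as
follows. This is only an illustration under stated assumptions, not the code
from examples/ip_fragmentation/main.c: prefetch0() is a hypothetical wrapper
around the GCC/Clang __builtin_prefetch builtin standing in for
rte_prefetch0(), sum_tags_prefetched() is a made-up name, and the m field is
left as void * so the snippet compiles without DPDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_OFFSET 3 /* tuning knob; values from 3 to 16 were tried */

/* Stand-in for the wrapper from the original post; m is left opaque. */
struct my_packet {
    void *m;        /* would be struct rte_mbuf * in a DPDK build */
    uint16_t tag1;
    uint16_t tag2;
};

/* Hypothetical helper: compiler builtin used in place of rte_prefetch0(). */
static inline void prefetch0(const void *p)
{
    __builtin_prefetch(p, 0, 3); /* read access, keep in all cache levels */
}

/* Reads tag1 of every packet while prefetching PREFETCH_OFFSET packets
 * ahead. Returns the sum of the tags so the loads have a visible result. */
static uint32_t sum_tags_prefetched(struct my_packet **pkts, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    assert(pkts != NULL || n == 0);

    /* Warm-up: issue prefetches for the first few packets. */
    for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
        prefetch0(pkts[i]);

    /* Steady state: prefetch packet i + PREFETCH_OFFSET, process packet i. */
    for (i = 0; i + PREFETCH_OFFSET < n; i++) {
        prefetch0(pkts[i + PREFETCH_OFFSET]);
        sum += pkts[i]->tag1;
    }

    /* Drain: the remaining packets have already been prefetched. */
    for (; i < n; i++)
        sum += pkts[i]->tag1;

    return sum;
}
```

Whether this helps likely depends on how much work sits between issuing the
prefetch and using the data; with a very short loop body the cache line may
not have arrived by the time it is needed, so the offset needs tuning per
workload.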

> During initialization, I reserve a mempool of type struct my_packet with
> 8192 elements. Whenever I form my_packet, I get them in bursts, similarly
> for freeing I put them back into pool as bursts.
> So there is a loop in the datapath which touches each of these my_packet's
> tag to make a decision.
> ```
> for (i = 0; i < pkt_count; i++) {
>    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
>                             (void **)&val[i]) < 0) {
>    }
> }
> ```
> Based on my tests, the access to &(my_packet->tag1) is what prevents me
> from achieving line rate in the fast path. I say this because if I hardcode
> tag1's value, I am able to achieve line rate. As a workaround, I tried to
> use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> my_packet(s) from the my_packet[] array, but nothing seems to boost the
> throughput.
> I tried to play with the flags in rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> settles to ~8.5G after 20 or 30 seconds.
> -- NO FLAG gives 7.7G
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> Any help or pointers are highly appreciated.
> Thanks,
> Arvind

