[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Wiles, Keith keith.wiles at intel.com
Tue Sep 11 16:20:00 CEST 2018

> On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688 at gmail.com> wrote:
> Hi,
> I am trying to write a DPDK application and finding it difficult to achieve
> line rate on a 10G NIC. I feel this has something to do with CPU caches and
> related optimizations, and would be grateful if someone can point me to the
> right direction.
> I wrap every rte_mbuf into my own structure say, my_packet. Here is
> my_packet's structure declaration:
> ```
> struct my_packet {
>     struct rte_mbuf *m;
>     uint16_t tag1;
>     uint16_t tag2;
> };
> ```

The only problem you have created is that accessing the my_packet structure pulls in another cache line. The mbuf is highly optimized to limit the number of cache lines that must be pulled into cache for each packet. The mbuf structure is split between RX and TX: on RX you touch one of the two cache lines the mbuf occupies, and on TX you touch the other. At least, that is the reason for the order of the members in the mbuf.
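
To make the extra cache line concrete, here is a minimal sketch. It assumes 64-byte cache lines, and `fake_mbuf` is a hypothetical two-line stand-in, not the real `struct rte_mbuf`. `line_of()` and `lines_touched()` are illustrative helpers, not DPDK APIs. Because the wrapper is a separate object from the mbuf, reading `tag1` and then the mbuf always lands on two different lines:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Hypothetical stand-in for an mbuf: two aligned cache lines. */
struct fake_mbuf {
    _Alignas(64) uint8_t rx_line[64]; /* fields touched on RX (illustrative) */
    uint8_t tx_line[64];              /* fields touched on TX (illustrative) */
};
static_assert(sizeof(struct fake_mbuf) == 2 * CACHE_LINE,
              "stand-in mbuf should span exactly two cache lines");

/* Wrapper from the original post, pointing at the stand-in mbuf. */
struct my_packet {
    struct fake_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
};

/* Index of the cache line that holds address p. */
static uintptr_t line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;
}

/* Distinct cache lines touched by reading p->tag1 and then the mbuf's RX line. */
static unsigned lines_touched(const struct my_packet *p)
{
    return line_of(&p->tag1) == line_of(p->m->rx_line) ? 1u : 2u;
}
```

For any wrapper that points at a separately allocated mbuf, `lines_touched()` returns 2: one line for the wrapper and one for the mbuf, before any packet data is read.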

For the most part, accessing a packet of data takes about 2-3 cache lines to be loaded into cache. Getting the prefetches issued far enough in advance for the cache lines to reach the top-level cache is hard to do. In one case, when I removed the prefetches, the performance increased rather than decreased. :-(

Sounds like you are hitting this problem: you are now loading 4 cache lines per packet, and this causes the CPU to stall. One method is to prefetch the packets in a list, then prefetch a number of cache lines in advance, and only then start processing the first packet of data. In some cases I have seen that prefetching 3 packets' worth of cache lines helps. YMMV.
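
The prefetch-ahead pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from DPDK: `prefetch0()` wraps the GCC/Clang `__builtin_prefetch` as a stand-in for `rte_prefetch0()`, `process_one()` is a hypothetical placeholder for the per-packet work (the hash lookup in the original post), and `PREFETCH_OFFSET` of 3 simply mirrors the "3 packets" figure mentioned here and should be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_OFFSET 3 /* look-ahead distance in packets; a tunable assumption */

struct my_packet {
    void *m;       /* stand-in for struct rte_mbuf * */
    uint16_t tag1;
    uint16_t tag2;
};

/* Stand-in for rte_prefetch0(): read prefetch with high temporal locality. */
static inline void prefetch0(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}

/* Hypothetical per-packet work; here it only reads the tag. */
static inline uint32_t process_one(const struct my_packet *p)
{
    return p->tag1;
}

/* Process a burst while staying PREFETCH_OFFSET packets ahead of the work. */
static uint32_t process_burst(struct my_packet **pkts, size_t n)
{
    uint32_t acc = 0;
    size_t i;

    assert(pkts != NULL || n == 0);

    /* Warm-up: prefetch the first few wrappers before touching any of them. */
    for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
        prefetch0(pkts[i]);

    /* Steady state: prefetch packet i + OFFSET, then process packet i. */
    for (i = 0; i + PREFETCH_OFFSET < n; i++) {
        prefetch0(pkts[i + PREFETCH_OFFSET]);
        acc += process_one(pkts[i]);
    }

    /* Drain: the remaining packets were already prefetched above. */
    for (; i < n; i++)
        acc += process_one(pkts[i]);

    return acc;
}
```

The sketch only prefetches the wrapper. If the per-packet work also reads the mbuf or the packet data, those lines would need their own prefetch, issued after the wrapper has arrived, since the mbuf pointer lives inside the wrapper.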

You did not list the processor you are using, but Intel Xeon processors have a limit on the number of outstanding prefetches you can have at a time; I think 8 is the number. VPP at fd.io also uses this method to prefetch the data and keep the CPU from stalling.

Look in examples/ip_fragmentation/main.c at the code that prefetches mbufs and data structures. I hope that one helps.

> During initialization, I reserve a mempool of type struct my_packet with
> 8192 elements. Whenever I form my_packet, I get them in bursts, similarly
> for freeing I put them back into pool as bursts.
> So there is a loop in the datapath which touches each of these my_packet's
> tag to make a decision.
> ```
> for (i = 0; i < pkt_count; i++) {
>     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
>                              (void **)&val[i]) < 0) {
>     }
> }
> ```
> Based on my tests, &(my_packet->tag1) is the cause for not letting me
> achieve line rate in the fast path. I say this because if I hardcode the
> tag1's value, I am able to achieve line rate. As a workaround, I tried to
> use rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> my_packet(s) from my_packet[] array, but nothing seems to boost the
> throughput.
> I tried to play with the flags in rte_mempool_create() function call:
> -- MEMPOOL_F_NO_SPREAD gives me 8.4GB throughput out of 10G
> -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> settles to ~8.5GB after 20 or 30 seconds.
> -- NO FLAG gives 7.7G
> I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> Any help or pointers are highly appreciated.
> Thanks,
> Arvind

