[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Wiles, Keith keith.wiles at intel.com
Tue Sep 11 18:52:37 CEST 2018



> On Sep 11, 2018, at 10:42 AM, Arvind Narayanan <webguru2688 at gmail.com> wrote:
> 
> Keith, thanks!
> 
> My structure's size is 24 bytes, and in that particular for-loop I do not dereference the rte_mbuf pointer, so my understanding is that it shouldn't require loading 4 cache lines, correct?
> I am only looking at the tags to make a decision and then simply move ahead on the fast-path.

The mbufs do get accessed by the Rx path, so a cache line is pulled in regardless. If you are not accessing the mbuf structure or its data, then I am not sure what the problem is. Does the my_packet structure start on a cache line, and have you tried putting each structure on its own cache line using __rte_cache_aligned?
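
Something like this, for example (a sketch only; the field layout is taken from your earlier mail):

```
#include <rte_memory.h>   /* __rte_cache_aligned */
#include <rte_mbuf.h>

struct my_packet {
    struct rte_mbuf *m;
    uint16_t tag1;
    uint16_t tag2;
} __rte_cache_aligned;   /* pad each element out to its own 64-byte cache line */
```

Keep in mind the trade-off: a 24-byte element grows to a full cache line, so you lose the packing of two or three my_packet entries per line.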

Have you used VTune or some of the other tools on the Intel site?
https://software.intel.com/en-us/intel-vtune-amplifier-xe

Not sure about cost or anything. VTune is a great tool, but for me it does have a bit of a learning curve to understand the output.

A Xeon core of this type should be able to forward 64-byte frames at 10G nicely. Maybe just do the normal Rx, skip all of the processing, and send the packets back out like a dumb forwarder, to see if you can reach line rate that way. Are the NIC(s) and cores on the same socket, if you have a multi-socket system? Just shooting in the dark here.
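
For the dumb-forwarder baseline, the inner loop can be as small as this (a sketch; port_in, port_out and the queue ids are placeholders):

```
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

uint16_t port_in = 0, port_out = 1;   /* placeholders */
struct rte_mbuf *bufs[BURST_SIZE];
uint16_t nb_rx, nb_tx, k;

/* Pull a burst from the RX queue and push it straight back out. */
nb_rx = rte_eth_rx_burst(port_in, 0, bufs, BURST_SIZE);
if (nb_rx > 0) {
    nb_tx = rte_eth_tx_burst(port_out, 0, bufs, nb_rx);
    /* Free whatever the TX queue did not accept. */
    for (k = nb_tx; k < nb_rx; k++)
        rte_pktmbuf_free(bufs[k]);
}
```

If that loop holds 10G but your pipeline does not, the gap is in the my_packet handling rather than in Rx/Tx.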

Also, did you try the l2fwd or l3fwd example to see if that app can get to 10G?

> 
> I tried the method suggested in the ip_fragmentation example, with several values of PREFETCH_OFFSET -- 3 to 16 -- but none of them boosted throughput.
> 
> Here is my CPU info:
> 
> Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> Architecture:          x86_64
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              15360K
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> 
> Just to provide some more context, I isolate the fast-path CPU core from the kernel, so this core is fully dedicated to the fast-path pipeline.
> 
> The only time the performance bumps from 7.7G to ~8.4G (still not close to 10G / 100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or MEMPOOL_F_NO_CACHE_ALIGN.
> 
> Thanks,
> Arvind
> 
> ---------- Forwarded message ---------
> From: Wiles, Keith <keith.wiles at intel.com>
> Date: Tue, Sep 11, 2018 at 9:20 AM
> Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
> To: Arvind Narayanan <webguru2688 at gmail.com>
> Cc: users at dpdk.org <users at dpdk.org>
> 
> 
> 
> 
> > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688 at gmail.com> wrote:
> > 
> > Hi,
> > 
> > I am trying to write a DPDK application and finding it difficult to achieve
> > line rate on a 10G NIC. I feel this has something to do with CPU caches and
> > related optimizations, and would be grateful if someone could point me in the
> > right direction.
> > 
> > I wrap every rte_mbuf in my own structure, say, my_packet. Here is
> > my_packet's structure declaration:
> > 
> > ```
> > struct my_packet {
> >     struct rte_mbuf *m;
> >     uint16_t tag1;
> >     uint16_t tag2;
> > };
> > ```
> 
> The only problem you have created is having to pull in another cache line to access the my_packet structure. The mbuf is highly optimized to limit the number of cache lines that have to be pulled into cache for an mbuf. The mbuf structure is split between RX and TX: when doing TX you touch one of the two cache lines the mbuf occupies, and on RX you touch the other cache line; at least, that is the reason for the order of the members in the mbuf.
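> 
> You can see the split in rte_mbuf.h; abridged here (heavily elided, DPDK 18.05) just to show the two-cache-line layout:
> 
> ```
> struct rte_mbuf {
>     MARKER cacheline0;          /* first cache line: RX-oriented fields */
>     void *buf_addr;
>     /* ... rearm data, ol_flags, packet_type, pkt_len, data_len ... */
> 
>     MARKER cacheline1 __rte_cache_min_aligned;   /* second cache line */
>     struct rte_mempool *pool;
>     struct rte_mbuf *next;
>     /* ... TX offload fields ... */
> } __rte_cache_aligned;
> ```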
> 
> For the most part, accessing a packet of data takes about 2-3 cache lines loaded into the cache. Getting the prefetches issued far enough in advance for the cache lines to be in the top-level cache when you need them is hard to do. In one case, when I removed the prefetches the performance increased, not decreased. :-(
> 
> Sounds like you are hitting this problem of now loading 4 cache lines, and this causes the CPU to stall. One method is to collect the packets in a list, prefetch a number of cache lines in advance, and then start processing the first packet of data. In some cases I have seen that prefetching 3 packets' worth of cache lines helps. YMMV
> 
> You did not list the processor you are using, but Intel Xeon processors have a limit on the number of outstanding prefetches you can have at a time; I think 8 is the number. VPP at fd.io also uses this method to prefetch the data and keep the CPU from stalling.
> 
> Look at the code in examples/ip_fragmentation/main.c that prefetches mbufs and data structures; I hope that one helps. A sketch of that pattern applied to your array follows.
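> 
> Roughly like this, adapted from that example to your my_packet[] array (PREFETCH_OFFSET and handle_packet() are placeholders for your own values/code):
> 
> ```
> #define PREFETCH_OFFSET 3   /* ip_fragmentation uses 3 */
> 
> /* Warm up: prefetch the first few my_packet structures. */
> for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
>     rte_prefetch0(my_packet[i]);
> 
> /* Prefetch ahead while processing entries that are already in cache. */
> for (i = 0; i + PREFETCH_OFFSET < pkt_count; i++) {
>     rte_prefetch0(my_packet[i + PREFETCH_OFFSET]);
>     handle_packet(my_packet[i]);   /* hash lookup + decision */
> }
> 
> /* Finish the tail that has already been prefetched. */
> for (; i < pkt_count; i++)
>     handle_packet(my_packet[i]);
> ```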
> 
> > 
> > During initialization, I reserve a mempool of struct my_packet elements with
> > 8192 entries. Whenever I form my_packet entries, I get them in bursts, and
> > similarly for freeing I put them back into the pool in bursts, roughly as shown below.
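> > 
> > Roughly like this (a sketch; the cache size, names and flags shown are illustrative):
> > 
> > ```
> > #define MY_PACKET_POOL_SIZE 8192
> > #define BURST_SIZE 32
> > 
> > struct rte_mempool *my_packet_pool =
> >     rte_mempool_create("my_packet_pool",
> >         MY_PACKET_POOL_SIZE,        /* 8192 elements */
> >         sizeof(struct my_packet),   /* 24-byte element */
> >         256,                        /* per-lcore cache size */
> >         0, NULL, NULL, NULL, NULL,  /* no private data, no callbacks */
> >         rte_socket_id(),
> >         0);                         /* flags; see below */
> > 
> > struct my_packet *burst[BURST_SIZE];
> > 
> > if (rte_mempool_get_bulk(my_packet_pool, (void **)burst, BURST_SIZE) == 0) {
> >     /* ... attach mbufs, set tag1/tag2, run the fast path ... */
> >     rte_mempool_put_bulk(my_packet_pool, (void **)burst, BURST_SIZE);
> > }
> > ```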
> > 
> > So there is a loop in the datapath which touches each of these my_packet's
> > tag to make a decision.
> > 
> > ```
> > for (i = 0; i < pkt_count; i++) {
> >     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
> >                              (void **)&val[i]) < 0) {
> >     }
> > }
> > ```
> > 
> > Based on my tests, accessing &(my_packet->tag1) is what keeps me from
> > achieving line rate in the fast path. I say this because if I hardcode
> > tag1's value, I am able to achieve line rate. As a workaround, I tried
> > rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> > my_packet(s) from the my_packet[] array, but nothing seems to boost the
> > throughput.
> > 
> > I tried playing with the flags in the rte_mempool_create() call:
> > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > settles to ~8.5G after 20 or 30 seconds.
> > -- no flags gives 7.7G
> > 
> > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> > 
> > Any help or pointers are highly appreciated.
> > 
> > Thanks,
> > Arvind
> 
> Regards,
> Keith
> 

Regards,
Keith


