[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Arvind Narayanan webguru2688 at gmail.com
Tue Sep 11 19:18:42 CEST 2018


If I don't do any processing, I easily get 10G. It is only when I access
the tag that the throughput drops.
What confuses me is that if I use the following snippet, it works at line rate.

```
int temp_key = 1; // declared outside of the for loop

for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
        /* lookup miss or error: ignored in this test */
    }
}
```

But as soon as I replace `temp_key` with `my_packet->tag1`, I experience a
fall in throughput (which in a way confirms the issue is due to cache
misses).
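
For reference, here is a sketch of the loop that drops throughput together
with the kind of prefetch I experimented with (the offset constant below is
only illustrative, not the exact code):

```
#define TAG_PREFETCH_OFFSET 4  /* illustrative; I tried distances of 2 to 8 */

for (i = 0; i < pkt_count; i++) {
    /* try to pull the my_packet struct holding tag1 into cache early */
    if (i + TAG_PREFETCH_OFFSET < pkt_count)
        rte_prefetch0(my_packet[i + TAG_PREFETCH_OFFSET]);

    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
                             (void **)&val[i]) < 0) {
        /* miss: ignored, same as in the temp_key test above */
    }
}
```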

__rte_cache_aligned may not be required, as the mempool from which I pull
the pre-allocated structs already cache-aligns them. But let me try adding
it to the struct as well to make sure.
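
Something along these lines, I assume (only the struct definition gets the
attribute; __rte_cache_aligned comes from the DPDK headers, e.g.
rte_memory.h):

```
struct my_packet {
    struct rte_mbuf *m;   /* wrapped mbuf */
    uint16_t tag1;
    uint16_t tag2;
} __rte_cache_aligned;    /* start each element on its own cache line */
```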

Yes, I did come across VTune, and there is a free trial period which I
guess would help me confirm that it is due to cache misses.

l2fwd and l3fwd easily achieve 10G. :(

Thanks,
Arvind

On Tue, Sep 11, 2018 at 11:52 AM Wiles, Keith <keith.wiles at intel.com> wrote:

>
>
> > On Sep 11, 2018, at 10:42 AM, Arvind Narayanan <webguru2688 at gmail.com>
> wrote:
> >
> > Keith, thanks!
> >
> > My structure's size is 24 bytes, and for that particular for-loop I do
> > not dereference the rte_mbuf pointer, hence my understanding is that it
> > wouldn't require loading 4 cache lines, correct?
> > I am only looking at the tags to make a decision and then simply move
> > ahead on the fast path.
>
> The mbufs do get accessed by the Rx path, so a cacheline is pulled. If
> you are not accessing the mbuf structure or data, then I am not sure what
> the problem is. Is the my_packet structure starting on a cacheline, and
> have you tried putting each structure on a cacheline using __rte_cache_aligned?
>
> Have you used vtune or some of the other tools in the intel site?
> https://software.intel.com/en-us/intel-vtune-amplifier-xe
>
> Not sure about cost or anything. VTune is a great tool, but for me it does
> have some learning curve to understand the output.
>
> A Xeon core of this type should be able to forward packets nicely at 10G
> with 64-byte frames. Maybe just do the normal Rx, skip all of the
> processing, and send it back out like a dumb forwarder. Are the
> NIC(s) and cores on the same socket, if you have a multi-socket system?
> Just shooting in the dark here.
>
> Also, did you try the l2fwd or l3fwd example and see if that app can get to 10G?
>
> >
> > I tried the method suggested in the ip_fragmentation example. I tried
> > several values of PREFETCH_OFFSET -- 3 to 16 -- but none helped boost
> > throughput.
> >
> > Here is my CPU info:
> >
> > Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
> > Architecture:          x86_64
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              15360K
> > Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
> nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
> popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi
> flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> >
> > Just to provide some more context, I isolate the CPU core used from the
> kernel for fast-path, hence this core is fully dedicated to the fast-path
> pipeline.
> >
> > The only time the performance bumps from 7.7G to ~8.4G (still not close
> > to 10G/100%) is when I add flags such as MEMPOOL_F_NO_SPREAD or
> > MEMPOOL_F_NO_CACHE_ALIGN.
> >
> > Thanks,
> > Arvind
> >
> > ---------- Forwarded message ---------
> > From: Wiles, Keith <keith.wiles at intel.com>
> > Date: Tue, Sep 11, 2018 at 9:20 AM
> > Subject: Re: [dpdk-users] How to use software prefetching for custom
> structures to increase throughput on the fast path
> > To: Arvind Narayanan <webguru2688 at gmail.com>
> > Cc: users at dpdk.org <users at dpdk.org>
> >
> >
> >
> >
> > > On Sep 11, 2018, at 3:15 AM, Arvind Narayanan <webguru2688 at gmail.com>
> wrote:
> > >
> > > Hi,
> > >
> > > I am trying to write a DPDK application and finding it difficult to
> > > achieve line rate on a 10G NIC. I feel this has something to do with
> > > CPU caches and related optimizations, and would be grateful if someone
> > > can point me in the right direction.
> > >
> > > I wrap every rte_mbuf into my own structure, say my_packet. Here is
> > > my_packet's structure declaration:
> > >
> > > ```
> > > struct my_packet {
> > >     struct rte_mbuf *m;
> > >     uint16_t tag1;
> > >     uint16_t tag2;
> > > };
> > > ```
> >
> > The only problem you have created is having to pull in another cache
> > line by having to access the my_packet structure. The mbuf is highly
> > optimized to limit the number of cache lines required to be pulled into
> > cache for an mbuf. The mbuf structure is split between RX and TX: when
> > doing TX you touch one of the two cache lines the mbuf is contained in,
> > and on RX you touch the other cache line; at least that is the reason
> > for the order of the members in the mbuf.
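> >
> > If memory serves, rte_mbuf.h even provides helpers to prefetch those two
> > halves explicitly; a rough sketch of using them (not code from your
> > application):
> >
> > ```
> > #include <rte_mbuf.h>
> >
> > /* Pull both mbuf cache lines into cache before touching the mbuf. */
> > static inline void prefetch_whole_mbuf(struct rte_mbuf *m)
> > {
> >     rte_mbuf_prefetch_part1(m);   /* first cache line (Rx-side fields) */
> >     rte_mbuf_prefetch_part2(m);   /* second cache line (Tx-side fields) */
> > }
> > ```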
> >
> > For the most part, accessing a packet of data takes about 2-3 cache
> > lines to load into memory. Getting the prefetches far enough in advance
> > to get the cache lines into top-level cache is hard to do. In one case,
> > when I removed the prefetches the performance increased, not decreased. :-(
> >
> > Sounds like you are hitting this problem of now loading 4 cache lines,
> > and this causes the CPU to stall. One method is to prefetch the packets
> > in a list, then prefetch a number of cache lines in advance, and then
> > start processing the first packet of data. In some cases I have seen that
> > prefetching 3 packets' worth of cache lines helps. YMMV
> >
> > You did not list the processor you are using, but Intel Xeon processors
> > have a limit on the number of outstanding prefetches you can have at a
> > time; I think 8 is the number. Also, VPP at fd.io uses this method to
> > prefetch the data and not let the CPU stall.
> >
> > Look at the code in examples/ip_fragmentation/main.c that prefetches
> > mbufs and data structures. I hope that one helps.
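> >
> > The pattern there is roughly the following (a sketch adapted to your
> > my_packet array, not the exact code from that example; PREFETCH_OFFSET
> > and handle_packet() are placeholders):
> >
> > ```
> > #include <rte_prefetch.h>
> >
> > #define PREFETCH_OFFSET 3                /* tune this; you tried 3 to 16 */
> >
> > void handle_packet(struct my_packet *p); /* the tag lookup, etc. */
> >
> > void process_burst(struct my_packet **my_packet, int pkt_count)
> > {
> >     int i;
> >
> >     /* Prefetch the first few my_packet structs. */
> >     for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
> >         rte_prefetch0(my_packet[i]);
> >
> >     /* Prefetch ahead while handling entries that are already in cache. */
> >     for (i = 0; i < pkt_count - PREFETCH_OFFSET; i++) {
> >         rte_prefetch0(my_packet[i + PREFETCH_OFFSET]);
> >         handle_packet(my_packet[i]);
> >     }
> >
> >     /* Handle the remaining, already-prefetched entries. */
> >     for (; i < pkt_count; i++)
> >         handle_packet(my_packet[i]);
> > }
> > ```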
> >
> > >
> > > During initialization, I reserve a mempool of type struct my_packet
> > > with 8192 elements. Whenever I form my_packet structs, I get them in
> > > bursts; similarly, for freeing, I put them back into the pool in bursts.
> > >
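> > > A rough sketch of that burst get/put (the pool handle name and the
> > > burst size here are made up for illustration):
> > >
> > > ```
> > > #include <rte_mempool.h>
> > >
> > > #define BURST_SIZE 32   /* illustrative burst size */
> > >
> > > void run_burst(struct rte_mempool *my_packet_pool)
> > > {
> > >     struct my_packet *pkts[BURST_SIZE];
> > >
> > >     if (rte_mempool_get_bulk(my_packet_pool, (void **)pkts, BURST_SIZE) != 0)
> > >         return;   /* pool temporarily short of elements */
> > >
> > >     /* ... wrap mbufs, set the tags, run the datapath loop below ... */
> > >
> > >     rte_mempool_put_bulk(my_packet_pool, (void **)pkts, BURST_SIZE);
> > > }
> > > ```
> > >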
> > > So there is a loop in the datapath which touches each of these
> > > my_packet structs' tags to make a decision.
> > >
> > > ```
> > > for (i = 0; i < pkt_count; i++) {
> > >     if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
> > >                              (void **)&val[i]) < 0) {
> > >     }
> > > }
> > > ```
> > >
> > > Based on my tests, &(my_packet->tag1) is what keeps me from achieving
> > > line rate on the fast path. I say this because if I hardcode tag1's
> > > value, I am able to achieve line rate. As a workaround, I tried using
> > > rte_prefetch0() and rte_prefetch_non_temporal() to prefetch 2 to 8
> > > my_packet(s) from the my_packet[] array, but nothing seems to boost
> > > the throughput.
> > >
> > > I tried to play with the flags in the rte_mempool_create() function call:
> > > -- MEMPOOL_F_NO_SPREAD gives me 8.4G throughput out of 10G
> > > -- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
> > > settles to ~8.5G after 20 or 30 seconds.
> > > -- NO FLAG gives 7.7G
> > >
> > > I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.
> > >
> > > Any help or pointers are highly appreciated.
> > >
> > > Thanks,
> > > Arvind
> >
> > Regards,
> > Keith
> >
>
> Regards,
> Keith
>
>

