[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Stephen Hemminger stephen at networkplumber.org
Tue Sep 11 21:12:54 CEST 2018


On Tue, 11 Sep 2018 13:39:24 -0500
Arvind Narayanan <webguru2688 at gmail.com> wrote:

> Stephen, thanks!
> 
> That is it! Not sure if there is any workaround.
> 
> So, essentially, what I am doing is -- core 0 gets a burst of my_packet(s)
> from its pre-allocated mempool, and then (bulk) enqueues them into an
> rte_ring. Core 1 then (bulk) dequeues from this ring, and when it accesses
> the data pointed to by the ring's element (i.e. my_packet->tag1), this
> memory access latency issue is seen. I cannot advance the prefetch any
> earlier. Is there any clever workaround (or hack) to overcome this issue,
> other than using the same core for all the functions? E.g., can I prefetch
> the packets on core 0 into core 1's cache (could be a dumb question!)?
> 
> Thanks,
> Arvind
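
A minimal sketch of the workaround usually suggested for this: x86 cannot
prefetch into another core's cache, but the consumer can prefetch the whole
dequeued burst before touching any element, so the misses overlap instead of
serializing. The names here -- struct my_packet, its tag1 field, xfer_ring,
process_tag() -- are hypothetical stand-ins for the structures described above.

```
#include <stdint.h>
#include <rte_ring.h>
#include <rte_prefetch.h>

#define BURST_SIZE 32

/* Hypothetical stand-in for the application's packet structure. */
struct my_packet {
	uint32_t tag1;
	/* ... rest of the application payload ... */
};

extern void process_tag(uint32_t tag);	/* hypothetical per-packet work */

static void
consumer_poll(struct rte_ring *xfer_ring)
{
	struct my_packet *pkts[BURST_SIZE];
	unsigned int n, i;

	/* Core 1: pull a whole burst off the ring... */
	n = rte_ring_dequeue_burst(xfer_ring, (void **)pkts, BURST_SIZE, NULL);

	/* ...then prefetch every element before the processing loop. */
	for (i = 0; i < n; i++)
		rte_prefetch0(&pkts[i]->tag1);

	/* By the time processing starts, tag1 is (hopefully) in cache. */
	for (i = 0; i < n; i++)
		process_tag(pkts[i]->tag1);
}
```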
> 
> On Tue, Sep 11, 2018 at 1:07 PM Stephen Hemminger <
> stephen at networkplumber.org> wrote:  
> 
> > On Tue, 11 Sep 2018 12:18:42 -0500
> > Arvind Narayanan <webguru2688 at gmail.com> wrote:
> >  
> > > If I don't do any processing, I easily get 10G. It is only when I access
> > > the tag that the throughput drops.
> > > What confuses me is that if I use the following snippet, it works at
> > > line rate.
> > >
> > > ```
> > > int temp_key = 1; // declared outside of the for loop
> > >
> > > for (i = 0; i < pkt_count; i++) {
> > >     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
> > >     }
> > > }
> > > ```
> > >
> > > But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
> > > a fall in throughput (which in a way confirms the issue is due to cache
> > > misses).
> >
> > Your packet data is not in cache.
> > Doing prefetch can help, but it is very timing sensitive. If the prefetch
> > is done before the data is available, it won't help. And if the prefetch
> > is done just before the data is used, there aren't enough cycles to get
> > it from memory into the cache.
> >
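
To make the timing point concrete: the usual DPDK pattern is to issue the
prefetch a fixed number of iterations ahead of the element being processed,
so each load has a few lookups' worth of work to hide behind. A sketch,
reusing rx_table, val, and pkt_count from the snippet above and assuming an
array pkts[] of my_packet pointers (as in the earlier sketch); the right
PREFETCH_OFFSET is workload-dependent:

```
#include <rte_hash.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 4	/* illustrative; tune for the workload */

static void
lookup_burst(const struct rte_hash *rx_table, struct my_packet **pkts,
	     unsigned int pkt_count, void **val)
{
	unsigned int i;

	/* Warm up: prefetch the first few tags before any lookup. */
	for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
		rte_prefetch0(&pkts[i]->tag1);

	for (i = 0; i < pkt_count; i++) {
		/* Issue the next prefetch while the current lookup runs. */
		if (i + PREFETCH_OFFSET < pkt_count)
			rte_prefetch0(&pkts[i + PREFETCH_OFFSET]->tag1);

		if (rte_hash_lookup_data(rx_table, &pkts[i]->tag1,
					 (void **)&val[i]) < 0) {
			/* miss handling, as in the snippet above */
		}
	}
}
```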

In my experience, if you want performance then don't pass packets between cores.
It is slightly less bad if the core that does the passing does not access the
packet. It is really bad if the handling core writes to the packet.

This is especially true for cores with greater cache distance (NUMA). If you
have to pass packets, use cores that are hyper-thread siblings, since they
share a cache.
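
One way to act on the NUMA point at init time is to check that the two
pipeline lcores sit on the same socket and allocate the transfer ring there.
A sketch; the lcore ids, ring name, and size below are placeholders:

```
#include <stdio.h>
#include <rte_lcore.h>
#include <rte_ring.h>

/* Sketch: verify NUMA placement of the producer/consumer pair and
 * create the ring on their shared socket. */
static struct rte_ring *
make_xfer_ring(unsigned int prod_lcore, unsigned int cons_lcore)
{
	unsigned int prod_socket = rte_lcore_to_socket_id(prod_lcore);
	unsigned int cons_socket = rte_lcore_to_socket_id(cons_lcore);

	if (prod_socket != cons_socket)
		printf("warning: lcores %u and %u are on different NUMA sockets\n",
		       prod_lcore, cons_lcore);

	/* Single-producer/single-consumer flags match the 2-core pipeline. */
	return rte_ring_create("xfer_ring", 1024, prod_socket,
			       RING_F_SP_ENQ | RING_F_SC_DEQ);
}
```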

