[dpdk-dev] rte_memcpy - fence and stream

Morten Brørup mb at smartsharesystems.com
Thu May 27 17:49:19 CEST 2021


> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, 25 May 2021 11.20
> 
> On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > I am looking at the source for rte_memcpy (this is a discussion only
> > for x86-64)
> >
> > For one of the cases, when aligned correctly, it uses
> >
> > /**
> >  * Copy 64 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> > static __rte_always_inline void
> > rte_mov64(uint8_t *dst, const uint8_t *src)
> > {
> >         __m512i zmm0;
> >
> >         zmm0 = _mm512_loadu_si512((const void *)src);
> >         _mm512_storeu_si512((void *)dst, zmm0);
> > }
> >
> > I had some questions about this:
> >

[snip]

> > 3. Why isn't the code using  stream variants,
> > _mm512_stream_load_si512 and friends?
> > It would not pollute the cache, so should be better - unless
> > the required fence instructions cause a drop in performance?
> >
> Whether the stream variants perform better really depends on what you
> are trying to measure. However, in the vast majority of cases a core
> is making a copy of data to work on it, in which case having the data
> not in cache is going to cause massive stalls on the next access to
> that data as it's fetched from DRAM. Therefore, the best option is to
> use regular instructions, meaning that the data is in local cache
> afterwards, giving far better performance when the data is actually
> used.
> 

Good response, Bruce. And you are probably right that most use cases look like the one you describe.

I'm certainly no expert on deep x86-64 optimization, but please let me think out loud here...

I can come up with a different scenario: One core is analyzing packet headers and decides to copy some of the packets in their entirety (e.g. using rte_pktmbuf_copy) for another core to process later, so it enqueues the copies to a ring, which the other core dequeues from.

The first core doesn't care about the packet contents, and the second core will read the copy of the packet much later, because it first needs to process the packets ahead of it in the queue.

Wouldn't this scenario benefit from an rte_memcpy variant that doesn't pollute the cache?
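
Just to make it concrete, here is a rough, untested sketch of what such a variant could look like for the 64 byte case, using a non-temporal store. The names rte_mov64_nt and rte_mov_nt_fence are made up; the point is only to illustrate the idea:

#include <immintrin.h>
#include <stdint.h>

/*
 * Copy 64 bytes using a non-temporal store, so the destination
 * does not end up in the copying core's cache.
 * Note: _mm512_stream_si512() requires dst to be 64 byte aligned.
 */
static inline void
rte_mov64_nt(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_loadu_si512((const void *)src);
        _mm512_stream_si512((void *)dst, zmm0);
}

/*
 * Non-temporal stores are weakly ordered, so a store fence is
 * needed before the copy is published to the consumer core
 * (e.g. before the ring enqueue).
 */
static inline void
rte_mov_nt_fence(void)
{
        _mm_sfence();
}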

I know that rte_pktmbuf_clone might be better to use in the described scenario, but rte_pktmbuf_copy must be there for a reason - and I guess that some uses of rte_pktmbuf_copy could benefit from a non-cache-polluting variant of rte_memcpy.
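
For illustration, the producer side of the scenario could then look roughly like this. copy_and_defer() and rte_memcpy_nt() are made-up names; the mbuf and ring calls are the regular DPDK API:

#include <immintrin.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

static int
copy_and_defer(struct rte_mbuf *pkt, struct rte_mempool *pool,
                struct rte_ring *ring)
{
        struct rte_mbuf *copy = rte_pktmbuf_alloc(pool);
        char *dst;

        if (copy == NULL)
                return -1;

        /* Assumes a single-segment packet, to keep the sketch short. */
        dst = rte_pktmbuf_append(copy, rte_pktmbuf_data_len(pkt));
        if (dst == NULL) {
                rte_pktmbuf_free(copy);
                return -1;
        }

        /* Hypothetical non-temporal copy: the copied data bypasses
         * the producer's cache. */
        rte_memcpy_nt(dst, rte_pktmbuf_mtod(pkt, const char *),
                        rte_pktmbuf_data_len(pkt));

        /* Fence the weakly ordered non-temporal stores before
         * publishing the copy (unless the hypothetical
         * rte_memcpy_nt() already does this internally). */
        _mm_sfence();

        if (rte_ring_enqueue(ring, copy) != 0) {
                rte_pktmbuf_free(copy);
                return -1;
        }
        return 0;
}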

-Morten

