[dpdk-dev] rte_memcpy - fence and stream

Bruce Richardson bruce.richardson at intel.com
Thu May 27 18:25:19 CEST 2021


On Thu, May 27, 2021 at 05:49:19PM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, 25 May 2021 11.20
> > 
> > On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > > I am looking at the source for rte_memcpy (this is a discussion only
> > > for x86-64)
> > >
> > > For one of the cases, when aligned correctly, it uses
> > >
> > > /**
> > >  * Copy 64 bytes from one location to another,
> > >  * locations should not overlap.
> > >  */
> > > static __rte_always_inline void
> > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > {
> > >         __m512i zmm0;
> > >
> > >         zmm0 = _mm512_loadu_si512((const void *)src);
> > >         _mm512_storeu_si512((void *)dst, zmm0);
> > > }
> > >
> > > I had some questions about this:
> > >
> 
> [snip]
> 
> > > 3. Why isn't the code using  stream variants,
> > > _mm512_stream_load_si512 and friends?
> > > It would not pollute the cache, so should be better - unless
> > > the required fence instructions cause a drop in performance?
> > >
> > Whether the stream variants perform better really depends on what you
> > are trying to measure. However, in the vast majority of cases a core
> > is making a copy of data to work on it, in which case having the data
> > not in cache is going to cause massive stalls on the next access to
> > that data as it's fetched from DRAM. Therefore, the best option is to
> > use regular instructions, meaning that the data is in local cache
> > afterwards, giving far better performance when the data is actually
> > used.
> > 
> 
> Good response, Bruce. And you are probably right about most use cases looking like you describe.
> 
> I'm certainly no expert on deep x86-64 optimization, but please let me think out loud here...
> 
> I can come up with a different scenario: One core is analyzing packet headers, and determines to copy some of the packets in their entirety (e.g. using rte_pktmbuf_copy) for another core to process later, so it enqueues the copies to a ring, which the other core dequeues from.
> 
> The first core doesn't care about the packet contents, and the second core will read the copy of the packet much later, because it needs to process the packets in front of the queue first.
> 
> Wouldn't this scenario benefit from a rte_memcpy variant that doesn't pollute the cache?
> 
> I know that rte_pktmbuf_clone might be better to use in the described scenario, but rte_pktmbuf_copy must be there for a reason - and I guess that some uses of rte_pktmbuf_copy could benefit from a non-cache polluting variant of rte_memcpy.
> 

That is indeed a possible scenario, but in that case we would probably want
to differentiate between different levels of cache. While we would not want
the copy to be polluting the local L1 or L2 cache of the core doing the
copy (unless the other core was a hyperthread), we probably would want any
copy to be present in any shared caches, rather than all the way in DRAM.
For Intel platforms and a scenario such as the one you describe, I would actually
recommend using the "ioat" driver copy accelerator if cache pollution is a
concern. In the case of the copies being done in HW, the local cache of a
core would not be polluted, but the copied data could still end up in LLC
due to DDIO.

In terms of memcpy functions, given the number of possible scenarios, in
the absence of compelling data showing a meaningful benefit for a common
scenario, I'd be wary about trying to provide specialized variants, since
we could end up with a lot of them to maintain and tune.

/Bruce
