[PATCH] eal: non-temporal memcpy
Bruce Richardson
bruce.richardson at intel.com
Mon Oct 10 11:57:52 CEST 2022
On Mon, Oct 10, 2022 at 10:58:57AM +0200, Mattias Rönnblom wrote:
> On 2022-10-10 09:35, Morten Brørup wrote:
> > Mattias, Konstantin, Honnappa, Stephen,
> >
> > In my patch for non-temporal memcpy, I have been aiming to use non-temporal stores as much as possible. E.g. copying 16 bytes to a 16-byte aligned address will be done using non-temporal store instructions.
> >
> > Now, I am seriously considering this alternative:
> >
> > Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
> >
>
> This is how I've done it in the past, in DPDK applications. That was both to
> simplify (and potentially optimize) the code somewhat, and because I doubted
> there was any actual benefit from using non-temporal stores for the
> beginning or the end of the memory block.
>
> That latter reason, however, was pure conjecture. I think it would be great
> if Intel, ARM, AMD, IBM etc. DPDK developers could dig in the manuals or go
> find the appropriate CPU expert, to find out if that is true.
>
> More specifically, my question is:
>
> A) Consider a scenario where a core does a regular store against some cache
> line, and then pretty much immediately does a non-temporal store against a
> different address in the same cache line. How will this cache line be
> treated?
>
> B) Consider the same scenario, but where no regular stores preceded (or
> followed) the non-temporal store, and the non-temporal stores performed did
> not cover the entirety of the cache line.
>
The best reference I am aware of on this for Intel CPUs is section
10.4.6.2 in Vol 1 of the Software Developer's Manual [1].
The part relevant to your scenarios above is:
"If a program specifies a non-temporal store with one of these
instructions and the memory type of the destination region is write back
(WB), write through (WT), or write combining (WC), the processor will do
the following:
• If the memory location being written to is present in the cache
  hierarchy, the data in the caches is evicted.
• The non-temporal data is written to memory with WC semantics"
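
For illustration only, here is a minimal sketch of the alternative discussed
above: regular stores for the partial cache lines at the start and end of
the destination, non-temporal stores for the complete cache lines in
between. This is not the code from the patch. The function name
copy_nt_full_lines and the 64-byte CACHE_LINE value are assumptions made for
the example; on builds without SSE2 it falls back to plain memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical helper, not a DPDK API. */
static void copy_nt_full_lines(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: regular stores up to the first cache line boundary of dst. */
    size_t head = (CACHE_LINE - ((uintptr_t)d % CACHE_LINE)) % CACHE_LINE;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Body: complete destination cache lines only. */
    while (len >= CACHE_LINE) {
#if defined(__SSE2__)
        for (size_t off = 0; off < CACHE_LINE; off += 16) {
            /* src may be unaligned; dst is cache line aligned here. */
            __m128i v = _mm_loadu_si128((const __m128i *)(s + off));
            _mm_stream_si128((__m128i *)(d + off), v);
        }
#else
        memcpy(d, s, CACHE_LINE); /* fallback: regular stores */
#endif
        d += CACHE_LINE;
        s += CACHE_LINE;
        len -= CACHE_LINE;
    }

    /* Tail: regular stores for the trailing partial cache line. */
    memcpy(d, s, len);

#if defined(__SSE2__)
    /* Order the weakly-ordered streaming stores before later stores. */
    _mm_sfence();
#endif
}
```

With this split, a non-temporal store never targets a cache line that the
same copy also touches with a regular store, so scenarios A and B do not
arise within a single copy.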
Hope this helps a little.
Regards,
/Bruce
[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf#G11.44032