[PATCH] eal: non-temporal memcpy

Mattias Rönnblom hofors at lysator.liu.se
Mon Oct 10 10:58:57 CEST 2022


On 2022-10-10 09:35, Morten Brørup wrote:
> Mattias, Konstantin, Honnappa, Stephen,
> 
> In my patch for non-temporal memcpy, I have been aiming to use non-temporal stores as much as possible. E.g. copying 16 bytes to a 16-byte aligned address will be done using non-temporal store instructions.
> 
> Now, I am seriously considering this alternative:
> 
> Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
> 

This is how I've done it in the past, in DPDK applications. That was 
both to simplify (and potentially optimize) the code somewhat, and 
because I doubted there were any actual benefits from using 
non-temporal stores for the beginning or the end of the memory block.

That latter reason, however, was pure conjecture. I think it would be 
great if the Intel, ARM, AMD, IBM etc. DPDK developers could dig into the 
manuals, or go find the appropriate CPU expert, to find out whether it 
is true.
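
To make the approach concrete, here is a minimal sketch of "NT stores 
for complete cache lines only". The helper name memcpy_nt_full_lines() 
and the 64-byte CACHE_LINE_SIZE are assumptions made for illustration; 
this is not code from the patch or from DPDK, and on non-SSE2 targets 
it simply falls back to regular stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */

/* Hypothetical helper (not a DPDK API): copy len bytes, using
 * non-temporal stores only for complete destination cache lines. */
static void
memcpy_nt_full_lines(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: regular stores up to the first cache line boundary. */
    size_t head = (size_t)(-(uintptr_t)d & (CACHE_LINE_SIZE - 1));
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Body: d is now cache-line aligned; stream whole lines. */
    while (len >= CACHE_LINE_SIZE) {
#if defined(__SSE2__)
        for (size_t i = 0; i < CACHE_LINE_SIZE; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);
        }
#else
        memcpy(d, s, CACHE_LINE_SIZE); /* fallback: regular stores */
#endif
        d += CACHE_LINE_SIZE;
        s += CACHE_LINE_SIZE;
        len -= CACHE_LINE_SIZE;
    }

    /* Tail: regular stores for the last, partial cache line. */
    memcpy(d, s, len);

#if defined(__SSE2__)
    _mm_sfence(); /* order the NT stores before later stores */
#endif
}
```

Only the destination alignment matters for the streaming stores here; 
the source is read with unaligned loads.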

More specifically, my question is:

A) Consider a scenario where a core does a regular store against some 
cache line, and then pretty much immediately does a non-temporal store 
against a different address in the same cache line. How will this cache 
line be treated?

B) Consider the same scenario, but where no regular stores preceded (or 
followed) the non-temporal store, and the non-temporal stores performed 
did not cover the entirety of the cache line.

Scenario A) would be common at the beginning of the copy, in case 
there's a header preceding the data, and writing that header 
non-temporally might be cumbersome. Scenario B) would be common at the 
end of the copy. Both assume copies of memory blocks which are not 
cache-line aligned.
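
For clarity, the two access patterns can be written down as below. 
This is only a sketch of the store sequences in question; it says 
nothing about how a given CPU actually treats the cache line, which is 
exactly the open question. The function names, offsets, and the 
64-byte line size are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* 32-bit non-temporal store (movnti) where available,
 * regular store otherwise. */
static inline void
store32_nt(void *p, uint32_t v)
{
#if defined(__SSE2__)
    _mm_stream_si32((int *)p, (int)v);
#else
    memcpy(p, &v, sizeof(v));
#endif
}

static inline void
nt_fence(void)
{
#if defined(__SSE2__)
    _mm_sfence();
#endif
}

/* Scenario A: a regular store to a cache line, immediately followed
 * by an NT store to a different address in the same line. */
static void
scenario_a(uint8_t *line, uint32_t hdr, uint32_t payload)
{
    assert((uintptr_t)line % 64 == 0); /* assumed 64-byte line */
    memcpy(line, &hdr, sizeof(hdr));   /* regular store, offset 0 */
    store32_nt(line + 8, payload);     /* NT store, offset 8 */
    nt_fence();
}

/* Scenario B: only NT stores, covering just part of the line. */
static void
scenario_b(uint8_t *line, uint32_t payload)
{
    assert((uintptr_t)line % 64 == 0); /* assumed 64-byte line */
    store32_nt(line, payload);         /* bytes 0..3 */
    store32_nt(line + 4, payload);     /* bytes 4..7, rest untouched */
    nt_fence();
}
```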

> I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a pcap header) followed by packet data.
> 

The application *could* use NT stores for the pcap header as well.

I haven't reviewed v3 of your patch, but in some earlier patch you did 
not use the movnti instruction to make smaller (< 16 bytes) stores.
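
As a rough illustration of what I mean, a small copy could be done 
with 32-bit movnti stores along these lines. The helper name 
copy_small_nt() and its restrictions (4-byte aligned destination, 
length a multiple of 4 and below 16) are assumptions chosen to keep 
the sketch short; it is not taken from any version of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy a small block (len < 16, multiple of 4)
 * to a 4-byte aligned destination using 32-bit NT stores. */
static void
copy_small_nt(void *dst, const void *src, size_t len)
{
    assert(len < 16 && len % 4 == 0);
    assert((uintptr_t)dst % 4 == 0);

    for (size_t off = 0; off < len; off += 4) {
        uint32_t v;

        /* Regular (possibly unaligned) load from the source. */
        memcpy(&v, (const uint8_t *)src + off, sizeof(v));
#if defined(__SSE2__)
        /* movnti on x86 */
        _mm_stream_si32((int *)((uint8_t *)dst + off), (int)v);
#else
        /* Fallback: regular store. */
        memcpy((uint8_t *)dst + off, &v, sizeof(v));
#endif
    }
#if defined(__SSE2__)
    _mm_sfence(); /* order the NT stores before later stores */
#endif
}
```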


> The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33 cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.).
> 
> What do you think?
> 

For large copies, which I'm guessing is what non-temporal stores are 
usually used for, this is hair splitting. For DPDK applications, it 
might well be at least somewhat relevant, because such an application 
may make an enormous number of copies, each roughly the size of a packet.
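
The worst-case figures quoted above (64 lines, i.e. 4 KiB, for 
scattered packets and 33 lines for back-to-back packets) can be 
checked with a small counting sketch. It assumes 64-byte cache lines 
and counts the distinct lines that some packet covers only partially, 
i.e. the lines that would be written with regular stores. The function 
name and the example layout (100-byte packets starting at offset 1) 
are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */
#define MAX_LINES 1024     /* size of the simulated buffer, in lines */

/* Count distinct cache lines only partially covered by some packet,
 * for n packets of len bytes placed at first, first + stride, ... */
static unsigned
partial_lines_for_burst(unsigned n, size_t first, size_t len,
                        size_t stride)
{
    bool partial[MAX_LINES] = { false };
    unsigned count = 0;

    for (unsigned i = 0; i < n; i++) {
        size_t start = first + (size_t)i * stride;
        size_t end = start + len; /* one past the last byte */

        assert(end / CACHE_LINE_SIZE < MAX_LINES);
        /* Unaligned start: the first line is partially written. */
        if (start % CACHE_LINE_SIZE != 0)
            partial[start / CACHE_LINE_SIZE] = true;
        /* Unaligned end: the last line is partially written. */
        if (end % CACHE_LINE_SIZE != 0)
            partial[end / CACHE_LINE_SIZE] = true;
    }
    for (size_t l = 0; l < MAX_LINES; l++)
        count += partial[l];
    return count;
}
```

With 32 packets of 100 bytes, a stride of 256 bytes models scattered 
destinations and a stride of 100 bytes models a consecutive capture 
buffer.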

If we had a rte_memcpy_ex() that only cared about copying whole cache 
lines in an NT manner, the application could add a clflushopt (or the 
equivalent) after the copy, flushing the first and last cache lines 
of the destination buffer.
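
A sketch of that idea follows. Plain memcpy() stands in for the 
hypothetical rte_memcpy_ex(), and the sketch uses the SSE2 
_mm_clflush() intrinsic rather than clflushopt so that it builds 
without extra compiler flags; _mm_clflushopt() would be the drop-in 
alternative where the target supports it. The helper names and the 
64-byte line size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */

/* Flush the first and last cache lines of [dst, dst + len). */
static void
flush_edge_lines(const void *dst, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1);

    if (len == 0)
        return;

    uintptr_t first = (uintptr_t)dst & mask;
    uintptr_t last = ((uintptr_t)dst + len - 1) & mask;

#if defined(__SSE2__)
    _mm_clflush((const void *)first);
    if (last != first)
        _mm_clflush((const void *)last);
#else
    (void)first; /* no portable flush; nothing to do in this sketch */
    (void)last;
#endif
}

/* Hypothetical usage: copy, then evict the edge lines that were
 * written with regular stores. */
static void
copy_then_flush(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len); /* stand-in for the hypothetical NT copy */
    flush_edge_lines(dst, len);
}
```

The flush writes back and evicts the lines; the copied data itself is 
unaffected.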

> 
> PS: Non-temporal loads are easy to work with, so don't worry about that.
> 
> 
> Med venlig hilsen / Kind regards,
> -Morten Brørup

