[PATCH] eal: non-temporal memcpy
Mattias Rönnblom
hofors at lysator.liu.se
Mon Oct 10 10:58:57 CEST 2022
On 2022-10-10 09:35, Morten Brørup wrote:
> Mattias, Konstantin, Honnappa, Stephen,
>
> In my patch for non-temporal memcpy, I have been aiming to use non-temporal stores as much as possible. E.g. copying 16 bytes to a 16-byte aligned address will be done using non-temporal store instructions.
>
> Now, I am seriously considering this alternative:
>
> Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
>
This is how I've done it in the past, in DPDK applications. That was
both to simplify (and potentially optimize) the code somewhat, and
because I had my doubts that there were any actual benefits from using
non-temporal stores for the beginning or the end of the memory block.
That latter reason, however, was pure conjecture. I think it would be
great if the Intel, ARM, AMD, IBM etc. DPDK developers could dig into the
manuals, or go find the appropriate CPU expert, to find out if that is true.
More specifically, my question is:
A) Consider a scenario where a core does a regular store against some
cache line, and then pretty much immediately does a non-temporal store
against a different address in the same cache line. How will this cache
line be treated?
B) Consider the same scenario, but where no regular stores preceded (or
followed) the non-temporal store, and the non-temporal stores performed
did not cover the entirety of the cache line.
Scenario A) would be common in the beginning of the copy, in case
there's a header preceding the data, and writing that header
non-temporally might be cumbersome. Scenario B) would be common at the end
of the copy. Both assume copies of memory blocks which are not
cache-line aligned.
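To make the "whole cache lines only" alternative concrete, here is a minimal sketch, assuming x86 with SSE2 and a 64-byte cache line (it falls back to plain memcpy() elsewhere). The name copy_nt_whole_lines() is hypothetical and is not taken from the patch. The head and tail written with normal stores are exactly where scenarios A) and B) would otherwise arise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Hypothetical helper, not a DPDK API: non-temporal stores are used
 * only for destination cache lines that are written in full. */
static void copy_nt_whole_lines(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: normal stores up to the first cache-line boundary of dst. */
    size_t head = (CACHE_LINE - ((uintptr_t)d % CACHE_LINE)) % CACHE_LINE;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Body: d is now cache-line aligned; stream complete lines. */
    while (len >= CACHE_LINE) {
#if defined(__SSE2__)
        for (size_t i = 0; i < CACHE_LINE; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);
        }
#else
        memcpy(d, s, CACHE_LINE); /* no NT stores in this fallback */
#endif
        d += CACHE_LINE;
        s += CACHE_LINE;
        len -= CACHE_LINE;
    }

    /* Tail: normal stores for the final, partial cache line. */
    memcpy(d, s, len);

#if defined(__SSE2__)
    /* Order the weakly-ordered NT stores before any later stores. */
    _mm_sfence();
#endif
}
```

A call such as copy_nt_whole_lines(dst + 3, src, 250) would write at most 61 bytes of head and a partial tail with normal stores, and everything in between with NT stores.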
> I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a pcap header) followed by packet data.
>
The application *could* use NT stores for the pcap header as well.
I haven't reviewed v3 of your patch, but in some earlier patch you did
not use the movnti instruction to make smaller (< 16 bytes) stores.
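For reference, a small sketch of what movnti-based stores smaller than 16 bytes could look like, assuming x86-64 with SSE2 and a naturally aligned destination. store_nt_small() is a hypothetical name, and the non-x86 branch is just a plain memcpy() placeholder, not a non-temporal store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: store 4 or 8 bytes non-temporally.
 * Assumes dst is naturally aligned for the store size. */
static void store_nt_small(void *dst, const void *src, size_t len)
{
    assert(len == 4 || len == 8);
    assert((uintptr_t)dst % len == 0);

#if defined(__SSE2__) && defined(__x86_64__)
    if (len == 8) {
        long long v;
        memcpy(&v, src, sizeof(v));
        _mm_stream_si64((long long *)dst, v); /* movnti, 64-bit */
    } else {
        int v;
        memcpy(&v, src, sizeof(v));
        _mm_stream_si32((int *)dst, v);       /* movnti, 32-bit */
    }
    _mm_sfence(); /* make the NT store ordered w.r.t. later stores */
#else
    memcpy(dst, src, len); /* placeholder: regular store */
#endif
}
```

With something like this, a pcap-style header made of 4- and 8-byte fields could in principle be written field by field without regular stores, at the cost of more instructions.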
> The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33 cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.).
>
> What do you think?
>
For large copies, which I'm guessing is what non-temporal stores are
usually used for, this is hair splitting. For DPDK applications, it
might well be at least somewhat relevant, because such an application
may make an enormous number of copies, each roughly the size of a packet.
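The cache line counts quoted above can be checked with a few lines of arithmetic. The sketch below assumes 64-byte cache lines, and that only the partially written first and last line of each copy are touched by normal stores. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Worst case for n scattered copies: an unaligned start and an
 * unaligned end each dirty one cache line. */
static unsigned partial_lines_scattered_worst(unsigned n)
{
    return 2 * n;
}

/* Count distinct cache lines that get normal stores when n packets of
 * len bytes each are copied back to back, beginning at byte offset
 * start from a cache-line aligned base. */
static unsigned partial_lines_packed(size_t start, size_t len, unsigned n)
{
    unsigned count = 0;
    size_t last_line = SIZE_MAX; /* no line counted yet */
    size_t pos = start;

    assert(len > 0);

    for (unsigned i = 0; i < n; i++) {
        size_t end = pos + len;

        /* Partial line at the start of this packet. */
        if (pos % CACHE_LINE != 0 && pos / CACHE_LINE != last_line) {
            last_line = pos / CACHE_LINE;
            count++;
        }
        /* Partial line at the end of this packet. */
        if (end % CACHE_LINE != 0 && end / CACHE_LINE != last_line) {
            last_line = end / CACHE_LINE;
            count++;
        }
        pos = end;
    }
    return count;
}
```

For 32 scattered packets this gives 64 lines, i.e. 64 * 64 = 4096 bytes (4 KiB). For 32 packets of 100 bytes packed from offset 1, it gives 33 lines, because the end of one packet and the start of the next share a line.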
If we had a rte_memcpy_ex() that only cared about copying whole cache
lines in an NT manner, the application could add a clflushopt (or the
equivalent) after the copy, flushing the beginning and end cache
lines of the destination buffer.
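A rough sketch of such a post-copy flush, assuming x86 with SSE2 and 64-byte cache lines. It uses _mm_clflush(), since that is available with plain SSE2. On CPUs with CLFLUSHOPT, _mm_clflushopt() followed by _mm_sfence() would be the natural replacement. flush_partial_edges() is a hypothetical name, and on other architectures this sketch only counts lines without flushing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Flush one cache line; a no-op placeholder on non-x86 builds. */
static void flush_line(uintptr_t line_addr)
{
#if defined(__SSE2__)
    _mm_clflush((const void *)line_addr);
#else
    (void)line_addr;
#endif
}

/* Hypothetical helper: flush the first and last cache line of
 * [dst, dst + len) if the buffer only partially covers them.
 * Returns the number of lines flushed (0, 1 or 2). */
static unsigned flush_partial_edges(const void *dst, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t start = (uintptr_t)dst;
    uintptr_t end = start + len;
    unsigned flushed = 0;

    assert(dst != NULL);
    if (len == 0)
        return 0;

    uintptr_t first = start & mask;
    uintptr_t last = (end - 1) & mask;
    int head_partial = (start % CACHE_LINE) != 0;
    int tail_partial = (end % CACHE_LINE) != 0;

    if (head_partial) {
        flush_line(first);
        flushed++;
    }
    /* Skip the tail if it is the line already flushed as the head. */
    if (tail_partial && !(head_partial && last == first)) {
        flush_line(last);
        flushed++;
    }
    return flushed;
}
```

Whether flushing these two lines is cheaper than leaving them in the cache would have to be measured.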
>
> PS: Non-temporal loads are easy to work with, so don't worry about that.
>
>
> Med venlig hilsen / Kind regards,
> -Morten Brørup