[RFC v2] non-temporal memcpy
Mattias Rönnblom
hofors at lysator.liu.se
Tue Aug 9 14:05:12 CEST 2022
On 2022-08-09 11:46, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors at lysator.liu.se]
>> Sent: Sunday, 7 August 2022 22.25
>>
>> On 2022-07-19 17:26, Morten Brørup wrote:
>>> This RFC proposes a set of functions optimized for non-temporal
>> memory copy.
>>>
>>> At this stage, I am asking for feedback on the concept.
>>>
>>> Applications sometimes copy data to another memory location, which is only
>> used
>>> much later.
>>> In this case, it is inefficient to pollute the data cache with the
>> copied
>>> data.
>>>
>>> An example use case (originating from a real life application):
>>> Copying filtered packets, or the first part of them, into a capture
>> buffer
>>> for offline analysis.
>>>
>>> The purpose of these functions is to achieve a performance gain by
>> not
>>> polluting the cache when copying data.
>>> Although the throughput may be improved by further optimization, I do
>> not
>>> consider throughput optimization relevant initially.
>>>
>>> The x86 non-temporal load instructions have 16 byte alignment
>>> requirements [1], while ARM non-temporal load instructions are
>> available with
>>> 4 byte alignment requirements [2].
>>> Both platforms offer non-temporal store instructions with 4 byte
>> alignment
>>> requirements.
>>>
>>
>> I don't think memcpy() functions should have alignment requirements.
>> That's not very practical, and violates the principle of least
>> surprise.
>
> I didn't make the CPUs with these alignment requirements.
>
> However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.
>
>>
>> Use normal memcpy() for the unaligned parts, and for the whole thing
>> for
>> small sizes (at least on x86).
>>
>
> I'm not going to plunge into some advanced vector programming, so I'm working on an implementation where misalignment is handled by using a bounce buffer (allocated on the stack, which is probably cache hot anyway).
>
>
I don't know about the NT load + NT store case, but for regular load + NT
store, this is trivial. The implementation I've used is 36
straightforward lines of code.
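
To illustrate the regular load + NT store approach, here is a minimal sketch for x86, assuming SSE2 is available. It is not the implementation referred to above. The name nt_store_memcpy and the 64-byte small-size threshold are hypothetical, chosen only for this example. The unaligned head and tail, and small copies as a whole, go through normal memcpy(), so the caller has no alignment requirements to meet.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical threshold: copies smaller than this use plain memcpy(). */
#define NT_MIN_SIZE 64

/* Hypothetical helper: regular (temporal) loads, non-temporal stores. */
static void *
nt_store_memcpy(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	uint8_t *d = dst;
	const uint8_t *s = src;

	if (len < NT_MIN_SIZE)
		return memcpy(dst, src, len);

	/* Head: regular copy until the destination is 16-byte aligned. */
	size_t head = (16 - ((uintptr_t)d & 15)) & 15;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Body: unaligned regular load, aligned non-temporal store. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}

	/* Tail: the remaining 0..15 bytes. */
	memcpy(d, s, len);

	/* Order the non-temporal stores before any later stores. */
	_mm_sfence();
	return dst;
#else
	/* Assumption: no SSE2, so fall back to a regular copy. */
	return memcpy(dst, src, len);
#endif
}
```

Only the destination needs to be aligned for the streaming store. The source is read with an unaligned load, which is what keeps this variant simple compared to the NT load + NT store case.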