[RFC v2] non-temporal memcpy

Konstantin Ananyev konstantin.v.ananyev at yandex.ru
Fri Jul 29 11:21:41 CEST 2022


28/07/2022 11:51, Morten Brørup wrote:
> From: Stanisław Kardach [mailto:kda at semihalf.com]
> Sent: Thursday, 28 July 2022 00.02
>> On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, <Honnappa.Nagarahalli at arm.com> wrote:
>>
>>>>>>> Yes, x86 needs 16B alignment for NT load/stores. But that's supposed
>>>>>>> to be arch specific limitation, that we probably want to hide, no?
>>>>>
>>>>> Correct. However, optional hints for optimization purposes will be
>>>>> available. And it is up to the architecture specific implementation to
>>>>> make the best use of these hints, or just ignore them.
>>>>>
>>>>>>> Inside the function can check alignment of both src and dst and
>>>>>>> decide should it use NT load/store instructions or just do normal copy.
>>>>>> IMO, the normal copy should not be done by this API under any
>>>>>> conditions. Why not let the application call memcpy/rte_memcpy
>>>>>> when the NT copy is not applicable? It helps the programmer to
>>>>>> understand and debug the issues much easier.
>>>>>
>>>>> Yes, the programmer must choose between normal memcpy() and
>>>>> non-temporal rte_memcpy_nt(). I am offering new functions, not modifying
>>>>> memcpy() or rte_memcpy().
>>>>>
>>>>> And rte_memcpy_nt() will silently fall back to normal memcpy() if
>>>>> non-temporal copying is unavailable, e.g. on POWER and RISC-V
>>>>> architectures, which don't have NT load/store instructions.
>>>> I am talking about a scenario where the application is being ported
>>>> between architectures. Not everyone knows about the capabilities of
>>>> the architecture. It is better to indicate upfront (ex: compilation
>>>> failures) that a certain feature is not supported on the target
>>>> architecture rather than the user having to discover through painful
>>>> debugging.
>>>
>>> I'm considering rte_memcpy_nt() a performance optimized variant of
>>> memcpy(), where the performance gain is less cache pollution. Thus, silent
>>> fallback to memcpy() should suffice.
>>>
>>> Other architecture differences also affect DPDK performance; the inability to
>>> perform non-temporal load/store is just one more on the (undocumented) list.
>>>
>>> Failing at build time if NT load/store is unavailable on the architecture would
>>> prevent the function from being used by other DPDK libraries, e.g. by the
>>> rte_pktmbuf_copy() function used by the pdump library.
>> The other libraries in DPDK need to provide separate NT versions, as the libraries need to cater for non-NT use cases as well, i.e. we cannot hide an NT copy under the rte_pktmbuf_copy() API; we need to have rte_pktmbuf_copy_nt().
> 
> Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new function. Some uses of rte_pktmbuf_copy() may benefit from having the copied data in cache.
> 
> But there is a ripple effect:
> 
> It is also my intention to improve the pdump and pcapng libraries by using rte_pktmbuf_copy_nt() instead of rte_pktmbuf_copy(). These would normally benefit from not polluting the cache.
> 
> So the underlying rte_memcpy_nt() function needs a fallback if the architecture doesn't support non-temporal memory copy, now that the pdump and pcapng libraries depend on it.
> 
> Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(), and an application instead fails to build when it uses rte_memcpy_nt() on an architecture without NT support, we would have to modify e.g. pdump_copy() like this:
> 
> + #ifdef RTE_CPUFLAG_xxx
>    p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
> + #else
>    p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> + #endif
> 
> Personally, I prefer the fallback inside rte_memcpy_nt(), rather than having to check for it everywhere.

+1 here.
If we are going to introduce rte_memcpy_nt(), I think it had better be
a 'best effort' approach: if it can do NT, great; if not,
just fall back to a normal copy.
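
To illustrate what 'best effort' could look like, here is a minimal sketch
for x86 only. Everything in it is an assumption made for illustration: the
name copy_nt_best_effort() is hypothetical and is not the proposed
rte_memcpy_nt(), it uses SSE2 streaming stores with ordinary loads rather
than NT loads, and it simply falls back to memcpy() whenever src, dst or len
is not a multiple of 16B, or when SSE2 is not compiled in.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper for illustration only; not a DPDK API. */
static inline void *
copy_nt_best_effort(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
	/*
	 * Take the streaming path only if both pointers and the length
	 * are multiples of 16 bytes; otherwise use the normal copy below.
	 */
	if ((((uintptr_t)dst | (uintptr_t)src | (uintptr_t)len) & 15) == 0) {
		__m128i *d = dst;
		const __m128i *s = src;
		size_t i;

		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
		/* Order the streaming stores before any later stores. */
		_mm_sfence();
		return dst;
	}
#endif
	/* Fallback: no NT support compiled in, or unaligned arguments. */
	return memcpy(dst, src, len);
}
```

The caller gets the same result either way; only the cache behaviour
differs, which is the point of hiding the fallback inside the function.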

> 
> The developer using the pdump library will not know if the fallback is inside rte_memcpy_nt() or outside using #ifdef. It is still hidden inside pdump_copy().
> 
>>
>>>
>>> I don't oppose your idea, I just don't have any idea how to reasonably
>>> implement it. So I'm trying to defend why it is not important.
>> I am suggesting that the applications could implement #ifdef depending on the architecture.
>> I assume that it would be a pre-processor flag defined (or not) on DPDK side and application doing #ifdef based on it?
>>
>> Another way to achieve this would be to use #warning directive (see [1]) inside DPDK when the generic fallback is taken.
>>
>> Also, isn't the argument about a memcpy_nt capability query a more general one, i.e. how should an application query DPDK's capabilities at run time or at compile time?
> 
> Good point! You just solved this part of the puzzle, Stanislaw:
> 
> The ability to perform non-temporal memory load/store is a CPU feature.
> 
> Applications that need to know if non-temporal memory access is available should check for the appropriate CPU feature flag, e.g. RTE_CPUFLAG_SSE4_1 on x86 architecture. This works both at runtime and at compile time.
> 
>>
>> [1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html
> 


