[PATCH v4] eal: non-temporal memcpy
Thomas Monjalon
thomas at monjalon.net
Mon Jul 31 14:14:08 CEST 2023
Hello,
What's the status of this feature?
10/10/2022 08:46, Morten Brørup:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
>
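For readers unfamiliar with the idea, here is a minimal sketch of a flag-controlled copy. It is not the code from the patch: the names sketch_memcpy_flags() and SKETCH_F_DST_NT are hypothetical and exist only for this illustration. It assumes an x86 build with SSE2 and falls back to plain memcpy() elsewhere. Only the store side is non-temporal here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical flag name, chosen for this sketch only (not from the patch). */
#define SKETCH_F_DST_NT 0x1u /* hint: the destination will not be read soon */

/*
 * Copy len bytes from src to dst. With SKETCH_F_DST_NT set and a 16-byte
 * aligned destination, full 16-byte blocks are written with streaming
 * (non-temporal) stores; everything else goes through memcpy().
 */
static void *
sketch_memcpy_flags(void *dst, const void *src, size_t len, unsigned int flags)
{
	assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
	if ((flags & SKETCH_F_DST_NT) && ((uintptr_t)dst & 15) == 0) {
		uint8_t *d = dst;
		const uint8_t *s = src;

		while (len >= 16) {
			/* Regular unaligned load, 16-byte aligned streaming store. */
			__m128i v = _mm_loadu_si128((const __m128i *)s);
			_mm_stream_si128((__m128i *)d, v);
			d += 16;
			s += 16;
			len -= 16;
		}
		/* Order the streaming stores before any later stores. */
		_mm_sfence();
		/* Fewer than 16 trailing bytes remain; copy them normally. */
		memcpy(d, s, len);
		return dst;
	}
#else
	(void)flags; /* no non-temporal path in this sketch for other targets */
#endif
	return memcpy(dst, src, len);
}
```

A caller copying into a capture buffer would pass SKETCH_F_DST_NT; passing 0 gives an ordinary cached copy.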
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
>
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
>
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
>
> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
>
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
>
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
>
> Implementation notes:
>
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
>
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
>
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
>
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
>
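The x86 alignment rules quoted above could be checked at run time roughly as follows. This is only a sketch under the stated 16 byte (load) and 4, 8 or 16 byte (store) requirements; the helper names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if ptr is a multiple of align; align must be a power of two. */
static bool
sketch_is_aligned(const void *ptr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* x86 non-temporal loads need a 16-byte aligned source address. */
static bool
sketch_x86_nt_load_ok(const void *src)
{
	return sketch_is_aligned(src, 16);
}

/*
 * Widest x86 non-temporal store (16, 8 or 4 bytes) usable at dst,
 * or 0 if dst is not even 4-byte aligned. Note that the 8-byte store
 * is only available on 64-bit x86, which this sketch does not check.
 */
static size_t
sketch_x86_nt_store_width(const void *dst)
{
	if (sketch_is_aligned(dst, 16))
		return 16;
	if (sketch_is_aligned(dst, 8))
		return 8;
	if (sketch_is_aligned(dst, 4))
		return 4;
	return 0;
}
```

A copy routine could use the returned width to pick a store instruction, and fall back to regular stores when the result is 0.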
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
>
> v4
> * Also ignore the warning for clang in the workaround for
> _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
>
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
> use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
> rte_memcpy_nt_15_or_less_s16a() are not public,
> so remove __rte_internal from them. It also affects the documentation
> for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
> HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
> in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
> #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
>
> v2
> * The last 16 byte block of data, incl. any trailing bytes, was not
> copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
> method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
> so omit it when using CLANG.
>
> Signed-off-by: Morten Brørup <mb at smartsharesystems.com>
> ---
> app/test/test_memcpy.c | 65 +-
> app/test/test_memcpy_perf.c | 187 ++--
> lib/eal/include/generic/rte_memcpy.h | 127 +++
> lib/eal/x86/include/rte_memcpy.h | 1238 ++++++++++++++++++++++++++
> lib/mbuf/rte_mbuf.c | 77 ++
> lib/mbuf/rte_mbuf.h | 32 +
> lib/mbuf/version.map | 1 +
> lib/pcapng/rte_pcapng.c | 3 +-
> lib/pdump/rte_pdump.c | 6 +-
> 9 files changed, 1645 insertions(+), 91 deletions(-)