[PATCH v6 5/7] eal: provide option to use compiler memcpy instead of RTE
David Marchand
david.marchand at redhat.com
Fri Oct 4 09:52:49 CEST 2024
On Fri, Sep 20, 2024 at 12:36 PM Mattias Rönnblom
<mattias.ronnblom at ericsson.com> wrote:
>
> Provide build option to have functions in <rte_memcpy.h> delegate to
> the standard compiler/libc memcpy(), instead of using the various
> custom DPDK, handcrafted, per-architecture rte_memcpy()
> implementations.
>
> A new meson build option 'use_cc_memcpy' is added. By default, the
> traditional, custom DPDK rte_memcpy() implementation is used.
>
> The performance benefits of the custom DPDK rte_memcpy()
> implementations have been diminishing with every compiler release, and
> with current toolchains the use of a custom memcpy() implementation
> may even be a liability.
>
> An additional benefit of this change is that compilers and static
> analysis tools have an easier time detecting incorrect usage of
> rte_memcpy() (e.g., buffer overruns, or overlapping source and
> destination buffers).
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
> Acked-by: Morten Brørup <mb at smartsharesystems.com>
I like this patch and the direction we are taking: stop reinventing
memcpy and rely on the compiler to optimize it.
I have some comments on the implementation.
- When I split the headers in the early days of DPDK, the intention
with arch-specific headers in EAL was to have them include the generic
one, in all cases.
It seems that, over time, x86 rte_memcpy.h (at least) deviated from
this and stopped including generic/rte_memcpy.h...
So in this current patch, I expect every arch-specific header to first
include generic/rte_memcpy.h, regardless of any arch-specific define
coming from the configuration.
An additional note on this: ARM32 and ARM64 have their own
implementations in rte_memcpy_32.h resp. rte_memcpy_64.h, and I would
check RTE_USE_CC_MEMCPY in each of them rather than in the top-level
header, as ARM32 and ARM64 are like two different arches.
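For illustration, the layout described above could look roughly like
the following single-file sketch. The forward declaration stands in for
generic/rte_memcpy.h, the #ifdef block stands in for the body of one
arch header, and the byte loop and copy_roundtrip_ok() are hypothetical
placeholders, not code from the patch:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for generic/rte_memcpy.h: declared first, whatever the config. */
static inline void *rte_memcpy(void *dst, const void *src, size_t n);

/* Assumption: in a real build this would come from the build configuration. */
#define RTE_USE_CC_MEMCPY

/* Stand-in for the body of one arch header (e.g. rte_memcpy_64.h). */
#ifdef RTE_USE_CC_MEMCPY
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	/* Delegate to the compiler/libc implementation. */
	return memcpy(dst, src, n);
}
#else
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	/* Placeholder for a handcrafted copy; a byte loop stands in for it. */
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n-- > 0)
		*d++ = *s++;
	return dst;
}
#endif

/* Hypothetical caller: the API is the same in both configurations. */
int
copy_roundtrip_ok(void)
{
	const char src[] = "dpdk";
	char dst[sizeof(src)];
	void *ret = rte_memcpy(dst, src, sizeof(src));

	return ret == dst && memcmp(dst, src, sizeof(src)) == 0;
}
```

With this shape, callers see the generic declaration in every
configuration, and the choice of implementation stays local to each
arch header.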
- Now, looking at what was available for arches so far in DPDK:
* ARM was relying by default on compiler implementation, with specific
implementations for ARM32 and ARM64 available (see below for more
details) => possible values (default first) RTE_USE_CC_MEMCPY = true /
false
* loongarch was relying on compiler implementation, with no specific
implementations, => RTE_USE_CC_MEMCPY = true
* ppc was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
* riscv was relying on compiler implementation, with no specific
implementations, => RTE_USE_CC_MEMCPY = true
* x86 was relying on arch specific implementation, => RTE_USE_CC_MEMCPY = false
We can't get a unified default value for a meson option and keep
compat for all arches (except maybe by introducing an "auto" value).
Plus, disabling RTE_USE_CC_MEMCPY on loongarch and riscv makes no
sense, as there was never a specific implementation.
My suggestion is to drop the meson option and instead just set
RTE_USE_CC_MEMCPY in config/$arch/meson.build.
Testers / interested users may edit config/$arch/meson.build on their own.
- Additionally, ARM people have introduced arch-specific
implementation config options for memcpy in ARM32 resp. ARM64:
RTE_ARCH_ARM_NEON_MEMCPY resp. RTE_ARCH_ARM64_MEMCPY.
RTE_USE_CC_MEMCPY can replace those two options (we may keep some
compat in case someone relied on those defines for arm).
That removes the need for a RTE_CC_MEMCPY define.
More comments below:
[snip]
> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index 0ff70d9057..8be000294d 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -55,6 +55,26 @@ New Features
> Also, make sure to start the actual text at the margin.
> =======================================================
>
> +* **Compiler memcpy replaces custom DPDK implementation.**
> +
> + The memory copy functions of ``<rte_memcpy.h>`` now optionally
> + delegates to the standard memcpy() function, implemented by the
> + compiler and the C runtime (e.g., libc).
> +
> + In this release of DPDK, the handcrafted, per-architecture memory
> + copy implementations are still the default. Compiler memcpy is
> + enabled by setting the new ``use_cc_memcpy`` build option to true.
> +
> + The performance benefits of the custom DPDK rte_memcpy()
> + implementations have been diminishing with every new compiler
> + release, and with current toolchains the use of a custom memcpy()
> + implementation may even result in worse performance than the
> + standard memcpy().
> +
> + An additional benefit of using compiler memcpy is that compilers and
> + static analysis tools have an easier time detecting incorrect usage
> + of rte_memcpy() (e.g., buffer overruns, or overlapping source and
> + destination buffers).
As explained in the RN comments, an entry should use the form:
* **Add a title in the past tense with a full stop.**
Add a short 1-2 sentence description in the past tense.
The description should be enough to allow someone scanning
the release notes to understand the new feature.
It seems this note is a copy/paste of the commit log, please adjust
the title and make the description shorter.
>
> Removed Items
> -------------
[snip]
> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
> index e7f0f8eaa9..cfb0175bd2 100644
> --- a/lib/eal/include/generic/rte_memcpy.h
> +++ b/lib/eal/include/generic/rte_memcpy.h
> @@ -5,12 +5,19 @@
> #ifndef _RTE_MEMCPY_H_
> #define _RTE_MEMCPY_H_
>
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> /**
> * @file
> *
> * Functions for vectorised implementation of memcpy().
> */
>
> +#include <stdint.h>
> +#include <string.h>
I don't think those includes should go in an extern "C" { block.
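A minimal sketch of the ordering meant here, assuming a hypothetical
header (the EXAMPLE_MEMCPY_H guard and example_mov16() are made up for
illustration and only modelled on the rte_mov16() prototype): system
headers are included first, and the linkage block only wraps the
header's own declarations.

```c
#ifndef EXAMPLE_MEMCPY_H
#define EXAMPLE_MEMCPY_H

/* System headers come first, outside any linkage block. */
#include <stdint.h>
#include <string.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical 16-byte copy helper; the locations should not overlap. */
static inline void
example_mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

#ifdef __cplusplus
}
#endif

#endif /* EXAMPLE_MEMCPY_H */
```

Keeping the includes outside avoids pulling system header contents
into the C linkage block when the header is consumed from C++.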
> +
> /**
> * Copy 16 bytes from one location to another using optimised
> * instructions. The locations should not overlap.
> @@ -35,8 +42,6 @@ rte_mov16(uint8_t *dst, const uint8_t *src);
> static inline void
> rte_mov32(uint8_t *dst, const uint8_t *src);
>
> -#ifdef __DOXYGEN__
> -
This strange check was added as not all architectures provide
rte_mov48 (/me slaps Adrien and Thomas).
I think the CI reported no issue because of a problem in the next
patch, where all that is tested is the RTE_USE_CC_MEMCPY = true
combination.
Still, the overall goal of this work is to drop the whole rte_memcpy
thing in the future, so I think we can live with this #ifdef
__DOXYGEN__ nonsense hiding the absence of rte_mov48 in x86...
> /**
> * Copy 48 bytes from one location to another using optimised
> * instructions. The locations should not overlap.
> @@ -49,8 +54,6 @@ rte_mov32(uint8_t *dst, const uint8_t *src);
> static inline void
> rte_mov48(uint8_t *dst, const uint8_t *src);
>
> -#endif /* __DOXYGEN__ */
> -
> /**
> * Copy 64 bytes from one location to another using optimised
> * instructions. The locations should not overlap.
> @@ -87,8 +90,6 @@ rte_mov128(uint8_t *dst, const uint8_t *src);
> static inline void
> rte_mov256(uint8_t *dst, const uint8_t *src);
>
> -#ifdef __DOXYGEN__
> -
> /**
> * Copy bytes from one location to another. The locations must not overlap.
> *
> @@ -111,6 +112,52 @@ rte_mov256(uint8_t *dst, const uint8_t *src);
> static void *
> rte_memcpy(void *dst, const void *src, size_t n);
>
> -#endif /* __DOXYGEN__ */
Removing this __DOXYGEN__ check here should be ok.
CI will tell us.
> diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
> index 52d2f8e969..09c2fe2485 100644
> --- a/lib/eal/x86/include/meson.build
> +++ b/lib/eal/x86/include/meson.build
> @@ -16,6 +16,7 @@ arch_headers = files(
> 'rte_spinlock.h',
> 'rte_vect.h',
> )
> +
Unrelated change.
> arch_indirect_headers = files(
> 'rte_atomic_32.h',
> 'rte_atomic_64.h',
--
David Marchand