[PATCH v11] eal/x86: optimize memcpy of small sizes

Thomas Monjalon thomas at monjalon.net
Mon Jun 1 15:38:25 CEST 2026


22/05/2026 00:42, Stephen Hemminger:
> On Thu, 21 May 2026 18:56:31 +0000
> Morten Brørup <mb at smartsharesystems.com> wrote:
> 
> > The implementation for copying up to 64 bytes does not depend on address
> > alignment with the size of the CPU's vector registers. Nonetheless, the
> > exact same code for copying up to 64 bytes was present in both the aligned
> > copy function and all the CPU vector register size specific variants of
> > the unaligned copy functions.
> > With this patch, the implementation for copying up to 64 bytes was
> > consolidated into one instance, located in the common copy function,
> > before checking alignment requirements.
> > This provides three benefits:
> > 1. No copy-paste in the source code.
> > 2. A performance gain for copying up to 64 bytes, because the
> > address alignment check is avoided in this case.
> > 3. Reduced instruction memory footprint, because the compiler only
> > generates one instance of the function for copying up to 64 bytes, instead
> > of two instances (one in the unaligned copy function, and one in the
> > aligned copy function).
> > 
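To make the consolidation described above concrete, here is a minimal sketch in C. It is not the code from the patch: the names sketch_copy_upto64(), sketch_memcpy() and SKETCH_ALIGN_MASK are hypothetical, fixed-size memcpy() calls stand in for the vector moves, and the large-size branches are placeholders. It only illustrates the control flow, where sizes up to 64 bytes are handled once, before any alignment test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ALIGN_MASK 0x0F /* assumed 16-byte vector alignment */

/* Hypothetical helper: copy n <= 64 bytes without looking at alignment.
 * Each size class copies the first and the last chunk; the two chunks
 * overlap whenever n is not exactly twice the chunk size. */
static inline void *
sketch_copy_upto64(void *restrict dst, const void *restrict src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(n <= 64);
	if (n >= 32) {
		memcpy(d, s, 32);
		memcpy(d + n - 32, s + n - 32, 32);
	} else if (n >= 16) {
		memcpy(d, s, 16);
		memcpy(d + n - 16, s + n - 16, 16);
	} else if (n >= 8) {
		memcpy(d, s, 8);
		memcpy(d + n - 8, s + n - 8, 8);
	} else if (n >= 4) {
		memcpy(d, s, 4);
		memcpy(d + n - 4, s + n - 4, 4);
	} else {
		/* 0 to 3 bytes */
		if (n & 2)
			memcpy(d, s, 2);
		if (n & 1)
			d[n - 1] = s[n - 1];
	}
	return dst;
}

/* Hypothetical common entry point: the small-size path is taken first,
 * so the alignment check below is never evaluated for n <= 64. */
static inline void *
sketch_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
	if (n <= 64)
		return sketch_copy_upto64(dst, src, n);

	if ((((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGN_MASK) == 0)
		return memcpy(dst, src, n); /* stand-in: aligned variant */
	return memcpy(dst, src, n);     /* stand-in: unaligned variant */
}
```

Since the small-size helper has a single call site in this sketch, the compiler only has one instance to generate, which is the footprint argument made in point 3.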
> > Furthermore, __rte_restrict was added to source and destination addresses.
> > 
> > Also, the missing implementation of rte_mov48() was added.
> > 
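As an illustration of the two points above, a fixed-size 48-byte move with restrict-qualified pointers could look roughly like the sketch below. This is an assumption for illustration only, not the rte_mov48() from the patch: sketch_mov16() and sketch_mov48() are hypothetical names, plain C `restrict` is used in place of __rte_restrict, and memcpy() of 16 bytes stands in for a vector load/store pair.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte vector move. */
static inline void
sketch_mov16(uint8_t *restrict dst, const uint8_t *restrict src)
{
	memcpy(dst, src, 16);
}

/* Hypothetical 48-byte move built from three 16-byte moves.
 * The restrict qualifiers promise the compiler that dst and src
 * do not alias, so it may reorder the loads and stores freely. */
static inline void
sketch_mov48(uint8_t *restrict dst, const uint8_t *restrict src)
{
	sketch_mov16(dst + 0 * 16, src + 0 * 16);
	sketch_mov16(dst + 1 * 16, src + 1 * 16);
	sketch_mov16(dst + 2 * 16, src + 2 * 16);
}
```

The caller must guarantee that the two 48-byte regions do not overlap, as with memcpy() itself.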
> > Until recently, some drivers required disabling stringop-overflow warnings
> > when using rte_memcpy().
> > For some strange reason, these warnings were disabled in the rte_memcpy
> > header file, instead of in the problematic drivers.
> > With series-38174 ("remove use of rte_memcpy from net/intel"), the
> > problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> > so the warning suppression is no longer required and was removed.
> > 
> > Regarding performance...
> > The memcpy performance test (cache-to-cache copy) shows:
> > Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> > Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> > Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> > Copying 64 bytes takes 4 cycles, versus 7 cycles before.
> > 
> > Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
> > 
> > Signed-off-by: Morten Brørup <mb at smartsharesystems.com>
> > Acked-by: Bruce Richardson <bruce.richardson at intel.com>
> > Acked-by: Konstantin Ananyev <konstantin.ananyev at huawei.com>
> 
> Here are the full, wordy reviews from all providers.
[...]
> Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0

What is the follow-up?
Do we target DPDK 26.07?
