[dpdk-dev] [PATCH v3] arch/arm: optimization for memcpy on AArch64

Jerin Jacob jerin.jacob at caviumnetworks.com
Tue Dec 19 08:24:32 CET 2017


-----Original Message-----
> Date: Tue, 19 Dec 2017 05:33:19 +0000
> From: Herbert Guan <Herbert.Guan at arm.com>
> To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
> CC: "dev at dpdk.org" <dev at dpdk.org>, nd <nd at arm.com>
> Subject: RE: [PATCH v3] arch/arm: optimization for memcpy on AArch64
> 
> Jerin,
> 
> Thanks for review and comments.  Please find my feedbacks below inline.
> 
> > -----Original Message-----
> > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > Sent: Monday, December 18, 2017 15:44
> > To: Herbert Guan <Herbert.Guan at arm.com>
> > Cc: dev at dpdk.org
> > Subject: Re: [PATCH v3] arch/arm: optimization for memcpy on AArch64
> > 
> > -----Original Message-----
> > > Date: Mon, 18 Dec 2017 10:54:24 +0800
> > > From: Herbert Guan <herbert.guan at arm.com>
> > > To: dev at dpdk.org, jerin.jacob at caviumnetworks.com
> > > CC: Herbert Guan <herbert.guan at arm.com>
> > > Subject: [PATCH v3] arch/arm: optimization for memcpy on AArch64
> > > X-Mailer: git-send-email 1.8.3.1
> > >
> > > Signed-off-by: Herbert Guan <herbert.guan at arm.com>
> > > ---
> > >  config/common_armv8a_linuxapp                      |   6 +
> > >  .../common/include/arch/arm/rte_memcpy_64.h        | 292
> > +++++++++++++++++++++
> > >  2 files changed, 298 insertions(+)
> > >
> > > diff --git a/config/common_armv8a_linuxapp
> > > b/config/common_armv8a_linuxapp index 6732d1e..8f0cbed 100644
> > > --- a/config/common_armv8a_linuxapp
> > > +++ b/config/common_armv8a_linuxapp
> > > @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y  # to address
> > minimum
> > > DMA alignment across all arm64 implementations.
> > >  CONFIG_RTE_CACHE_LINE_SIZE=128
> > >
> > > +# Accelarate rte_memcpy.  Be sure to run unit test to determine the
> > 
> > Additional space before "Be". Rather than just mentioning the unit test,
> > mention the absolute test case name(memcpy_perf_autotest)
> > 
> > > +# best threshold in code.  Refer to notes in source file
> > 
> > Additional space before "Refer"
> 
> Fixed in new version.
> 
> > 
> > > +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> > #
> > > +info.
> > > +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> > > +
> > >  CONFIG_RTE_LIBRTE_FM10K_PMD=n
> > >  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> > >  CONFIG_RTE_LIBRTE_AVP_PMD=n
> > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > > b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > > index b80d8ba..1ea275d 100644
> > > --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > > @@ -42,6 +42,296 @@
> > >
> > >  #include "generic/rte_memcpy.h"
> > >
> > > +#ifdef RTE_ARCH_ARM64_MEMCPY
> > 
> > See the comment below at "(GCC_VERSION < 50400)" check
> > 
> > > +#include <rte_common.h>
> > > +#include <rte_branch_prediction.h>
> > > +
> > > +/*
> > > + * The memory copy performance differs on different AArch64 micro-
> > architectures.
> > > + * And the most recent glibc (e.g. 2.23 or later) can provide a
> > > +better memcpy()
> > > + * performance compared to old glibc versions. It's always suggested
> > > +to use a
> > > + * more recent glibc if possible, from which the entire system can get
> > benefit.
> > > + *
> > > + * This implementation improves memory copy on some aarch64
> > > +micro-architectures,
> > > + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is
> > > +disabled by
> > > + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate.
> > > +It's not
> > > + * always providing better performance than memcpy() so users need to
> > > +run unit
> > > + * test "memcpy_perf_autotest" and customize parameters in
> > > +customization section
> > > + * below for best performance.
> > > + *
> > > + * Compiler version will also impact the rte_memcpy() performance.
> > > +It's observed
> > > + * on some platforms and with the same code, GCC 7.2.0 compiled
> > > +binaries can
> > > + * provide better performance than GCC 4.8.5 compiled binaries.
> > > + */
> > > +
> > > +/**************************************
> > > + * Beginning of customization section
> > > +**************************************/
> > > +#define ALIGNMENT_MASK 0x0F
> > 
> > This symbol will be included in public rte_memcpy.h version for arm64 DPDK
> > build.
> > Please use RTE_ prefix to avoid multi
> > definition.(RTE_ARCH_ARM64_ALIGN_MASK ? or any shorter name)
> > 
> Changed to RTE_AARCH64_ALIGN_MASK in new version.

Since it is something to do with memcpy and arm64, I prefer,
RTE_ARM64_MEMCPY_ALIGN_MASK


> 
> > > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > > +/* Only src unalignment will be treaed as unaligned copy */ #define
> > > +IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > #else
> > > +/* Both dst and src unalignment will be treated as unaligned copy */
> > > +#define IS_UNALIGNED_COPY(dst, src) \
> > > +		(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> > #endif
> > > +
> > > +
> > > +/*
> > > + * If copy size is larger than threshold, memcpy() will be used.
> > > + * Run "memcpy_perf_autotest" to determine the proper threshold.
> > > + */
> > > +#define ALIGNED_THRESHOLD       ((size_t)(0xffffffff))
> > > +#define UNALIGNED_THRESHOLD     ((size_t)(0xffffffff))
> > 
> > Same as above comment.
> Added RTE_AARCH64_ prefix in new version.

Same as above.

> 
> > 
> > > +
> > > +/**************************************
> > > + * End of customization section
> > > + **************************************/
> > > +#ifdef RTE_TOOLCHAIN_GCC
> > > +#if (GCC_VERSION < 50400)
> > > +#warning "The GCC version is quite old, which may result in sub-optimal \
> > > +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> > > +be used."
> > 
> > Even though it is warning, based on where this file get included it will
> > generate error(see below)
> > How about, selecting optimized memcpy when RTE_ARCH_ARM64_MEMCPY
> > && if (GCC_VERSION >= 50400) ?
> > 
> Fully understand that.  While I'm not tending to make it 'silent'.  GCC 4.x is just
> quite old and may not provide best optimized code -- not only for DPDK app.  
> We can provide another option RTE_AARCH64_SKIP_GCC_VERSION_CHECK to allow
> skipping the GCC version check.  How do you think?

I prefer to reduce the options. But, No strong opinion on this as this
the RTE_ARCH_ARM64_MEMCPY option is by default disabled(ie. No risk).




More information about the dev mailing list