[dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping

Guduri Prathyusha gprathyusha at caviumnetworks.com
Fri Nov 3 06:42:45 CET 2017


On Fri, Nov 03, 2017 at 11:21:43AM +0800, Jianbo.Liu at arm.com wrote:
> The 11/02/2017 15:52, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyusha at caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 3:34 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev at intel.com>
> > > Cc: dev at dpdk.org; Jianbo.Liu at arm.com; guduriprathyusha at gmail.com; Kantecki, Tomasz <tomasz.kantecki at intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > >
> > > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > > Hi,
> > > Hi
> > > >
> > > > > -----Original Message-----
> > > > > From: Guduri Prathyusha [mailto:gprathyusha at caviumnetworks.com]
> > > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > > To: Kantecki, Tomasz <tomasz.kantecki at intel.com>
> > > > > Cc: Jianbo.Liu at arm.com; guduriprathyusha at gmail.com; Ananyev, Konstantin <konstantin.ananyev at intel.com>; dev at dpdk.org; Guduri
> > > > > Prathyusha <gprathyusha at caviumnetworks.com>
> > > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > > > >
> > > > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
>
> May I ask the detail version about the gcc you are using?
>
Am using gcc provided by ubuntu rootfs

root at localhost:~/dpdk# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/5/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr --program-suffix=-5 --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-arm64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-arm64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-arm64
--with-arch-directory=aarch64
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror
--enable-checking=release --build=aarch64-linux-gnu
--host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4)

> > > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > > > pointers are assumed to be pointing to different memory and compiler
> > > > > reorders instructions that depend on pnum and pn. This breaks port
> > > > > grouping algorithm.
> > > > >
> > > > > This patch eliminates the usage of union and uses memcpy for copying
> > > > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > > > not call its library implementation but uses appropriate LD and ST
> > > > > instructions directly and hence no performance overhead.
> > > > >
> > > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > > Signed-off-by: Guduri Prathyusha <gprathyusha at caviumnetworks.com>
> > > > > ---
> > > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > > index 4bc161394..10a602a04 100644
> > > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >              uint16x8_t dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > > >
> > > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > > >         }
> > > > >
> > > > >         return lp;
> > > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > > index 831760f02..79a71d77e 100644
> > > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > > > >  static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >
> > > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > >
> > > > Could you explain a bit more here - which exactly instructions were reordered
> > > > and what kind of problems did it cause?
> > > > Specially on IA?
> > >
> > > This issue is observed on ARM since ARM gcc is more aggressive in
> > > reordering than x86 gcc.
> >
> > Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
> > Unless there is a reproducible problem with x86 -
> > my preference would be to keep that file intact.
> >
> > > In ARM when v != GRPMSK, the following
> > > instructions ordering is not guarenteed because of strict aliasing.
> > >
> > > lp[0] += gptbl[v].lpv;
> > > pnum->u64 = gptbl[v].pnum;
> > > pnum->u16[FWDSTEP] = 1;
> > > lp = pnum->u16 + gptbl[v].idx;
> >
> > Ok, so what in particular is reordered by the compiler:
> >
> >  lp[0] += gptbl[v].lpv; (1)
> >  pnum->u64 = gptbl[v].pnum; (2)
> >  pnum->u16[FWDSTEP] = 1;   (3)
> >  lp = pnum->u16 + gptbl[v].idx; (4)
> >
> > (2) and (3)?
> > If so I am not sure how it could be a problem:
> > they do stores to the different locations.
> > (1) and (4) as I can see shouldn't be reordered.
> > Anyway - if you think this a compiler reordering issue,
> > then adding rte_compiler_barrier() should fix the issue, right?
>
> Agree.
>
Yes, rte_compiler_barrier() solves the issue. I will spin a V2 if you
are ok with it.
> >
> > >
> > > That results in wrong lp[0] updation.
> > > memcpy in this case will avoid this problem.
> > >
> > > > In any case I don't think using rte_memcpy is a good thing to use here:
> > > > it is a huge inline function - way too much to copy just 64 bit variable.
> > >
> > > I agree that rte_memcpy is overhead in this case but how about using
> > > memcpy that will not use library implementation if the size is constant.
> > > memcpy with constant size uses built_in_memcpy that does not add
> > > performance overhead.
> >
> > On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
> > ib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >
> > >
> > > Thoughts?
> >
> > As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
> > If it does (still not sure how) - check would compiler barrier help here.
> > Konstantin
> >
>
> --
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


More information about the dev mailing list