[dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping

Jianbo.Liu at arm.com Jianbo.Liu at arm.com
Fri Nov 3 04:21:43 CET 2017


The 11/02/2017 15:52, Ananyev, Konstantin wrote:
>
>
> > -----Original Message-----
> > From: Guduri Prathyusha [mailto:gprathyusha at caviumnetworks.com]
> > Sent: Thursday, November 2, 2017 3:34 PM
> > To: Ananyev, Konstantin <konstantin.ananyev at intel.com>
> > Cc: dev at dpdk.org; Jianbo.Liu at arm.com; guduriprathyusha at gmail.com; Kantecki, Tomasz <tomasz.kantecki at intel.com>
> > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> >
> > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > Hi,
> > Hi
> > >
> > > > -----Original Message-----
> > > > From: Guduri Prathyusha [mailto:gprathyusha at caviumnetworks.com]
> > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > To: Kantecki, Tomasz <tomasz.kantecki at intel.com>
> > > > Cc: Jianbo.Liu at arm.com; guduriprathyusha at gmail.com; Ananyev, Konstantin <konstantin.ananyev at intel.com>; dev at dpdk.org; Guduri
> > > > Prathyusha <gprathyusha at caviumnetworks.com>
> > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > > >
> > > > With -fstrict-aliasing enabled by default from -O2, gcc > 5.x gives

May I ask which exact gcc version you are using?

> > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > pointers to the same chunk of memory, but with -fstrict-aliasing the
> > > > compiler assumes they point to different memory and reorders
> > > > instructions that depend on pnum and pn. This breaks the port
> > > > grouping algorithm.
> > > >
> > > > This patch eliminates the union and uses memcpy to copy
> > > > gptbl[v].pnum to pn. When memcpy is given a compile-time constant size,
> > > > the compiler does not call the library implementation but emits the
> > > > appropriate LD and ST instructions directly, so there is no performance
> > > > overhead.
> > > >
> > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > Signed-off-by: Guduri Prathyusha <gprathyusha at caviumnetworks.com>
> > > > ---
> > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > index 4bc161394..10a602a04 100644
> > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > >              uint16x8_t dp2)
> > > >  {
> > > > -       union {
> > > > -               uint16_t u16[FWDSTEP + 1];
> > > > -               uint64_t u64;
> > > > -       } *pnum = (void *)pn;
> > > > -
> > > >         int32_t v;
> > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > >
> > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > >
> > > >         /* if dest port value has changed. */
> > > >         if (v != GRPMSK) {
> > > > -               pnum->u64 = gptbl[v].pnum;
> > > > -               pnum->u16[FWDSTEP] = 1;
> > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > +               pn[FWDSTEP] = 1;
> > > > +               lp = pn + gptbl[v].idx;
> > > >         }
> > > >
> > > >         return lp;
> > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > index 831760f02..79a71d77e 100644
> > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > > >  static inline uint16_t *
> > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > >  {
> > > > -       union {
> > > > -               uint16_t u16[FWDSTEP + 1];
> > > > -               uint64_t u64;
> > > > -       } *pnum = (void *)pn;
> > > > -
> > > >         int32_t v;
> > > >
> > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > >
> > > >         /* if dest port value has changed. */
> > > >         if (v != GRPMSK) {
> > > > -               pnum->u64 = gptbl[v].pnum;
> > > > -               pnum->u16[FWDSTEP] = 1;
> > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > +               pn[FWDSTEP] = 1;
> > > > +               lp = pn + gptbl[v].idx;
> > >
> > > Could you explain a bit more here - exactly which instructions were
> > > reordered and what kind of problems did that cause?
> > > Especially on IA?
> >
> > This issue is observed on ARM since ARM gcc is more aggressive in
> > reordering than x86 gcc.
>
> Ok, then if x86 is not affected, why modify l3fwd_sse.h at all?
> Unless there is a reproducible problem with x86 -
> my preference would be to keep that file intact.
>
> > On ARM, when v != GRPMSK, the ordering of the following
> > instructions is not guaranteed because of strict aliasing.
> >
> > lp[0] += gptbl[v].lpv;
> > pnum->u64 = gptbl[v].pnum;
> > pnum->u16[FWDSTEP] = 1;
> > lp = pnum->u16 + gptbl[v].idx;
>
> Ok, so what in particular is reordered by the compiler:
>
>  lp[0] += gptbl[v].lpv; (1)
>  pnum->u64 = gptbl[v].pnum; (2)
>  pnum->u16[FWDSTEP] = 1;   (3)
>  lp = pnum->u16 + gptbl[v].idx; (4)
>
> (2) and (3)?
> If so, I am not sure how it could be a problem:
> they store to different locations.
> (1) and (4), as far as I can see, shouldn't be reordered.
> Anyway - if you think this is a compiler reordering issue,
> then adding rte_compiler_barrier() should fix the issue, right?

Agree.
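
To make the barrier suggestion concrete, here is a minimal sketch. It is not
the actual l3fwd code: regroup_with_barrier() and its parameters lpv, pnum_val
and idx are hypothetical stand-ins for the gptbl[v] fields, and
compiler_barrier() is a local stand-in for rte_compiler_barrier(), which as
far as I know is an empty asm statement with a "memory" clobber. The sketch
assumes pn is 8-byte aligned with 16 bytes available behind it. Note the
barrier only constrains ordering; it does not make the type-punned access
well-defined C.

```c
#include <assert.h>
#include <stdint.h>

#define FWDSTEP 4

/*
 * Local stand-in for rte_compiler_barrier() (assumed equivalent): emits no
 * instruction, only stops gcc/clang moving memory accesses across it.
 */
#define compiler_barrier() __asm__ volatile("" : : : "memory")

/*
 * Hypothetical, simplified form of the "dest port changed" branch of
 * port_groupx4(). Assumes pn is 8-byte aligned (sketch-only assumption).
 */
static inline uint16_t *
regroup_with_barrier(uint16_t *pn, uint16_t *lp, uint16_t lpv,
		     uint64_t pnum_val, uint16_t idx)
{
	union {
		uint16_t u16[FWDSTEP + 1];
		uint64_t u64;
	} *pnum = (void *)pn;

	assert(((uintptr_t)pn & 7) == 0);

	lp[0] += lpv;            /* (1) close the previous group's count */
	compiler_barrier();      /* keep (1) ahead of the stores below */
	pnum->u64 = pnum_val;    /* (2) */
	pnum->u16[FWDSTEP] = 1;  /* (3) */
	return pnum->u16 + idx;  /* (4) */
}
```

If the problem really is (1) moving past (2), a build with this barrier
should behave correctly; if it still fails, the cause is probably elsewhere.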

>
> >
> > That results in a wrong lp[0] update.
> > memcpy in this case will avoid this problem.
> >
> > > In any case I don't think rte_memcpy is a good thing to use here:
> > > it is a huge inline function - way too much to copy just a 64-bit variable.
> >
> > I agree that rte_memcpy is overhead in this case, but how about using
> > memcpy, which will not use the library implementation if the size is
> > constant? memcpy with a constant size is expanded via __builtin_memcpy,
> > which does not add performance overhead.
>
> On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
> ib/librte_eal/common/include/arch/x86/rte_memcpy.h
>
> >
> > Thoughts?
>
> As I said - if x86 is not affected - please keep l3fwd_sse.h intact.
> If it is (still not sure how) - check whether a compiler barrier would help here.
> Konstantin
>
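
For comparison, a minimal sketch of the plain memcpy() variant discussed
above, again not the actual patch: set_group() and its parameters are
hypothetical stand-ins for the gptbl[v] fields, and the idx bound is my
assumption. Because the copy goes through memcpy() rather than through a
pointer of a different type, there is no aliasing violation, and with a
constant 8-byte size compilers typically expand it inline instead of calling
the library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FWDSTEP 4

/*
 * Hypothetical helper: write the four per-packet group counts packed in
 * pnum_val into pn[0..3], start a new group at pn[FWDSTEP], and return
 * the new "last group" pointer.
 */
static inline uint16_t *
set_group(uint16_t pn[FWDSTEP + 1], uint64_t pnum_val, uint16_t idx)
{
	/* Assumption for this sketch: idx stays inside pn[]. */
	assert(idx <= FWDSTEP);

	/* Size is a compile-time constant (8 bytes). */
	memcpy(pn, &pnum_val, sizeof(pnum_val));
	pn[FWDSTEP] = 1;
	return pn + idx;
}
```

Checking the generated code on the affected ARM gcc would confirm whether the
copy really becomes a single load/store pair there.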
