[dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Thu May 14 03:02:58 CEST 2020


<snip>
> >
> > > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > > > >
> > > > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > > > other-multi-copy atomicity memory model.
> > > > > >
> > > > > > Armv8-a memory model has been strengthened to require
> > > > > > other-multi-copy atomicity. This property requires memory
> > > > > > accesses from an observer to become visible to all other
> > > > > > observers simultaneously [3]. This means
> > > > > >
> > > > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > > > >    visible to all CPUs
> > > > > > b) A write that is visible to all CPUs is also visible to all other
> > > > > >    observers in the shareability domain
> > > > > >
> > > > > > This allows for using cheaper DMB instructions in the place of
> > > > > > DSB for devices that are visible to all CPUs (i.e. devices that DPDK
> caters to).
> > > > > >
> > > > > > Please refer to [1], [2] and [3] for more information.
> > > > > >
> > > > > > [1]
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
> > > > > > .git /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > <honnappa.nagarahalli at arm.com>
> > > > > > ---
> > > > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > index 7b7099cdc..e406411bb 100644
> > > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > @@ -19,11 +19,11 @@ extern "C" {  #include <rte_compat.h>
> > > > > > #include <rte_debug.h>
> > > > > >
> > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > > > >
> > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > > >
> > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > > >
> > > > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > > > >
> > > > > > @@ -37,9 +37,9 @@ extern "C" {
> > > > > >
> > > > > >  #define rte_io_rmb() rte_rmb()
> > > > > >
> > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : :
> > > > > > "memory")
> > > > > > +#define rte_cio_wmb() rte_wmb()
> > > > > >
> > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : :
> > > > > > "memory")
> > > > > > +#define rte_cio_rmb() rte_rmb()
> > > > > >
> > > > > >  /*------------------------ 128 bit atomic operations
> > > > > > -------------------------*/
> > > > > >
> > > > > > --
> > > > > > 2.17.1
> > > > >
> > > > > This change showed about 7% performance gain in testpmd single
> > > > > core
> > > > NDR test.
> > > >
> > > > I am trying to understand this patch wrt DPDK current usage model?
> > > >
> > > > 1)  Is performance improvement due to the fact that the PMD that
> > > > you are using it for testing suppose to use existing rte_cio_* but
> > > > it was using rte_[rw]mb?
> > No, it is supposed to use rte_[rw]mb for x86.
> 
> Why drivers using rte_[rw]in fastpath, IMO, rte_io_[rw]b and rte_cio_[rw]b
> created for this pupose.
> 
> But I understand, in x86 it is mapped to rte_compiler_barrier(). Is it correct
> from x86 PoV?
> @Ananyev, Konstantin @Richardson, Bruce ?
> 
> For x86:
> #define rte_io_wmb() rte_compiler_barrier() #define rte_io_rmb()
> rte_compiler_barrier() #define rte_cio_wmb() rte_compiler_barrier() #define
> rte_cio_rmb() rte_compiler_barrier()
We need a barrier API with 'DMB OSH*' for Arm and '*fence' for x86. My understanding is, '*fence' is required when WC memory is used in x86.

Also, from Arm architecture perspective, effectively we are saying that 'DSB' is not required for portable drivers.

> 
> 
> >
> > >
> > > This is part of the reason. There are also cases where rte_io_* was
> > > used and can be relaxed.
> > > Such as: http://patches.dpdk.org/patch/68162/
> > >
> > > > 2) In my understanding :
> > > > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > > > b) CPU to DMA/Device barrier requirements are addressed by
> > > > rte_cio_*
> > > > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> > > >
> > > > If (c) is true then we are violating the DPDK spec with change. Right?
> > No, we are not. Essentially, due to the other-multi-copy atomicity behavior
> of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).
> 
> Yeah. Probably from userspace POV it should be OK to use "DMB OSH*" to
> have the barrier between 4 of them?
> 
> 1) Global memory (BSS and Data sections), Not mapped as a hugepage.
> 2) Hugepage memory
> 3) IOVA memory
> 4) PCI register read/write
> 
> Do we need to worry about anything else which is specific to DSB?
> example, TLB related flush etc.
Yes, things like TLB flush or self modifying code still need DSB. But, my understanding is we do not have such code in DPDK and such code will be platform specific.

> 
> If we are talking this path then rte_cio_[rw]mb() has no meaning in DPDK as
> an abstraction as it was created for arm64 for this specific purpose.
> If we can meet all DPDK usecse with DMB OSH then probably we can
> deprecate rte_cio_wmb to avoid confusion.
Agree, rte_cio_*mb is confusing to me. We could deprecate those.

I see Octeon TX/TX2 drivers using rte_*mb. Do you see any issues with this change in those drivers?

This is a very fundamental change, we need more feedback from others working with Arm platforms.

> 
> >
> > >
> > > Developers are still required to use correct barrier APIs for different use
> cases.
> > > I think this change mitigates performance penalty when non optimal
> > > barrier is used.
> > >
> > > > This change will not be required if fastpath (CPU to Device) is
> > > > using
> > > rte_cio_*.
> > > > Right?
> > Yes. It is required when the fastpath uses rte_[rw]mb.
> >
> > >
> > > See 1). Correct usage of rte_cio_* is not the whole.
> > > For some other use cases, such as barrier between accesses of
> > > different memory types, we can also use lighter barrier 'dmb'.
> > >
> > > >
> > > >
> > > >
> > > > > Tested-by: Ruifeng Wang <ruifeng.wang at arm.com>
> > > > >
> >


More information about the dev mailing list