[dpdk-dev] [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way barrier builtins

Gavin Hu (Arm Technology China) Gavin.Hu at arm.com
Thu Dec 27 10:02:42 CET 2018



> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj at marvell.com>
> Sent: Thursday, December 27, 2018 3:42 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>; dev at dpdk.org
> Cc: david.marchand at redhat.com; chaozhu at linux.vnet.ibm.com; nd
> <nd at arm.com>; bruce.richardson at intel.com; thomas at monjalon.net;
> hemant.agrawal at nxp.com; stephen at networkplumber.org; Honnappa
> Nagarahalli <Honnappa.Nagarahalli at arm.com>
> Subject: Re: [EXT] [PATCH v3 5/6] spinlock: reimplement with atomic one-way
> barrier builtins
> 
> On Thu, 2018-12-27 at 12:13 +0800, Gavin Hu wrote:
> -------------------------------------------------------------------
> > ---
> > The __sync builtin based implementation generates full memory
> > barriers
> > ('dmb ish') on Arm platforms. Using C11 atomic builtins to generate
> > one way
> > barriers.
> >
> > Here is the assembly code of __sync_compare_and_swap builtin.
> > __sync_bool_compare_and_swap(dst, exp, src);
> >    0x000000000090f1b0 <+16>:    e0 07 40 f9 ldr x0, [sp, #8]
> >    0x000000000090f1b4 <+20>:    e1 0f 40 79 ldrh    w1, [sp, #6]
> >    0x000000000090f1b8 <+24>:    e2 0b 40 79 ldrh    w2, [sp, #4]
> >    0x000000000090f1bc <+28>:    21 3c 00 12 and w1, w1, #0xffff
> >    0x000000000090f1c0 <+32>:    03 7c 5f 48 ldxrh   w3, [x0]
> >    0x000000000090f1c4 <+36>:    7f 00 01 6b cmp w3, w1
> >    0x000000000090f1c8 <+40>:    61 00 00 54 b.ne    0x90f1d4
> > <rte_atomic16_cmpset+52>  // b.any
> >    0x000000000090f1cc <+44>:    02 fc 04 48 stlxrh  w4, w2, [x0]
> >    0x000000000090f1d0 <+48>:    84 ff ff 35 cbnz    w4, 0x90f1c0
> > <rte_atomic16_cmpset+32>
> >    0x000000000090f1d4 <+52>:    bf 3b 03 d5 dmb ish
> >    0x000000000090f1d8 <+56>:    e0 17 9f 1a cset    w0, eq  // eq =
> > none
> >
> > The benchmarking results showed 3X performance gain on Cavium
> > ThunderX2 and
> > 13% on Qualcomm Falmon and 3.7% on 4-A72 Marvell macchiatobin.
> > Here is the example test result on TX2:
> >
> > *** spinlock_autotest without this patch ***
> > Core [123] Cost Time = 639822 us
> > Core [124] Cost Time = 633253 us
> > Core [125] Cost Time = 646030 us
> > Core [126] Cost Time = 643189 us
> > Core [127] Cost Time = 647039 us
> > Total Cost Time = 95433298 us
> >
> > *** spinlock_autotest with this patch ***
> > Core [123] Cost Time = 163615 us
> > Core [124] Cost Time = 166471 us
> > Core [125] Cost Time = 189044 us
> > Core [126] Cost Time = 195745 us
> > Core [127] Cost Time = 78423 us
> > Total Cost Time = 27339656 us
> >
> > Signed-off-by: Gavin Hu <gavin.hu at arm.com>
> > Reviewed-by: Phil Yang <phil.yang at arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl at arm.com>
> > Reviewed-by: Steve Capper <Steve.Capper at arm.com>
> > ---
> >  lib/librte_eal/common/include/generic/rte_spinlock.h | 18
> > +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > index c4c3fc31e..87ae7a4f1 100644
> > --- a/lib/librte_eal/common/include/generic/rte_spinlock.h
> > +++ b/lib/librte_eal/common/include/generic/rte_spinlock.h
> > @@ -61,9 +61,14 @@ rte_spinlock_lock(rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_lock(rte_spinlock_t *sl)
> >  {
> > -	while (__sync_lock_test_and_set(&sl->locked, 1))
> > -		while(sl->locked)
> > +	int exp = 0;
> > +
> > +	while (!__atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> {
> 
> How about remove explict exp = 0 and change to
> __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
> 
> i.e
> while (_atomic_test_and_set(flag, __ATOMIC_ACQUIRE))
> 
Will do it in v4.
> 
> > +		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
> >  			rte_pause();
> > +		exp = 0;
> 
> We can remove exp = 0 with above scheme.
> 
> > +	}
> >  }
> >  #endif
> >
> > @@ -80,7 +85,7 @@ rte_spinlock_unlock (rte_spinlock_t *sl);
> >  static inline void
> >  rte_spinlock_unlock (rte_spinlock_t *sl)
> >  {
> > -	__sync_lock_release(&sl->locked);
> > +	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
>  }
> >  #endif
> >
> > @@ -99,7 +104,10 @@ rte_spinlock_trylock (rte_spinlock_t *sl);
> >  static inline int
> >  rte_spinlock_trylock (rte_spinlock_t *sl)
> >  {
> > -	return __sync_lock_test_and_set(&sl->locked,1) == 0;
> > +	int exp = 0;
> > +	return __atomic_compare_exchange_n(&sl->locked, &exp, 1,
> > +				0, /* disallow spurious failure */
> > +				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> Here to remove explicit exp.
> 
> return (__atomic_test_and_set(flag, __ATOMIC_ACQUIRE) == 0)

Will do it in v4.

> >  }
> >  #endif
> >
> > @@ -113,7 +121,7 @@ rte_spinlock_trylock (rte_spinlock_t *sl)
> >   */
> >  static inline int rte_spinlock_is_locked (rte_spinlock_t *sl)
> >  {
> > -	return sl->locked;
> > +	return __atomic_load_n(&sl->locked, __ATOMIC_ACQUIRE);
> 
> __ATOMIC_RELAXED would be enough here. Right ?
Yes, it is enough for current DPDK uses, used for testing and assertions only.

For general applicability, we set acquire as concerned about it is used for reading protected data while the lock is not taken by anybody.
In this use case, Acquire will properly see all updates from before the lock was released, but this is still dangerous, as during the course, someone else might have taken the lock and changed the data.

Anyway, I will set Relaxed in v4 as the above use scenario was not recommended and not present in DPDK. 

> 
> >  }
> >
> >  /**


More information about the dev mailing list