[dpdk-dev] [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare exchange

Phil Yang (Arm Technology China) Phil.Yang at arm.com
Fri Jul 19 15:56:27 CEST 2019


> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj at marvell.com>
> Sent: Friday, July 19, 2019 8:35 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang at arm.com>; dev at dpdk.org
> Cc: thomas at monjalon.net; hemant.agrawal at nxp.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli at arm.com>; Gavin Hu (Arm Technology
> China) <Gavin.Hu at arm.com>; nd <nd at arm.com>; gage.eads at intel.com; nd
> <nd at arm.com>
> Subject: RE: [EXT] [PATCH v3 1/3] eal/arm64: add 128-bit atomic compare
> exchange
> 
> > > > +#define RTE_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) !=
> > > > +__ATOMIC_RELEASE) #define RTE_HAS_RLS(mo) ((mo) ==
> > > > __ATOMIC_RELEASE || \
> > > > +			 (mo) == __ATOMIC_ACQ_REL || \
> > > > +			 (mo) == __ATOMIC_SEQ_CST)
> > > > +
> > > > +#define RTE_MO_LOAD(mo)  (RTE_HAS_ACQ((mo)) \
> > > > +		? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED) #define
> > > > RTE_MO_STORE(mo)
> > > > +(RTE_HAS_RLS((mo)) \
> > > > +		? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > > > +
> > >
> > > The one starts with RTE_ are public symbols, If it is generic enough,
> > > Move to common layer so that every architecturse can use.
> > > If you think, otherwise make it internal
> >
> > Let's keep it internal. I will remove the 'RTE_' tag.
> 
> Probably change to __HAS_ACQ to avoid collision(just in case)

OK.

> 
> > >
> > >
> > >
> > > > +#ifdef __ARM_FEATURE_ATOMICS
> > >
> > > This define is added in gcc 9.1 and I believe for clang it is not supported
> yet.
> > > So old gcc and clang this will be undefined.
> > > I think, With meson + native build, we  can find the presence of
> > > ATOMIC support by running a.out. Not sure about make and cross build
> case.
> > > I don't want block this feature because of this, IMO, We can add this
> > > code with  existing __ARM_FEATURE_ATOMICS scheme and later find a
> > > method to enhance it. But please check how to fix it.
> >
> > OK.
> 
> After thinking on this a bit, I think,  in order to support old gcc(< gcc 9.1) and
> clang,
> We can introduce a config option, where, by default it is disabled and enable
> In specific config(where we know, lse is supported) and meson config.
> 
> i.e
> #if defined(__ARM_FEATURE_ATOMICS) ||
> defined(RTE_ARM_FEATURE_ATOMICS)

Cool

> 
> 
> >
> > >
> > > > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)
> \
> > > > +static inline rte_int128_t                                                  \
> > > > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > > > +		rte_int128_t updated)                                               \
> > > > +{                                                                           \
> > > > +	/* caspX instructions register pair must start from even-numbered
> > > > +	 * register at operand 1.
> > > > +	 * So, specify registers for local variables here.
> > > > +	 */                                                                     \
> > > > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];                \
> > >
> > > Since direct x0 register used in the code and
> > > cas_op_name() and rte_atomic128_cmp_exchange() is inline function,
> > > Based on parent function load, we may corrupt x0 register aka
> >
> > Since x0/x1 and x2/x3 are used a lot and often contain live values.
> > Maybe to change them to some relatively less frequently used registers
> like
> > x14/x15 and x16/x17 might help for this case?
> > According to the PCS (Procedure Call Standard), x14-x17 are also temporary
> > registers.
> 
> X14-x17 are temporary registers but since
> cas_op_name() and rte_atomic128_cmp_exchange() are inline functions,
> Based on the parent function register usage, it _may_ corrupt.

Just checked how Linux Kernel does similar things:
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/atomic_lse.h#L19 

Same methods.

I will finish the benchmarking for the no_inline approach. If it has no significant performance loss, I think we can make it as no_inline.  

> 
> 
> >
> > > Break arm64 ABI. Not sure clobber list will help here or not?
> >
> > In my understanding, for the register variable, if it contains a live value in
> the
> > specified register, the compiler will move the live value into a free register.
> > Since x0~x3 are present in the input/output operands and x0/x1's value
> needs to
> > be restored to the variable 'old' as a return value.
> > So I didn't add them into the clobber list.
> 
> OK
> 
> >
> > > Making it as no_inline will help but not sure about the performance
> impact.
> > > May be you can check with compiler team.
> > >
> > > We burned our hands with this scheme, see
> > > 5b40ec6b966260e0ff66a8a2c689664f75d6a0e6 ("mempool/octeontx2: fix
> > > possible arm64 ABI break")
> > >
> > > Probably we can choose a scheme for rc2 and adjust as when we have
> > > complete clarity.
> > >
> > > >
> > > > +#if defined(RTE_ARCH_X86_64) || defined(RTE_ARCH_ARM64)
> > >
> > > There is nothing specific to x86 and arm64 here, Can we remove this
> #ifdef ?
> >
> > Without this constraint, it will break 32-bit x86 builds.
> > http://mails.dpdk.org/archives/test-report/2019-June/086586.html
> 
> OK . #ifdef RTE_ARCH_64 would help then.

OK.

> 
> >
> > >
> > > > +/**
> > > > + * 128-bit integer structure.
> > > > + */
> > > > +RTE_STD_C11
> > > > +typedef struct {
> > > > +	RTE_STD_C11
> > > > +	union {
> > > > +		uint64_t val[2];
> > > > +		__extension__ __int128 int128;
> 
> Instead of guarding  RTE_ARCH_64 on this complete structure,
> How about it only under
> #ifdef RTE_ARCH_64
> __extension__ __int128 int128;
> #endif
> So that it rte_int128_t will be available for 32bit as well.

Agree, it should be work. But I am not sure. 

Hi Gage,

How do you think about this? 

> 
> 
> > > > +	};
> > > > +} __rte_aligned(16) rte_int128_t;
> > > > +#endif
> > > > +
> > > >  #ifdef __DOXYGEN__


More information about the dev mailing list