[dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange

Phil Yang (Arm Technology China) Phil.Yang at arm.com
Wed Aug 14 12:24:50 CEST 2019


> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj at marvell.com>
> Sent: Wednesday, August 14, 2019 4:46 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang at arm.com>;
> thomas at monjalon.net; gage.eads at intel.com; dev at dpdk.org
> Cc: hemant.agrawal at nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli at arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu at arm.com>; nd <nd at arm.com>
> Subject: RE: [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange
> 
> > -----Original Message-----
> > From: Phil Yang <phil.yang at arm.com>
> > Sent: Wednesday, August 14, 2019 1:58 PM
> > To: thomas at monjalon.net; Jerin Jacob Kollanukkaran <jerinj at marvell.com>;
> > gage.eads at intel.com; dev at dpdk.org
> > Cc: hemant.agrawal at nxp.com; Honnappa.Nagarahalli at arm.com;
> > gavin.hu at arm.com; nd at arm.com
> > Subject: [EXT] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare
> > exchange
> > +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
> > +#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
> > +					  (mo) == __ATOMIC_SEQ_CST)
> > +
> > +#define __MO_LOAD(mo)  (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
> > +#define __MO_STORE(mo) (__HAS_RLS((mo)) ? __ATOMIC_RELEASE : __ATOMIC_RELAXED)
> > +
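
For readers following the thread, here is a minimal sketch of how the macros quoted above split one memory order into a load half and a store half. The DEMO_ names are hypothetical stand-ins (not part of the patch) so the snippet builds on its own with GCC or clang in C11 mode; the checks only restate what the macro logic implies.

```c
#include <assert.h>

/* Hypothetical DEMO_ copies of the quoted macros, renamed so this
 * snippet does not depend on the patch or on DPDK headers.
 */
#define DEMO_HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
#define DEMO_HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || \
			  (mo) == __ATOMIC_ACQ_REL || \
			  (mo) == __ATOMIC_SEQ_CST)

#define DEMO_MO_LOAD(mo)  (DEMO_HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED)
#define DEMO_MO_STORE(mo) (DEMO_HAS_RLS((mo)) ? __ATOMIC_RELEASE : __ATOMIC_RELAXED)

/* Compile-time checks of the resulting load/store orders. */
static_assert(DEMO_MO_LOAD(__ATOMIC_RELAXED) == __ATOMIC_RELAXED, "relaxed load");
static_assert(DEMO_MO_STORE(__ATOMIC_RELAXED) == __ATOMIC_RELAXED, "relaxed store");
static_assert(DEMO_MO_LOAD(__ATOMIC_ACQUIRE) == __ATOMIC_ACQUIRE, "acquire load");
static_assert(DEMO_MO_STORE(__ATOMIC_ACQUIRE) == __ATOMIC_RELAXED, "acquire store");
static_assert(DEMO_MO_LOAD(__ATOMIC_RELEASE) == __ATOMIC_RELAXED, "release load");
static_assert(DEMO_MO_STORE(__ATOMIC_RELEASE) == __ATOMIC_RELEASE, "release store");
static_assert(DEMO_MO_LOAD(__ATOMIC_SEQ_CST) == __ATOMIC_ACQUIRE, "seq_cst load");
static_assert(DEMO_MO_STORE(__ATOMIC_SEQ_CST) == __ATOMIC_RELEASE, "seq_cst store");
```

Note that __ATOMIC_CONSUME also counts as "has acquire" under this definition, since it is neither relaxed nor release.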
> > +#if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
> > +static __rte_noinline rte_int128_t                                          \
> 
> 
> Could you check the cost of marking it __rte_noinline?
> If it is costly, how about having two versions: one with __rte_noinline,
> to comply with the arm64 procedure call standard on
> old gcc and clang,
> and another without explicit register hardcoding, inlined, for the latest
> gcc?

Hi Jerin,

According to stack_lf_perf_autotest, marking it __rte_noinline adds no measurable overhead on ThunderX2 with GCC 8.3.
The 'Average cycles per object push/pop' numbers for the __rte_noinline and __rte_always_inline versions are nearly the same.

Test results:
###### Two NUMA Nodes ######
#### __rte_noinline ####

RTE>>stack_lf_perf_autotest
<snip>
### Testing using two NUMA nodes ###
Average cycles per object push/pop (bulk size: 8): 24.10
Average cycles per object push/pop (bulk size: 32): 6.85

### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 680.39
Average cycles per object push/pop (bulk size: 32): 146.38
Test OK

#### __rte_always_inline ####
RTE>>stack_lf_perf_autotest
<snip>
### Testing using two NUMA nodes ###
Average cycles per object push/pop (bulk size: 8): 24.29
Average cycles per object push/pop (bulk size: 32): 6.92

### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 683.92
Average cycles per object push/pop (bulk size: 32): 145.11
Test OK

###### Single NUMA Node ######
#### __rte_always_inline ####

RTE>>stack_lf_perf_autotest
<snip>
### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 582.92
Average cycles per object push/pop (bulk size: 32): 125.57
Test OK
#### __rte_noinline ####

RTE>>stack_lf_perf_autotest
<snip>
### Testing on all 18 lcores ###
Average cycles per object push/pop (bulk size: 8): 537.56
Average cycles per object push/pop (bulk size: 32): 122.98
Test OK
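
To put a number on "nearly the same", the figures above can be compared as relative differences. This is only a small sketch; demo_rel_diff_pct is a hypothetical helper, not something from the patch or the test suite.

```c
#include <assert.h>

/* Relative difference of `candidate` against `baseline`, in percent.
 * A negative result means the candidate spent fewer cycles.
 * Hypothetical helper for illustration only.
 */
static double
demo_rel_diff_pct(double candidate, double baseline)
{
	assert(baseline > 0.0);
	return (candidate - baseline) / baseline * 100.0;
}
```

For example, demo_rel_diff_pct(24.10, 24.29) compares the __rte_noinline result against the __rte_always_inline result for the two-NUMA-node run with bulk size 8; the result is negative and under 1% in magnitude.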

Thanks,
Phil Yang

> 
> 
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
> > +		rte_int128_t updated)                                       \
> > +{                                                                           \
> > +	/* caspX instructions register pair must start from even-numbered
> > +	 * register at operand 1.
> > +	 * So, specify registers for local variables here.
> > +	 */                                                                 \
> > +	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];            \
> > +	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];            \
> > +	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];        \
> > +	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];        \
> > +	asm volatile(                                                       \
> > +		op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"   \
> > +		: [old0] "+r" (x0),                                         \
> > +		[old1] "+r" (x1)                                            \
> > +		: [upd0] "r" (x2),                                          \
> > +		[upd1] "r" (x3),                                            \
> > +		[dst] "r" (dst)                                             \
> > +		: "memory");                                                \
> > +	old.val[0] = x0;                                                    \
> > +	old.val[1] = x1;                                                    \
> > +	return old;                                                         \
> > +}
> > +
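
As a reading aid for the helper quoted above, here is a plain-C, non-atomic reference model of its contract as I understand it: the function returns the value it observed at dst, and the caller decides success by comparing that value with what it expected. All demo_ names are hypothetical; demo_int128_t stands in for rte_int128_t, and this is not the patch's implementation (no CASP instruction, no atomicity).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_int128_t: two 64-bit halves. */
typedef struct {
	uint64_t val[2];
} demo_int128_t;

/* NON-ATOMIC model of the quoted helper: store `updated` only if *dst
 * equals `old`, and always return the value that was seen at *dst.
 */
static demo_int128_t
demo_cas_128_model(demo_int128_t *dst, demo_int128_t old,
		   demo_int128_t updated)
{
	assert(dst != NULL);
	demo_int128_t seen = *dst;

	if (seen.val[0] == old.val[0] && seen.val[1] == old.val[1])
		*dst = updated;
	return seen;
}

/* Caller-side wrapper (an assumption about how the helper is used):
 * returns 1 on success, 0 on failure, and writes the observed value
 * back to *exp so a retry loop can reuse it.
 */
static int
demo_cmp_exchange(demo_int128_t *dst, demo_int128_t *exp,
		  const demo_int128_t *src)
{
	demo_int128_t seen = demo_cas_128_model(dst, *exp, *src);
	int ok = (seen.val[0] == exp->val[0] && seen.val[1] == exp->val[1]);

	*exp = seen;
	return ok;
}
```

Writing the observed value back unconditionally keeps the wrapper branch-free; on success it is simply the same value the caller passed in.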

