[dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic compare exchange
Phil Yang (Arm Technology China)
Phil.Yang at arm.com
Tue Oct 15 13:32:30 CEST 2019
Hi David,
Thanks for your comments. I have addressed most of them in v10. Please review it.
Some comments inline.
> -----Original Message-----
> From: David Marchand <david.marchand at redhat.com>
> Sent: Monday, October 14, 2019 11:44 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang at arm.com>
> Cc: thomas at monjalon.net; jerinj at marvell.com; Gage Eads
> <gage.eads at intel.com>; dev <dev at dpdk.org>; hemant.agrawal at nxp.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>; Gavin Hu (Arm
> Technology China) <Gavin.Hu at arm.com>; nd <nd at arm.com>
> Subject: Re: [dpdk-dev] [PATCH v9 1/3] eal/arm64: add 128-bit atomic
> compare exchange
>
> On Wed, Aug 14, 2019 at 10:29 AM Phil Yang <phil.yang at arm.com> wrote:
> >
> > Add 128-bit atomic compare exchange on aarch64.
>
> A bit short, seeing the complexity of the code and the additional
> RTE_ARM_FEATURE_ATOMICS config flag.
Updated in v10.
<snip>
> >
> > +/*------------------------ 128 bit atomic operations -------------------------*/
> > +
> > +#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) !=
> __ATOMIC_RELEASE)
> > +#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) ==
> __ATOMIC_ACQ_REL || \
> > + (mo) == __ATOMIC_SEQ_CST)
> > +
> > +#define __MO_LOAD(mo) (__HAS_ACQ((mo)) ? __ATOMIC_ACQUIRE :
> __ATOMIC_RELAXED)
> > +#define __MO_STORE(mo) (__HAS_RLS((mo)) ? __ATOMIC_RELEASE :
> __ATOMIC_RELAXED)
>
> Those 4 first macros only make sense when LSE is not available (see below
> [1]).
> Besides, they are used only once, why not directly use those
> conditions where needed?
Agree. In v10 I removed __MO_LOAD and __MO_STORE, and kept __HAS_ACQ and __HAS_RLS under the non-LSE branch.
I think they make the code easier to read.
>
>
> > +
> > +#if defined(__ARM_FEATURE_ATOMICS) ||
> defined(RTE_ARM_FEATURE_ATOMICS)
> > +#define __ATOMIC128_CAS_OP(cas_op_name, op_string) \
> > +static __rte_noinline rte_int128_t \
> > +cas_op_name(rte_int128_t *dst, rte_int128_t old, \
> > + rte_int128_t updated) \
> > +{ \
> > + /* caspX instructions register pair must start from even-numbered
> > + * register at operand 1.
> > + * So, specify registers for local variables here.
> > + */ \
> > + register uint64_t x0 __asm("x0") = (uint64_t)old.val[0]; \
> > + register uint64_t x1 __asm("x1") = (uint64_t)old.val[1]; \
> > + register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0]; \
> > + register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1]; \
> > + asm volatile( \
> > + op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]" \
> > + : [old0] "+r" (x0), \
> > + [old1] "+r" (x1) \
> > + : [upd0] "r" (x2), \
> > + [upd1] "r" (x3), \
> > + [dst] "r" (dst) \
> > + : "memory"); \
> > + old.val[0] = x0; \
> > + old.val[1] = x1; \
> > + return old; \
> > +}
> > +
> > +__ATOMIC128_CAS_OP(__rte_cas_relaxed, "casp")
> > +__ATOMIC128_CAS_OP(__rte_cas_acquire, "caspa")
> > +__ATOMIC128_CAS_OP(__rte_cas_release, "caspl")
> > +__ATOMIC128_CAS_OP(__rte_cas_acq_rel, "caspal")
>
> If LSE is available, we expose __rte_cas_XX (explicitely) *non*
> inlined functions, while without LSE, we expose inlined __rte_ldr_XX
> and __rte_stx_XX functions.
> So we have a first disparity with non-inlined vs inlined functions
> depending on a #ifdef.
> Then, we have a second disparity with two sets of "apis" depending on
> this #ifdef.
>
> And we expose those sets with a rte_ prefix, meaning people will try
> to use them, but those are not part of a public api.
>
> Can't we do without them ? (see below [2] for a proposal with ldr/stx,
> cas should be the same)
No, it doesn't work here, because the store's status value has to be checked in the loop condition at the end of the do/while loop for these macros.
>
>
> > +#else
> > +#define __ATOMIC128_LDX_OP(ldx_op_name, op_string) \
> > +static inline rte_int128_t \
> > +ldx_op_name(const rte_int128_t *src) \
> > +{ \
> > + rte_int128_t ret; \
> > + asm volatile( \
> > + op_string " %0, %1, %2" \
> > + : "=&r" (ret.val[0]), \
> > + "=&r" (ret.val[1]) \
> > + : "Q" (src->val[0]) \
> > + : "memory"); \
> > + return ret; \
> > +}
> > +
> > +__ATOMIC128_LDX_OP(__rte_ldx_relaxed, "ldxp")
> > +__ATOMIC128_LDX_OP(__rte_ldx_acquire, "ldaxp")
> > +
> > +#define __ATOMIC128_STX_OP(stx_op_name, op_string) \
> > +static inline uint32_t \
> > +stx_op_name(rte_int128_t *dst, const rte_int128_t src) \
> > +{ \
> > + uint32_t ret; \
> > + asm volatile( \
> > + op_string " %w0, %1, %2, %3" \
> > + : "=&r" (ret) \
> > + : "r" (src.val[0]), \
> > + "r" (src.val[1]), \
> > + "Q" (dst->val[0]) \
> > + : "memory"); \
> > + /* Return 0 on success, 1 on failure */ \
> > + return ret; \
> > +}
> > +
> > +__ATOMIC128_STX_OP(__rte_stx_relaxed, "stxp")
> > +__ATOMIC128_STX_OP(__rte_stx_release, "stlxp")
> > +#endif
> > +
> > +static inline int __rte_experimental
>
> The __rte_experimental tag comes first.
Updated in v10.
>
>
> > +rte_atomic128_cmp_exchange(rte_int128_t *dst,
> > + rte_int128_t *exp,
> > + const rte_int128_t *src,
> > + unsigned int weak,
> > + int success,
> > + int failure)
> > +{
> > + /* Always do strong CAS */
> > + RTE_SET_USED(weak);
> > + /* Ignore memory ordering for failure, memory order for
> > + * success must be stronger or equal
> > + */
> > + RTE_SET_USED(failure);
> > + /* Find invalid memory order */
> > + RTE_ASSERT(success == __ATOMIC_RELAXED
> > + || success == __ATOMIC_ACQUIRE
> > + || success == __ATOMIC_RELEASE
> > + || success == __ATOMIC_ACQ_REL
> > + || success == __ATOMIC_SEQ_CST);
> > +
> > +#if defined(__ARM_FEATURE_ATOMICS) ||
> defined(RTE_ARM_FEATURE_ATOMICS)
> > + rte_int128_t expected = *exp;
> > + rte_int128_t desired = *src;
> > + rte_int128_t old;
> > +
> > + if (success == __ATOMIC_RELAXED)
> > + old = __rte_cas_relaxed(dst, expected, desired);
> > + else if (success == __ATOMIC_ACQUIRE)
> > + old = __rte_cas_acquire(dst, expected, desired);
> > + else if (success == __ATOMIC_RELEASE)
> > + old = __rte_cas_release(dst, expected, desired);
> > + else
> > + old = __rte_cas_acq_rel(dst, expected, desired);
> > +#else
>
> 1: the four first macros (on the memory ordering constraints) can be
> moved here then undef'd once unused.
> Or you can just do without them.
Updated in v10.
>
>
> > + int ldx_mo = __MO_LOAD(success);
> > + int stx_mo = __MO_STORE(success);
> > + uint32_t ret = 1;
> > + register rte_int128_t expected = *exp;
> > + register rte_int128_t desired = *src;
> > + register rte_int128_t old;
> > +
> > + /* ldx128 can not guarantee atomic,
> > + * Must write back src or old to verify atomicity of ldx128;
> > + */
> > + do {
> > + if (ldx_mo == __ATOMIC_RELAXED)
> > + old = __rte_ldx_relaxed(dst);
> > + else
> > + old = __rte_ldx_acquire(dst);
>
> 2: how about using a simple macro that gets passed the op string?
>
> Something like (untested):
>
> #define __READ_128(op_string, src, dst) \
> asm volatile( \
> op_string " %0, %1, %2" \
> : "=&r" (dst.val[0]), \
> "=&r" (dst.val[1]) \
> : "Q" (src->val[0]) \
> : "memory")
>
> Then used like this:
>
> if (ldx_mo == __ATOMIC_RELAXED)
> __READ_128("ldxp", dst, old);
> else
> __READ_128("ldaxp", dst, old);
>
> #undef __READ_128
>
> > +
> > + if (likely(old.int128 == expected.int128)) {
> > + if (stx_mo == __ATOMIC_RELAXED)
> > + ret = __rte_stx_relaxed(dst, desired);
> > + else
> > + ret = __rte_stx_release(dst, desired);
> > + } else {
> > + /* In the failure case (since 'weak' is ignored and only
> > + * weak == 0 is implemented), expected should contain
> > + * the atomically read value of dst. This means, 'old'
> > + * needs to be stored back to ensure it was read
> > + * atomically.
> > + */
> > + if (stx_mo == __ATOMIC_RELAXED)
> > + ret = __rte_stx_relaxed(dst, old);
> > + else
> > + ret = __rte_stx_release(dst, old);
>
> And:
>
> #define __STORE_128(op_string, dst, val, ret) \
> asm volatile( \
> op_string " %w0, %1, %2, %3" \
> : "=&r" (ret) \
> : "r" (val.val[0]), \
> "r" (val.val[1]), \
> "Q" (dst->val[0]) \
> : "memory")
>
> Used like this:
>
> if (likely(old.int128 == expected.int128)) {
> if (stx_mo == __ATOMIC_RELAXED)
> __STORE_128("stxp", dst, desired, ret);
> else
> __STORE_128("stlxp", dst, desired, ret);
> } else {
> /* In the failure case (since 'weak' is ignored and only
> * weak == 0 is implemented), expected should contain
> * the atomically read value of dst. This means, 'old'
> * needs to be stored back to ensure it was read
> * atomically.
> */
> if (stx_mo == __ATOMIC_RELAXED)
> __STORE_128("stxp", dst, old, ret);
> else
> __STORE_128("stlxp", dst, old, ret);
> }
>
> #undef __STORE_128
>
>
> > + }
> > + } while (unlikely(ret));
> > +#endif
> > +
> > + /* Unconditionally updating expected removes
> > + * an 'if' statement.
> > + * expected should already be in register if
> > + * not in the cache.
> > + */
> > + *exp = old;
> > +
> > + return (old.int128 == expected.int128);
> > +}
> > +
> > #ifdef __cplusplus
> > }
> > #endif
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > index 1335d92..cfe7067 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > @@ -183,18 +183,6 @@ static inline void
> rte_atomic64_clear(rte_atomic64_t *v)
> >
> > /*------------------------ 128 bit atomic operations -------------------------*/
> >
> > -/**
> > - * 128-bit integer structure.
> > - */
> > -RTE_STD_C11
> > -typedef struct {
> > - RTE_STD_C11
> > - union {
> > - uint64_t val[2];
> > - __extension__ __int128 int128;
> > - };
> > -} __rte_aligned(16) rte_int128_t;
> > -
> > __rte_experimental
> > static inline int
> > rte_atomic128_cmp_exchange(rte_int128_t *dst,
> > diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h
> b/lib/librte_eal/common/include/generic/rte_atomic.h
> > index 24ff7dc..e6ab15a 100644
> > --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> > +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> > @@ -1081,6 +1081,20 @@ static inline void
> rte_atomic64_clear(rte_atomic64_t *v)
> >
> > /*------------------------ 128 bit atomic operations -------------------------*/
> >
> > +/**
> > + * 128-bit integer structure.
> > + */
> > +RTE_STD_C11
> > +typedef struct {
> > + RTE_STD_C11
> > + union {
> > + uint64_t val[2];
> > +#ifdef RTE_ARCH_64
> > + __extension__ __int128 int128;
> > +#endif
>
> You hid this field for x86.
> What is the reason?
No, it is not hidden for x86. The RTE_ARCH_64 flag covers 64-bit x86 as well; the int128 field is only omitted on 32-bit targets, where __int128 is unavailable.
Thanks,
Phil