[dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Thu Jan 24 06:21:43 CET 2019


> >
> > Added other platform owners.
> >
> > > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > > *v)  }  #endif
> > > > > > >
> > > > > > > +static inline int
> > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > > > +uint64_t
> > > > > > > +*src) {
> > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > > should be pointers to 128b (__int128)? Or we could define our
> > > > > > own data
> > > > > type.
> > > > >
> > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > __int128 originally because it fails to compile with -pedantic,
> > > > > but on second thought (and with your suggestion of a separate
> > > > > data type), we can resolve that with this typedef:
> > > > >
> > > > > typedef struct {
> > > > >         RTE_STD_C11 __int128 val; } rte_int128_t;
> > > > ok
> > > >
> > > > >
> > > > > > Since, it is a new API, can we define it with memory orderings
> > > > > > which will be more conducive to relaxed memory ordering based
> > > architectures?
> > > > > > You can refer to [1] and [2] for guidance.
> > > > >
> > > > > I certainly see the value in controlling the operation's memory
> > > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > > patchset is the right place to address that. I see that work
> > > > > going a couple
> > > > ways:
> > > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > > arguments. In that case, I'd prefer this be done in a separate
> > > > > patchset that addresses all the atomic operations, not just
> > > > > cmpset, so the interface changes are chosen according to the
> > > > > needs of the full set of atomic operations. If this approach is
> > > > > taken then there's no need to solve this while
> > > > > rte_atomic128_cmpset is experimental, since all the
> > > > other functions are non-experimental anyway.
> > > > >
> > > > > - Or -
> > > > >
> > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > > strongly ordered behavior), and instead create new versions of
> > > > > them that take additional arguments. In this case, we can
> > > > > implement
> > > > > rte_atomic128_cmpset() as is and create a more flexible version
> > > > > in a later
> > > > patchset.
> > > > >
> > > > > Either way, I think the current interface (w.r.t. memory
> > > > > ordering
> > > > > options) can work and still leaves us in a good position for
> > > > > future
> > > > changes/improvements.
> > > > >
> > > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > > APIs as the corresponding __atomic intrinsics serve as replacements.
> > > > I expect that at some point, DPDK code base will not be using
> > > rte_atomicN_* APIs.
> > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > Hence
> > >
> > > I don't think that's correct. From the GCC docs:
> > >
> > > "16-byte integral types are also allowed if `__int128' (see
> > > __int128) is supported by the architecture."
> > >
> > > This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.
> > >
> > > Source:
> > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > > Builtins.html
> > (following is based on my reading, not based on experiments) My
> > understanding is that the __atomic_load_n/store_n were implemented
> > using a compare-and- swap HW instruction (due to lack of HW 128b
> > atomic load and store instructions). This introduced the side effect
> > or store/load respectively. Where as the user does not expect such
> > side effects. As suggested in [1], it looks like GCC delegated the
> > implementation to libatomic which 'it seems' uses locks to implement
> > 128b __atomic intrinsics (needs to be verified)
> >
> > If __atomic intrinsics, for any of the supported platforms, do not
> > have an optimal implementation, I prefer to add a DPDK API as an
> > abstraction. A given platform can choose to implement such an API
> > using __atomic intrinsics if it wants. The DPDK API can be similar to
> __atomic_compare_exchange_n.
> >
> 
> Certainly. From the linked discussions, I see how this would affect the design
> of (hypothetical functions) rte_atomic128_read() and rte_atomic128_set(),
> but I don't see anything that suggests (for the architectures being discussed)
> that __atomic_compare_exchange_n is suboptimal.
I wrote some code and generated assembly to verify what is happening. On aarch64, this call is delegated to libatomic and libatomic needs to be linked. In the generated assembly, it is clear that it uses locks (pthread mutex lock) to provide atomicity for. For 32b and 64b the compiler generates the expected inline assembly. Both, ' __atomic_always_lock_free' and ' __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are not lock free. (gcc - 8.2)

Out of curiosity, I also did similar experiments on x86 (CPU E5-2660 v4). Even after using -mcx16 flag the call is delegated to libatomic. I see the 'lock cmpxchg16b' in the generated assembly. But, ' __atomic_always_lock_free' and ' __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are NOT lock free with gcc version 7.3.0. However with gcc version 5.4.0, ' __atomic_is_lock_free' returns 1. I found more discussion at [3]. However, I am not an expert on x86.

[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

These problems do not exist with 32b and 64b.

> 
> > [1] https://patchwork.ozlabs.org/patch/721686/
> > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> >
> > >
> > > > DPDK needs to write its own. Since this is the first API in that
> > > > regard, I prefer that we start with a signature that resembles
> > > > __atomic intrinsics which have been proven to provide best
> > > > flexibility for
> > > all the platforms supported by DPDK.


More information about the dev mailing list