[dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation

Eads, Gage gage.eads at intel.com
Mon Jan 28 19:54:25 CET 2019



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl at arm.com]
> Sent: Monday, January 28, 2019 4:36 AM
> To: jerinj at marvell.com; mczekaj at marvell.com; Eads, Gage
> <gage.eads at intel.com>; dev at dpdk.org
> Cc: olivier.matz at 6wind.com; stephen at networkplumber.org; nd
> <nd at arm.com>; Richardson, Bruce <bruce.richardson at intel.com>;
> arybchenko at solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev at intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
> 
> On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> >
> > >
> > > -----Original Message-----
> > > From: Ola Liljedahl [mailto:Ola.Liljedahl at arm.com]
> > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > To: Eads, Gage <gage.eads at intel.com>; dev at dpdk.org
> > > Cc: olivier.matz at 6wind.com; stephen at networkplumber.org; nd
> > > <nd at arm.com>; Richardson, Bruce <bruce.richardson at intel.com>;
> > > arybchenko at solarflare.com; Ananyev, Konstantin
> > > <konstantin.ananyev at intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > implementation
> > >
> > > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > >
> > > > Hi Ola,
> > > >
> > > > <snip>
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
> > > > > >  #endif
> > > > > >  #include "rte_ring_generic_64.h"
> > > > > >
> > > > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > > > +struct nb_ring_entry {
> > > > > > +	void *ptr; /**< Data pointer */
> > > > > > +	uint64_t cnt; /**< Modification counter */
> > > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > > also be supported. I think there are some claims that DPDK still
> > > > > supports e.g.
> > > > > ARMv7a
> > > > > and possibly also 32-bit x86?
> > > > I chose a 64-bit modification counter because (practically
> > > > speaking) the ABA problem will not occur with such a large counter
> > > > -- definitely not within my lifetime. See the "Discussion" section
> > > > of the commit message for more information.
> > > >
> > > > With a 32-bit counter, there is a very (very) low likelihood of
> > > > it, but it is possible. Personally, I don't feel comfortable
> > > > providing such code, because a) I doubt all users would understand
> > > > the implementation well enough to do the risk/reward analysis, and
> > > > b) such a bug would be near impossible to reproduce and root-cause
> > > > if it did occur.
> > > With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g.
> > > ARMv7a and
> > > probably x86 as well) won't be able to support this as they at best
> > > support 64-bit CAS (ARMv7a has LDREXD/STREXD). So you are
> > > essentially putting a 64-bit (and 128-bit CAS) requirement on the
> > > implementation.
> > >
> > Yes, I am. I tried to make that clear in the cover letter.
> >
> > >
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > +};
> > > > > > +
> > > > > > +/* The non-blocking ring algorithm is based on the original rte ring
> > > > > > + * (derived from FreeBSD's bufring.h) and inspired by Michael and
> > > > > > + * Scott's non-blocking concurrent queue.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * @internal
> > > > > > + *   Enqueue several objects on the non-blocking ring (single-producer only)
> > > > > > + *
> > > > > > + * @param r
> > > > > > + *   A pointer to the ring structure.
> > > > > > + * @param obj_table
> > > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > > + * @param n
> > > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > > + * @param behavior
> > > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> > > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > > > + * @param free_space
> > > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > > > + * @return
> > > > > > + *   Actual number of objects enqueued.
> > > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > > + */
> > > > > > +static __rte_always_inline unsigned int
> > > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > > > > +			    unsigned int n,
> > > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > > +			    unsigned int *free_space)
> > > > > > +{
> > > > > > +	uint32_t free_entries;
> > > > > > +	size_t head, next;
> > > > > > +
> > > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > > +					 &head, &next, &free_entries);
> > > > > > +	if (n == 0)
> > > > > > +		goto end;
> > > > > > +
> > > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > > +
> > > > > > +	r->prod_64.tail += n;
> > > > > Don't we need release ordering (or an smp_wmb) between writing the
> > > > > ring pointers and updating the tail? By updating the tail pointer,
> > > > > we are synchronising with a consumer.
> > > > >
> > > > > I prefer using __atomic operations even for load and store. You
> > > > > can then see which parts of the code synchronise with each other, e.g.
> > > > > a store-release to some location synchronises with a load-acquire
> > > > > from the same location. If you don't know how different threads
> > > > > synchronise with each other, you are very likely to make mistakes.
> > > > >
> > > > You can tell this code was written when I thought x86-64 was the
> > > > only viable target :). Yes, you are correct.
> > > >
> > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > a similar approach to the functions duplicated in
> > > > rte_ring_generic.h and
> > > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > > (and thus stricter memory ordering) and one that uses __atomic
> > > > intrinsics (and thus can benefit from more relaxed memory ordering).
> From a code point of view, I strongly prefer the atomic operations to be visible
> in the top-level code, not hidden in subroutines. For correctness, it is vital that
> memory accesses are performed with the required ordering and that acquires and
> releases match up. Hiding e.g. load-acquire and store-release in subroutines (in
> a different file!) makes this difficult. There have already been such bugs found in
> rte_ring.
> 

After working on the acq/rel ordering this weekend, I agree. This'll be easier/cleaner if we end up only using the C11 version.

> > > What's the advantage of having two different implementations? What
> > > is the disadvantage?
> > >
> > > The existing ring buffer code originally had only the "legacy"
> > > implementation
> > > which was kept when the __atomic implementation was added. The
> > > reason claimed was that some older compilers for x86 do not support
> > > GCC __atomic builtins. But I thought there was consensus that new
> > > functionality could have only __atomic implementations.
> > >
> > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > disabled for thunderx[1] for performance reasons. Assuming that hasn't
> > changed, the advantage to having two versions is to best support all of DPDK's
> platforms.
> > The disadvantage is of course duplicated code and the additional
> > maintenance burden.
> The only way I see that a C11 memory model implementation can be slower
> than using smp_wmb/rmb is if you need to order loads before a synchronizing
> store and there are also outstanding stores which do not require ordering.
> smp_rmb() handles this, while a store-release will also (unnecessarily) order
> those outstanding stores. This situation occurs e.g. in ring buffer dequeue
> operations, where ring slots are read (and possibly written to thread-private
> memory) before the ring slots are released (e.g. using CAS-release or
> store-release).
> 
> I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also has
> something to do with this problem. If a large number of stores are pending
> in the load/store unit, a store-release might have to wait for a long time
> before the synchronizing store can complete.
> 
> >
> > That said, if the thunderx maintainers are ok with it, I'm certainly
> > open to only doing the __atomic version. Note that even in the
> > __atomic version, based on Honnappa's findings[2], using a DPDK-defined
> > rte_atomic128_cmpset() (with additional arguments to support machines
> > with weak consistency) appears to be a better option than
> > __atomic_compare_exchange_16.
> __atomic_compare_exchange_16() is not guaranteed to be lock-free. It is not
> lock-free on ARM/AArch64, and the support in GCC is formally broken (you
> can't use cmpxchg16b to implement __atomic_load_16).
> 
> So yes, I think DPDK will have to define and implement the 128-bit atomic
> compare-and-exchange operation (whatever it will be called). For compatibility
> with ARMv8.0, we can't require the "old" value returned by a failed
> compare-exchange operation to be read atomically (LDXP does not guarantee
> atomicity by itself). But this is seldom a problem; many designs read the
> memory location using two separate 64-bit loads (so not atomically) anyway.
> It is the successful atomic compare-exchange operation that provides atomicity.
> 

Ok. I agree, I don't expect that to be a problem. The 128-bit CAS patch I just submitted[1] (which was developed before reading this) will have to be changed.

[1] http://mails.dpdk.org/archives/dev/2019-January/124159.html

Thanks,
Gage


