[dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation

Eads, Gage gage.eads at intel.com
Fri Jan 25 18:21:27 CET 2019



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl at arm.com]
> Sent: Wednesday, January 23, 2019 4:16 AM
> To: Eads, Gage <gage.eads at intel.com>; dev at dpdk.org
> Cc: olivier.matz at 6wind.com; stephen at networkplumber.org; nd
> <nd at arm.com>; Richardson, Bruce <bruce.richardson at intel.com>;
> arybchenko at solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev at intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
> 
> On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > Hi Ola,
> >
> > <snip>
> >
> > >
> > > >
> > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > rte_ring *r);
> > > >  #endif
> > > >  #include "rte_ring_generic_64.h"
> > > >
> > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > +struct nb_ring_entry {
> > > > +	void *ptr; /**< Data pointer */
> > > > +	uint64_t cnt; /**< Modification counter */
> > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > also be supported. I think there are some claims that DPDK still supports e.g.
> > > ARMv7a
> > > and possibly also 32-bit x86?
> > I chose a 64-bit modification counter because (practically speaking)
> > the ABA problem will not occur with such a large counter -- definitely
> > not within my lifetime. See the "Discussion" section of the commit
> > message for more information.
> >
> > With a 32-bit counter, there is a very (very) low likelihood of it,
> > but it is possible. Personally, I don't feel comfortable providing
> > such code, because a) I doubt all users would understand the
> > implementation well enough to do the risk/reward analysis, and b) such
> > a bug would be near impossible to reproduce and root-cause if it did occur.
> With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g. ARMv7a and
> probably x86 as well) won't be able to support this as they at best support 64-bit
> CAS (ARMv7a has LDREXD/STREXD). So you are essentially putting a 64-bit (and
> 128-bit CAS) requirement on the implementation.
> 

Yes, I am. I tried to make that clear in the cover letter.
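
To spell out where that requirement comes from, here is the entry layout from the patch again with the arithmetic made explicit:

struct nb_ring_entry {
	void *ptr;    /**< Data pointer -- 8 bytes on a 64-bit target */
	uint64_t cnt; /**< Modification counter -- 8 bytes */
};

Both fields have to be swapped in a single atomic operation so that a stale pointer paired with a stale counter can never be mistaken for the current entry (the ABA case the counter exists to prevent), which on a 64-bit target means one CAS covering all 16 bytes.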

> >
> > >
> > >
> > > >
> > > > +};
> > > > +
> > > > +/* The non-blocking ring algorithm is based on the original rte ring (derived
> > > > + * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
> > > > + * concurrent queue.
> > > > + */
> > > > +
> > > > +/**
> > > > + * @internal
> > > > + *   Enqueue several objects on the non-blocking ring (single-producer only)
> > > > + *
> > > > + * @param r
> > > > + *   A pointer to the ring structure.
> > > > + * @param obj_table
> > > > + *   A pointer to a table of void * pointers (objects).
> > > > + * @param n
> > > > + *   The number of objects to add in the ring from the obj_table.
> > > > + * @param behavior
> > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > + * @param free_space
> > > > + *   returns the amount of space after the enqueue operation has finished
> > > > + * @return
> > > > + *   Actual number of objects enqueued.
> > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > > +			    unsigned int n,
> > > > +			    enum rte_ring_queue_behavior behavior,
> > > > +			    unsigned int *free_space)
> > > > +{
> > > > +	uint32_t free_entries;
> > > > +	size_t head, next;
> > > > +
> > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > +					 &head, &next, &free_entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > +
> > > > +	r->prod_64.tail += n;
> > > Don't we need release order when (or smp_wmb between) writing of the
> > > ring pointers and the update of tail? By updating the tail pointer,
> > > we are synchronising with a consumer.
> > >
> > > I prefer using __atomic operations even for load and store. You can
> > > see which parts of the code that synchronise with each other, e.g.
> > > store-release to some location synchronises with load-acquire from
> > > the same location. If you don't know how different threads
> > > synchronise with each other, you are very likely to make mistakes.
> > >
> > You can tell this code was written when I thought x86-64 was the only
> > viable target :). Yes, you are correct.
> >
> > With regards to using __atomic intrinsics, I'm planning on taking a
> > similar approach to the functions duplicated in rte_ring_generic.h and
> > rte_ring_c11_mem.h: one version that uses rte_atomic functions (and
> > thus stricter memory ordering) and one that uses __atomic intrinsics
> > (and thus can benefit from more relaxed memory ordering).
> What's the advantage of having two different implementations? What is the
> disadvantage?
> 
> The existing ring buffer code originally had only the "legacy" implementation
> which was kept when the __atomic implementation was added. The reason
> claimed was that some older compilers for x86 do not support GCC __atomic
> builtins. But I thought there was consensus that new functionality could have
> only __atomic implementations.
> 

When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left disabled for thunderx[1] for performance reasons. Assuming that hasn't changed, the advantage of having two versions is that it best supports all of DPDK's platforms; the disadvantage is, of course, the duplicated code and the additional maintenance burden.
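
To make that split concrete, the single-producer tail publication would differ between the two versions roughly as below (a sketch only; the helper name and the RTE_RING_USE_C11_MEM_MODEL spelling are assumptions on my part, not code from this patch):

static __rte_always_inline void
__rte_ring_nb_update_prod_tail(struct rte_ring *r, unsigned int n)
{
#ifdef RTE_RING_USE_C11_MEM_MODEL
	/* C11 variant: the release store orders the preceding ring-entry
	 * writes before consumers can observe the new tail.
	 */
	__atomic_store_n(&r->prod_64.tail, r->prod_64.tail + n,
			 __ATOMIC_RELEASE);
#else
	/* Legacy variant: explicit write barrier, then a plain store. */
	rte_smp_wmb();
	r->prod_64.tail += n;
#endif
}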

That said, if the thunderx maintainers are OK with it, I'm certainly open to doing only the __atomic version. Note that even in the __atomic version, based on Honnappa's findings[2], using a DPDK-defined rte_atomic128_cmpset() (with additional arguments to support machines with weak consistency) appears to be a better option than __atomic_compare_exchange_16.
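
For illustration, the kind of interface I have in mind looks roughly like the sketch below (the type, name, and signature are assumptions rather than a final API; the generic GCC builtin is shown only for brevity, needs -mcx16 on x86-64, and a real implementation would use arch-specific instructions):

#include <stdbool.h>
#include <stdint.h>

/* 128-bit value, 16-byte aligned so cmpxchg16b (x86-64) or an LDXP/STXP
 * pair (aarch64) can operate on it.
 */
typedef struct __attribute__((aligned(16))) {
	uint64_t val[2];
} rte_int128_t;

/* Hypothetical DPDK-defined 128-bit compare-and-set with explicit
 * memory-order arguments for weakly ordered machines.
 */
static inline bool
rte_atomic128_cmpset(rte_int128_t *dst, rte_int128_t *exp, rte_int128_t *src,
		     unsigned int weak, int success, int failure)
{
	return __atomic_compare_exchange(dst, exp, src, weak, success,
					 failure);
}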

I couldn't find the discussion about new functionality using __atomic going forward -- can you send a link?

[1] https://mails.dpdk.org/archives/dev/2017-December/082853.html
[2] http://mails.dpdk.org/archives/dev/2019-January/124002.html

> Does the non-blocking ring buffer implementation have to support these older
> compilers? Will the applications that require these older compilers be updated to
> utilise the non-blocking ring buffer?
> 

(See above -- compiler versions weren't a consideration here.)

> >
> > >
> > > >
> > > > +
> > > > +end:
> > > > +	if (free_space != NULL)
> > > > +		*free_space = free_entries - n;
> > > > +	return n;
> > > > +}
> > > > +
> > > > +/**
> > > > + * @internal
> > > > + *   Enqueue several objects on the non-blocking ring (multi-producer safe)
> > > > + *
> > > > + * @param r
> > > > + *   A pointer to the ring structure.
> > > > + * @param obj_table
> > > > + *   A pointer to a table of void * pointers (objects).
> > > > + * @param n
> > > > + *   The number of objects to add in the ring from the obj_table.
> > > > + * @param behavior
> > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > + * @param free_space
> > > > + *   returns the amount of space after the enqueue operation has finished
> > > > + * @return
> > > > + *   Actual number of objects enqueued.
> > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> > > > +			    unsigned int n,
> > > > +			    enum rte_ring_queue_behavior behavior,
> > > > +			    unsigned int *free_space)
> > > > +{
> > > > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > > > +	RTE_SET_USED(r);
> > > > +	RTE_SET_USED(obj_table);
> > > > +	RTE_SET_USED(n);
> > > > +	RTE_SET_USED(behavior);
> > > > +	RTE_SET_USED(free_space);
> > > > +#ifndef ALLOW_EXPERIMENTAL_API
> > > > +	printf("[%s()] RING_F_NB requires an experimental API."
> > > > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> > > > +	       , __func__);
> > > > +#endif
> > > > +	return 0;
> > > > +#endif
> > > > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > > > +	size_t head, next, tail;
> > > > +	uint32_t free_entries;
> > > > +	unsigned int i;
> > > > +
> > > > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > > > +					 &head, &next, &free_entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > > > +		struct nb_ring_entry old_value, new_value;
> > > > +		struct nb_ring_entry *ring_ptr;
> > > > +
> > > > +		/* Enqueue to the tail entry. If another thread wins the race,
> > > > +		 * retry with the new tail.
> > > > +		 */
> > > > +		tail = r->prod_64.tail;
> > > > +
> > > > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
> > > This is an ugly expression and cast. Also I think it is unnecessary.
> > > What's preventing this from being written without a cast? Perhaps
> > > the ring array needs to be a union of "void *" and struct
> > > nb_ring_entry?
> > The cast is necessary for the correct pointer arithmetic (let
> > "uintptr_t base == &r[1]"):
> Yes I know the C language.
> 
> > - With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail & r->mask);
> > - W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail & r->mask);
> >
> > FWIW, this is essentially the same as is done with the second argument
> > (&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across
> > multiple lines of code. The equivalent here would be:
> >
> > struct nb_ring_entry *ring_base = (struct nb_ring_entry *)&r[1];
> > ring_ptr = &ring_base[tail & r->mask];
> >
> > Which is more legible, I think.
> The RTE ring buffer code is not very legible to start with.
> 
> >
> > There is no ring array structure in which to add a union; the ring
> > array is a contiguous chunk of memory that immediately follows after
> > the end of a struct rte_ring. We interpret the memory there according
> > to the ring entry data type (void * for regular rings and struct nb_ring_entry for
> non-blocking rings).
> My worry is that we are abusing the C language and creating a monster of
> fragile C code that will be more and more difficult to understand and to
> maintain. At some point you have to think the question "Are we doing the right
> thing?".
>

I'm not aware of any fragility/maintainability issues in the ring code (though perhaps the maintainers have a different view!), and personally I find the code fairly legible. If you have a specific suggestion, I'll look into incorporating it.
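
For reference, the split form from earlier in the thread would read as follows in context (just a sketch of how I'd write it; the cast is what makes the indexing scale by sizeof(struct nb_ring_entry) rather than sizeof(struct rte_ring)):

	/* The entry array starts immediately after the ring header, so
	 * reinterpret &r[1] as an array of nb_ring_entry and index it.
	 */
	struct nb_ring_entry *ring_base = (struct nb_ring_entry *)&r[1];
	struct nb_ring_entry *ring_ptr = &ring_base[tail & r->mask];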

Thanks,
Gage

</snip>

