[dpdk-dev] [RFC] lfring: lock-free ring buffer

Eads, Gage gage.eads at intel.com
Wed Jan 30 06:17:18 CET 2019



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl at arm.com]
> Sent: Monday, January 28, 2019 4:26 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>; Richardson,
> Bruce <bruce.richardson at intel.com>; Eads, Gage <gage.eads at intel.com>;
> dev at dpdk.org
> Cc: nd <nd at arm.com>
> Subject: Re: [RFC] lfring: lock-free ring buffer
> 
> On Mon, 2019-01-28 at 21:04 +0000, Eads, Gage wrote:
> > Hey Ola,
> >
> > >
> > > -----Original Message-----
> > > From: Ola Liljedahl [mailto:Ola.Liljedahl at arm.com]
> > > Sent: Monday, January 28, 2019 6:29 AM
> > > To: dev at dpdk.org; Eads, Gage <gage.eads at intel.com>; Honnappa
> > > Nagarahalli <Honnappa.Nagarahalli at arm.com>; Richardson, Bruce
> > > <bruce.richardson at intel.com>
> > > Cc: nd <nd at arm.com>; Ola Liljedahl <Ola.Liljedahl at arm.com>
> > > Subject: [RFC] lfring: lock-free ring buffer
> > >
> > > Lock-free MP/MC ring buffer with SP/SC options.
> > > The ring buffer metadata only maintains one set of head and tail
> > > pointers. Tail is updated on enqueue, head is updated on dequeue.
> > > The ring slots between head and tail always contain valid
> > > (unconsumed) elements.
> > > Each ring slot consists of a struct of a data pointer ("ptr") and a
> > > ring index ("idx").
> > > Slot.idx is only updated on enqueue with the new ring index. The
> > > slot index is used to ensure that slots are written sequentially
> > > (important for FIFO ordering).
> > > It is also used to synchronise multiple producers.
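
A minimal sketch of the slot and ring layout described above (struct and
field names are illustrative, not necessarily those used in the patch):

    #include <stdint.h>

    struct lfring_elem {
        void *ptr;     /* user data pointer */
        uintptr_t idx; /* ring index of the lap this slot was written on */
    };

    struct lfring {
        uint64_t head; /* updated on dequeue */
        uint64_t tail; /* updated on enqueue */
        uint64_t mask; /* ring size - 1 (size is a power of two) */
        struct lfring_elem ring[]; /* zero-length array marks ring start */
    };

With this layout, the slot at ring[i & mask] is free for index i when its
idx field equals i - size, i.e. it was last written on the previous lap.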
> > >
> > > MP enqueue scans the ring from the tail until head+size, looking for
> > > an empty slot as indicated by slot.idx equaling the ring index of
> > > the previous lap (index - ring size). The empty slot is written
> > > (.ptr = data, .idx = index) using CAS.
> > > If CAS
> > > fails, some other thread wrote the slot and the thread continues
> > > with the next slot.
> > >
> > > When all elements have been enqueued or if the ring buffer is full,
> > > the thread updates the ring buffer tail pointer (unless it has
> > > already been updated by some other thread). Thus this thread will
> > > help other threads that have written ring slots but not yet updated
> > > the tail pointer.
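
As a sketch, the "helping" tail update could look something like this
(hypothetical helper, not taken from the patch):

    /* Advance the shared tail to 'new_tail' unless another thread has
     * already moved it at least that far. */
    static inline void
    lfring_help_update_tail(struct lfring *r, uint64_t new_tail)
    {
        uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);

        /* Only ever move tail forward; on CAS failure 'tail' is
         * refreshed with the current value and we re-check. */
        while ((int64_t)(new_tail - tail) > 0) {
            if (__atomic_compare_exchange_n(&r->tail, &tail, new_tail,
                                            /*weak*/true,
                                            __ATOMIC_RELEASE,
                                            __ATOMIC_RELAXED))
                break;
        }
    }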
> > >
> > > If a slot with an index different from current or previous lap is
> > > found, it is assumed that the thread has fallen behind (e.g. been
> > > preempted) and the local tail pointer is (conditionally) updated
> > > from the ring buffer metadata in order to quickly catch up.
> > >
> > > MP dequeue reads (up to) N slots from the head index and attempts to
> > > commit the operation using CAS on the head pointer.
> > >
> > > Enqueue N elements: N+1 CAS operations (but the last CAS operation
> > > might not be necessary during contention as producers will help
> > > each other).
> > > Dequeue N elements: 1 CAS operation.
> > >
> > > As the lock-free ring supports both MT-safe (MP/MC) and
> > > single-threaded
> > > (SP/SC) operations, there is a question whether the modes should be
> > > chosen only when creating a ring buffer or if the application can
> > > mix MT-safe and single-threaded enqueue (or dequeue) calls. To
> > > follow the pattern set by rte_ring, specific functions for MP and
> > > SP enqueue and MC and SC dequeue are made available. The wisdom of
> > > this ability can be questioned. The lock-free design allows
> > > threads to be blocked for long periods of time without interfering
> > > with the operation of the ring buffer, but mixing (lock-free)
> > > MT-safe and single-threaded calls requires that there are no such
> > > threads that wake up at the wrong moment (when a single-threaded
> > > operation is in progress).
> > >
> > I see this as a user error. The rte_ring documentation is clear that
> > ring_sp_enqueue and ring_sc_dequeue functions are not MT-safe, and (if
> > I'm reading this correctly) this scenario involves having 2+ threads
> > executing an operation in parallel.
> So the individual MP/SP/MC/SC functions should also be directly available to the
> user? This is what I have implemented here (but not in my original PROGRESS64
> implementation where MP/SP/MC/SC mode is fixed at creation). But I don't like
> it.
> 
> >
> > >
> > > 128-bit lock-free atomic compare exchange operations for AArch64
> > > (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these
> > > functions to find a different home and possibly a different API.
> > > Note that a 'frail' version of atomic_compare_exchange is
> > > implemented; this means that in the case of failure, the old value
> > > returned is not guaranteed to be atomically read (as this is not
> > > possible on ARMv8.0 without also writing to the location).
> > > Atomically read values are often not necessary; it is a successful
> > > compare and exchange operation which provides atomicity.
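
For reference, a portable 128-bit compare-exchange can be sketched with
the GCC/Clang __atomic builtins. This is not the lockfree16.h
implementation; on x86-64 it needs -mcx16 to inline cmpxchg16b, and some
toolchains require linking with -latomic:

    #include <stdbool.h>

    __extension__ typedef __int128 lf128_t;

    static inline bool
    lf_compare_exchange_16(lf128_t *loc, lf128_t *expected, lf128_t desired)
    {
        /* On failure, *expected is updated with the value read from
         * *loc; a 'frail' variant (as in the RFC) would not guarantee
         * that this read is atomic on ARMv8.0. */
        return __atomic_compare_exchange_n(loc, expected, desired,
                                           /*weak*/false,
                                           __ATOMIC_ACQ_REL,
                                           __ATOMIC_ACQUIRE);
    }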
> > >
> > > Signed-off-by: Ola Liljedahl <ola.liljedahl at arm.com>
> > A few high-level thoughts:
> > 1. The enqueue and dequeue functions are not compatible with the
> > rte_ring's bulk semantics,
> I wasn't considering 100% compatibility with other modules that already use
> rte_ring; the code rather reflects the requirements of my own use cases. If we
> want this, then we need to provide the same bulk enqueue/dequeue behaviour
> as rte_ring.
> 
> >  where either all pointers or none are operated on. This is necessary
> > to implement the ring API of course, but in a broader sense I think
> > it's necessary for usability in general. For example, how would a user
> > (or the ring mempool handler) distinguish between partial dequeues
> > that occur because the ring is empty vs. those that occur due to a
> > race? dequeue_bulk_mc could return
> > 0 even though the ring is full, if its first CAS fails and the head
> > has advanced beyond its local tail variable.
> Tail could be re-read on a CAS(&head) failure. This is how I originally coded the
> ring buffer enqueue and dequeue functions. But a CAS(&head) failure doesn't by
> itself mean that *tail* has changed. So I hoisted the read of tail above the
> CAS-loop, which made the code slightly faster since we avoid unnecessarily
> reading a location which might have been modified and thus could cause a
> cache miss. Maybe this micro-optimisation isn't always a good idea (it could
> lead to spurious queue full/empty failures). Another solution is to only re-read
> tail if we see the queue as empty (or we can't acquire as many slots as the
> user requested), e.g.:
> 
>         uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
>         uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
>         do {
>                 actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
>                 if (unlikely(actual <= 0)) {
>                         /* Re-load tail only when queue looks empty */
>                         tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
>                         actual = RTE_MIN((int64_t)(tail - head),
>                                          (int64_t)nelems);
>                         if (unlikely(actual <= 0))
>                                 return 0;
>                 }
>                 rte_lfring_deq_elems(elems, r->ring, mask, head, actual);
>         } while (!__atomic_compare_exchange_n(&r->head,
>                                               &head,
>                                               head + actual,
>                                               /*weak*/false,
>                                               __ATOMIC_RELEASE,
>                                               __ATOMIC_RELAXED));
> >  The user could retry dequeue, of course, but how many times until
> > they feel confident that the ring is truly empty and give up? I don't
> > think we can avoid reserving ring slots prior to enqueueing/dequeueing.
> If tail is (re-) read inside the CAS-loop, there should be no spurious queue empty
> failures and no need to repeatedly call the dequeue function "just to be sure".
> The bulk dequeue/enqueue behaviour is separate.
> 
> >
> > 2. On the plus side, the enqueue function design that allows it to use
> > 1/2 the atomics of mine appears to be independent of reserving ring
> > slots, and should transfer over fairly cleanly. I'm a little concerned
> > about the performance variability it introduces (i.e. if the thread
> > gets into "catch up" mode),
> If a thread has to catch up, it means it is using a stale (head/tail) index (more
> than one ring lap behind). Better to try to load a fresh value (if one is
> available) than to iterate over the ring until it has caught up. So I think this is the
> better/faster design.
> 
> Catch up mode is not triggered by finding an occupied slot for the current lap
> (that was just written by some other thread). Or at least this is the idea. If we
> find a freshly written slot, we just move to the next slot.
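
A fresh-value reload could be as simple as this sketch (illustrative; the
RFC's actual rte_lfring_reload may differ):

    /* Fetch a fresh tail from the shared ring metadata, but never move
     * backwards past the index we already hold. */
    static inline uint64_t
    lfring_reload(const struct lfring *r, uint64_t old_tail)
    {
        uint64_t fresh = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);

        /* Accept the fresh value only if it is ahead of ours. */
        if ((int64_t)(fresh - old_tail) > 0)
            return fresh;
        return old_tail;
    }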
> 
> > particularly for larger rings, since real-time software values predictability.
> > What if the reload criteria was instead something like:
> >
> > #define ENQ_RETRY_LIMIT 32 //arbitrary
> >
> > if (old.e.idx != tail - size) {
> >     if (++fail_cnt < ENQ_RETRY_LIMIT) {
> >         tail++;
> >     } else {
> >         fail_cnt = 0;
> >         tail = rte_lfring_reload(...);
> >     }
> >     goto restart;
> > }
> > fail_cnt = 0;
> There are three cases (slot index must be between q->tail and
> q->head + q->size):
> slot.idx == tail - size: slot is free, try to write it
> slot.idx == tail: slot has just been written (by other thread), skip to
> next slot (tail++)
> none of the above: thread is behind (at least one lap), re-load tail
> from q->tail
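
In code, those three cases might look like this fragment (hypothetical
helper names, following the quoted description rather than the actual
patch; element iteration and the final tail update are omitted):

    /* Assumes locals: tail, mask, size, data; 'old' is a {ptr, idx}
     * pair read from the slot. */
    restart:
        old = lfring_load_slot(&r->ring[tail & mask]);
        if (old.idx == tail - size) {
            /* Case 1: slot is free for this lap, try to claim it. */
            struct lfring_elem new = { .ptr = data, .idx = tail };
            if (!lfring_cas_slot(&r->ring[tail & mask], &old, new))
                goto restart; /* lost the race, re-classify the slot */
            tail++;           /* slot written, move on */
        } else if (old.idx == tail) {
            /* Case 2: slot just written by another thread, skip it. */
            tail++;
            goto restart;
        } else {
            /* Case 3: at least one lap behind, re-load a fresh tail
             * from the ring metadata to catch up. */
            tail = lfring_reload(r, tail);
            goto restart;
        }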
> 
> I think using the retry count actually delays catching up to a fresh position.
> 

Miscommunication here -- by "catch up", I meant the case where the thread is behind but by less than one lap (the second case you describe). In the worst case, the thread would have to read N-1 (N = ring size) entries before reaching the next available entry, and N could easily be in the thousands. That's the performance variability I was referring to, and why I suggested capping the failed slot reads at 32. Maintaining a local fail_cnt variable is a small price to pay (relative to a CAS) to prevent a ring enqueue latency spike.

But you're right that we should still catch the 1+ lap behind case, so the reload criteria could be:

#define ENQ_RETRY_LIMIT 32 //arbitrary

if (old.e.idx != tail - size) {
    /* Slot not free: if it was freshly written (same lap) and we are
     * within the retry budget, just skip to the next slot. */
    if (++fail_cnt < ENQ_RETRY_LIMIT && old.e.idx == tail) {
        tail++;
    } else {
        /* Retry budget exhausted or we are 1+ laps behind: re-load a
         * fresh tail from the ring metadata. */
        fail_cnt = 0;
        tail = rte_lfring_reload(...);
    }
    goto restart;
}
fail_cnt = 0;

> >
> > 3. Using a zero-length array to mark the start of the ring is a nice
> > approach
> > -- I'll incorporate that into the patchset.
> >
> > At an algorithm level, I don't see any other differences.
> > Implementation-wise, we'll need to nail the memory ordering flags to
> > best support weak consistency machines, as you pointed out elsewhere.
> There is no pre-acquisition of slots in enqueue and dequeue. That separate step
> makes lock-freedom impossible (I think).
> 

Can you elaborate? I don't currently see any other way to support rte_ring bulk semantics, which is necessary for creating a non-blocking mempool handler, so we should clear up any doubt.

In the NB ring patchset each thread reserves a number of slots before performing the enq/deq, but doesn't reserve *specific* slots (unlike rte_ring). This reservation is atomic, so that we never over-subscribe valid ring entries (for dequeue) or unused ring entries (for enqueue). This guarantees that the enq/deq operation will eventually complete, regardless of the behavior of other threads. This is why the enqueue loop doesn't check if space is available and the dequeue loop doesn't check if any more valid entries remain.

This sort of reservation should be compatible with lfring, but requires changes (e.g. two sets of head/tail pointers).
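
For concreteness, the dequeue-side reservation could be sketched like
this (names are illustrative, not the NB ring patchset's; the enqueue
side mirrors it with prod_head/cons_tail):

    #include <stdint.h>
    #include <rte_common.h> /* RTE_MIN */

    /* Hypothetical ring with two sets of head/tail pointers. */
    struct nb_ring {
        uint64_t prod_head, prod_tail; /* producer reserve/commit */
        uint64_t cons_head, cons_tail; /* consumer reserve/commit */
    };

    /* Atomically reserve up to n valid entries for dequeue; returns
     * the number reserved and the old head in *old_head. No specific
     * slots are claimed, so the operation always completes eventually. */
    static inline uint64_t
    nb_ring_reserve_dequeue(struct nb_ring *r, uint64_t n,
                            uint64_t *old_head)
    {
        uint64_t head = __atomic_load_n(&r->cons_head, __ATOMIC_RELAXED);
        uint64_t avail, take;

        do {
            /* Entries between cons.head and prod.tail are valid. */
            avail = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE)
                    - head;
            take = RTE_MIN(n, avail);
            if (take == 0)
                return 0; /* ring is (momentarily) empty */
        } while (!__atomic_compare_exchange_n(&r->cons_head, &head,
                                              head + take,
                                              /*weak*/true,
                                              __ATOMIC_RELAXED,
                                              __ATOMIC_RELAXED));
        *old_head = head;
        return take;
    }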

> With a merged implementation, I think the NON-BLOCKING/LOCK-FREE mode
> must have separate implementations also for SP/SC. In non-blocking/lock-free
> mode (whether MP/MC or SP/SC), only one set of head/tail pointers is used. In
> blocking mode, two sets of head/tail are used. There isn't a set of SP/SC code
> that supports both non-blocking/lock-free and blocking behaviours.
> 

Sure, I don't see a problem with that. The NB ring patchset has separate NB SP/SC code from the blocking SP/SC code as well.

> >
> > Out of curiosity, is p64_lfring based on the nb ring patchset? I
> > noticed it used a different algorithm until pretty recently.
> I created the original lfring design (which isn't that different, I think) back in
> November last year and it seemed to work, but I didn't think it could guarantee FIFO
> order as elements were enqueued and dequeued wherever a thread found a
> suitable slot and in my use cases, FIFO order is an important characteristic (need
> to maintain ingress-to-egress packet order). Already then I was considering
> adding the index to the ring slots to enforce FIFO enqueue/dequeue order, I
> knew this would be possible on ARMv7a (which has 2x32-bit CAS using
> exclusives) and ARMv8/AArch64 (which has 2x64-bit CAS using exclusives). I had
> seen this design idea on the Internet before (e.g. here
> https://www.youtube.com/watch?v=HP2InVqgBFM; I actually stopped
> viewing this presentation half-way because I wanted to iron out the details
> myself), it's not a novel idea (sorry).
> However I didn't have time to do this change before Christmas holidays and just
> came back to work two weeks ago. So I would rather call this simultaneous
> "invention" (or publication). Difficult to prove this of course...
> 
> >
> > Thanks,
> > Gage
> >
> > </snip>
> --
> Ola Liljedahl, Networking System Architect, Arm
> Phone +46706866373, Skype ola.liljedahl


