[dpdk-dev] [RFC] lfring: lock-free ring buffer

Eads, Gage gage.eads at intel.com
Mon Jan 28 22:04:46 CET 2019


Hey Ola,

> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl at arm.com]
> Sent: Monday, January 28, 2019 6:29 AM
> To: dev at dpdk.org; Eads, Gage <gage.eads at intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli at arm.com>; Richardson, Bruce
> <bruce.richardson at intel.com>
> Cc: nd <nd at arm.com>; Ola Liljedahl <Ola.Liljedahl at arm.com>
> Subject: [RFC] lfring: lock-free ring buffer
> 
> Lock-free MP/MC ring buffer with SP/SC options.
> The ring buffer metadata only maintains one set of head and tail pointers. Tail is
> updated on enqueue, head is updated on dequeue.
> The ring slots between head and tail always contain valid (unconsumed) entries.
> Each ring slot consists of a struct of data pointer ("ptr") and ring index ("idx").
> Slot.idx is only updated on enqueue with the new ring index. The slot index is
> used to ensure that slots are written sequentially (important for FIFO ordering).
> It is also used to synchronise multiple producers.
> 
> MP enqueue scans the ring from the tail until head+size, looking for an empty
> slot as indicated by slot.idx equaling the ring index of the previous lap (index -
> ring size). The empty slot is written (.ptr = data, .idx = index) using CAS. If CAS
> fails, some other thread wrote the slot and the thread continues with the next
> slot.
> 
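As a concreteness check, the empty-slot test + CAS described above would look roughly like the sketch below. The struct and function names are mine, not from the patch, and I pack the data word and index into 64 bits so the sketch is portable; the RFC itself uses a 128-bit {ptr, idx} slot via the CAS in lockfree16.h.

```c
#include <stdbool.h>
#include <stdint.h>

struct lfring_slot {
    uint32_t off; /* stand-in for the data pointer */
    uint32_t idx; /* ring index of the lap that last wrote this slot */
};

union slot_u {
    struct lfring_slot s;
    uint64_t raw;
};

/* A slot is empty iff its idx still holds the previous lap's index
 * (tail - size). On CAS failure some other producer won the slot and
 * the caller continues with the next one. */
static bool
try_write_slot(union slot_u *slot, uint32_t off,
               uint32_t tail, uint32_t size)
{
    union slot_u old, new;

    old.raw = __atomic_load_n(&slot->raw, __ATOMIC_RELAXED);
    if (old.s.idx != tail - size)
        return false;                   /* not empty for this lap */
    new.s.off = off;
    new.s.idx = tail;
    return __atomic_compare_exchange_n(&slot->raw, &old.raw, new.raw,
                                       false, __ATOMIC_RELEASE,
                                       __ATOMIC_RELAXED);
}
```
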
> When all elements have been enqueued or if the ring buffer is full, the thread
> updates the ring buffer tail pointer (unless it has already been updated by
> some other thread). Thus this thread will help other threads that have written
> ring slots but not yet updated the tail pointer.
> 
> If a slot with an index different from current or previous lap is found, it is
> assumed that the thread has fallen behind (e.g. been preempted) and the local
> tail pointer is (conditionally) updated from the ring buffer metadata in order to
> quickly catch up.
> 
> MP dequeue reads (up to) N slots from the head index and attempts to commit
> the operation using CAS on the head pointer.
> 
> Enqueue N elements: N+1 CAS operations (but the last CAS operation might not
> be necessary under contention, as producers will help each other).
> Dequeue N elements: 1 CAS operation.
> 
> As the lock-free ring supports both MT-safe (MP/MC) and single-threaded
> (SP/SC) operations, there is a question whether the modes should be chosen
> only when creating a ring buffer or if the application can mix MT-safe and single-
> threaded enqueue (or dequeue) calls. To follow the pattern set by rte_ring,
> specific functions for MP and SP enqueue and MC and SC dequeue are made
> available. The wisdom of this ability can be questioned. The lock-free design
> allows threads to be blocked for long periods of time without interfering with
> the operation of the ring buffer, but mixing (lock-free) MT-safe and single-
> threaded calls requires that no such thread wakes up at the wrong moment
> (while a single-threaded operation is in progress).
> 

I see this as a user error. The rte_ring documentation is clear that ring_sp_enqueue and ring_sc_dequeue functions are not MT-safe, and (if I'm reading this correctly) this scenario involves having 2+ threads executing an operation in parallel.

> 128-bit lock-free atomic compare exchange operations for AArch64
> (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these functions to
> find a different home and possibly a different API.
> Note that a 'frail' version of atomic_compare_exchange is implemented; this
> means that in the case of failure, the old value returned is not guaranteed to
> have been read atomically (as this is not possible on ARMv8.0 without also
> writing to the location). An atomically read old value is often unnecessary; it
> is the successful compare-and-exchange operation that provides atomicity.
> 
> Signed-off-by: Ola Liljedahl <ola.liljedahl at arm.com>

A few high-level thoughts:
1. The enqueue and dequeue functions are not compatible with the rte_ring's bulk semantics, where either all pointers or none are operated on. This is necessary to implement the ring API of course, but in a broader sense I think it's necessary for usability in general. For example, how would a user (or the ring mempool handler) distinguish between partial dequeues that occur because the ring is empty vs. those that occur due to a race? dequeue_bulk_mc could return 0 even though the ring is full, if its first CAS fails and the head has advanced beyond its local tail variable. The user could retry dequeue, of course, but how many times until they feel confident that the ring is truly empty and give up? I don't think we can avoid reserving ring slots prior to enqueueing/dequeueing.

2. On the plus side, the enqueue function design that allows it to use 1/2 the atomics of mine appears to be independent of reserving ring slots, and should transfer over fairly cleanly. I'm a little concerned about the performance variability it introduces (i.e. if the thread gets into "catch up" mode), particularly for larger rings, since real-time software values predictability. What if the reload criteria was instead something like:

#define ENQ_RETRY_LIMIT 32 /* arbitrary */

if (old.e.idx != tail - size) {
    /* Slot not empty for this lap: step forward, but only up to the
     * retry limit before falling back to a full reload. */
    if (++fail_cnt < ENQ_RETRY_LIMIT) {
        tail++;
    } else {
        fail_cnt = 0;
        tail = rte_lfring_reload(...);
    }
    goto restart;
}
fail_cnt = 0;

3. Using a zero-length array to mark the start of the ring is a nice approach -- I'll incorporate that into the patchset.

At an algorithm level, I don't see any other differences. Implementation-wise, we'll need to nail the memory ordering flags to best support weak consistency machines, as you pointed out elsewhere.

Out of curiosity, is p64_lfring based on the nb ring patchset? I noticed it used a different algorithm until pretty recently.

Thanks,
Gage

</snip>

