[PATCH 1/1] ring: safe partial ordering for head/tail update
    Ola Liljedahl 
    Ola.Liljedahl at arm.com
       
    Tue Sep 23 23:57:32 CEST 2025
    
    
  
>
>
> On 2025-09-20, 14:01, "Konstantin Ananyev" <konstantin.ananyev at huawei.com <mailto:konstantin.ananyev at huawei.com>> wrote:
>
>
>
>
> > >
> > > To avoid information loss I combined reply to two Wathsala replies into one.
> > >
> > >
> > > > > > The function __rte_ring_headtail_move_head() assumes that the
> > > > > > barrier
> > > > > (fence) between the load of the head and the load-acquire of the
> > > > > > opposing tail guarantees the following: if a first thread reads
> > > > > > tail
> > > > > > and then writes head and a second thread reads the new value of
> > > > > > head
> > > > > > and then reads tail, then it should observe the same (or a later)
> > > > > > value of tail.
> > > > > >
> > > > > > This assumption is incorrect under the C11 memory model. If the
> > > > > > barrier
> > > > > > (fence) is intended to establish a total ordering of ring
> > > > > > operations,
> > > > > > it fails to do so. Instead, the current implementation only
> > > > > > enforces a
> > > > > > partial ordering, which can lead to unsafe interleavings. In
> > > > > > particular,
> > > > > > some partial orders can cause underflows in free slot or available
> > > > > > element computations, potentially resulting in data corruption.
> > > > >
> > > > > Hmm... sounds exactly like the problem from the patch we discussed
> > > > > earlier that year:
> > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4- <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > konstantin.ananyev at huawei.com <mailto:konstantin.ananyev at huawei.com> <mailto:20250521111432.207936-4-
> > konstantin.ananyev at huawei.com <mailto:konstantin.ananyev at huawei.com>>/
> > > > > In two words:
> > > > > "... thread can see 'latest' 'cons.head' value, with 'previous' value
> > > > > for 'prod.tail' or visa-versa.
> > > > > In other words: 'cons.head' value depends on 'prod.tail', so before
> > > > > making latest 'cons.head'
> > > > > value visible to other threads, we need to ensure that latest
> > > > > 'prod.tail' is also visible."
> > > > > Is that the one?
> > >
> > >
> > > > Yes, the behavior occurs under RCpc (LDAPR) but not under RCsc (LDAR),
> > > > which is why we didn’t catch it earlier. A fuller explanation, with
> > > > Herd7 simulations, is in the blog post linked in the cover letter.
> > > >
> > > > https://community.arm.com/arm-community-blogs/b/architectures-and- <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order
> > <https://community.arm.com/arm-community-blogs/b/architectures-and- <https://community.arm.com/arm-community-blogs/b/architectures-and->
> > processors-blog/posts/when-a-barrier-does-not-block-the-pitfalls-of-partial-order>
> > >
> > >
> > > I see, so now it is reproducible with core rte_ring on real HW.
> > >
> > >
> > > > >
> > > > > > The issue manifests when a CPU first acts as a producer and later
> > > > > > as a
> > > > > > consumer. In this scenario, the barrier assumption may fail when
> > > > > > another
> > > > > > core takes the consumer role. A Herd7 litmus test in C11 can
> > > > > > demonstrate
> > > > > > this violation. The problem has not been widely observed so far
> > > > > > because:
> > > > > > (a) on strong memory models (e.g., x86-64) the assumption holds,
> > > > > > and
> > > > > > (b) on relaxed models with RCsc semantics the ordering is still
> > > > > > strong
> > > > > > enough to prevent hazards.
> > > > > > The problem becomes visible only on weaker models, when load-
> > > > > > acquire is
> > > > > > implemented with RCpc semantics (e.g. some AArch64 CPUs which
> > > > > > support
> > > > > > the LDAPR and LDAPUR instructions).
> > > > > >
> > > > > > Three possible solutions exist:
> > > > > > 1. Strengthen ordering by upgrading release/acquire semantics to
> > > > > > sequential consistency. This requires using seq-cst for
> > > > > > stores,
> > > > > > loads, and CAS operations. However, this approach introduces a
> > > > > > significant performance penalty on relaxed-memory
> > > > > > architectures.
> > > > > >
> > > > > > 2. Establish a safe partial order by enforcing a pair-wise
> > > > > > happens-before relationship between thread of same role by
> > > > > > changing
> > > > > > the CAS and the preceding load of the head by converting them
> > > > > > to
> > > > > > release and acquire respectively. This approach makes the
> > > > > > original
> > > > > > barrier assumption unnecessary and allows its removal.
> > > > >
> > > > > For the sake of clarity, can you outline what would be exact code
> > > > > changes for
> > > > > approach #2? Same as in that patch:
> > > > > https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4- <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->
> > <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4-> <https://patchwork.dpdk.org/project/dpdk/patch/20250521111432.207936-4->>
> > > > konstantin.ananyev at huawei.com <mailto:konstantin.ananyev at huawei.com> <mailto:konstantin.ananyev at huawei.com <mailto:konstantin.ananyev at huawei.com>>/
> > > > > Or something different?
> > > >
> > > > Sorry, I missed the later half you your comment before.
> > > > Yes, you have proposed the same solution there.
> > >
> > >
> > > Ok, thanks for confirmation.
> > >
> > >
> > > > >
> > > > >
> > > > > > 3. Retain partial ordering but ensure only safe partial orders
> > > > > > are
> > > > > > committed. This can be done by detecting underflow conditions
> > > > > > (producer < consumer) and quashing the update in such cases.
> > > > > > This approach makes the original barrier assumption
> > > > > > unnecessary
> > > > > > and allows its removal.
> > > > >
> > > > > > This patch implements solution (3) for performance reasons.
> > > > > >
> > > > > > Signed-off-by: Wathsala Vithanage <wathsala.vithanage at arm.com <mailto:wathsala.vithanage at arm.com>
> > <mailto:wathsala.vithanage at arm.com <mailto:wathsala.vithanage at arm.com>>>
> > > > > > Signed-off-by: Ola Liljedahl <ola.liljedahl at arm.com <mailto:ola.liljedahl at arm.com>
> > <mailto:ola.liljedahl at arm.com <mailto:ola.liljedahl at arm.com>>>
> > > > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com <mailto:honnappa.nagarahalli at arm.com>
> > <mailto:honnappa.nagarahalli at arm.com <mailto:honnappa.nagarahalli at arm.com>>>
> > > > > > Reviewed-by: Dhruv Tripathi <dhruv.tripathi at arm.com <mailto:dhruv.tripathi at arm.com>
> > <mailto:dhruv.tripathi at arm.com <mailto:dhruv.tripathi at arm.com>>>
> > > > > > ---
> > > > > > lib/ring/rte_ring_c11_pvt.h | 10 +++++++---
> > > > > > 1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h
> > > > > > b/lib/ring/rte_ring_c11_pvt.h
> > > > > > index b9388af0da..e5ac1f6b9e 100644
> > > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > > @@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct
> > > > > > rte_ring_headtail
> > > > > > *d,
> > > > > > /* Reset n to the initial burst count */
> > > > > > n = max;
> > > > > >
> > > > > > - /* Ensure the head is read before tail */
> > > > > > - rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > > -
> > > > > > /* load-acquire synchronize with store-release of
> > > > > > ht->tail
> > > > > > * in update_tail.
> > > > > > */
> > > > >
> > > > > But then cons.head can be read a before prod.tail (and visa-versa),
> > > > > right?
> > > >
> > > > Right, we let it happen but eliminate any resulting states that are
> > > > semantically incorrect at the end.
> > >
> > >
> > > Two comments here:
> > > 1) I think it is probably safer to do the check like that:
> > > If (*entries > ring->capacity) ...
> > Yes, this might be another way of handling underflow situations. We could study
> > this.
> >
> > I have used the check for negative without problems in my ring buffer
> > implementations
> > https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c <https://github.com/ARM-software/progress64/blob/master/src/p64_ringbuf.c>
> > but can't say that has been battle-tested.
>
>
> My thought was about the case (probably hypothetical) when the difference
> between stale tail and head will be bigger then 2^31 + 1.
>
>
> > > 2) My concern that without forcing a proper read ordering
> > > (cons.head first then prod.tail) we re-introduce a window for all sorts of
> > > ABA-like problems.
> > Head and tail indexes are monotonically increasing so I don't see a risk for ABA-like
> > problems.
>
>
> I understand that, but with current CPU speeds it can take rte_ring just few seconds to
> wrap around head/tail values. If user doing something really fancy - like using rte_ring ZC API
> (i.e. just moving head/tail without reading actual objects) that can probably happen even
> faster (less than a second?).
> Are we sure that the stale tail value will never persist that long?
> Let say user calling move_head() in a loop till it succeeds?
>
>
> > Indeed, adding a monotonically increasing tag to pointers is the common way of
> > avoiding ABA
> > problems in lock-free designs.
>
>
> Yep, using 64-bit values for head/tail counters will help to avoid these concerns.
> But it will probably break HTS/RTS modes, plus it is an ABI change for sure.
>
>
> Actually after another thought, I have one more concern here:
>
>
> + /*
> + * Ensure the entries calculation was not based on a stale
> + * and unsafe stail observation that causes underflow.
> + */
> + if ((int)*entries < 0)
> + *entries = 0;
> +
>
>
> With that change, it might return not-valid information back to the user
> about number of free/occupied entries in the ring.
> Plus rte_ring_enqueue() now might fail even when there are enough free entries
> in the ring (same for dequeue).
How do you (or the thread) know there are enough free (or used) entries? Do you
assume sequentially consistent behaviour (a total order of memory accesses)?
Otherwise, you would need to explicitly create a happens-before relation
between threads, e.g. a consumer which made room in the ring buffer must
synchronize-with the producer that there is now room for more elements. That
synchronize-with edge will ensure the producer reads a fresh value of stail. But
without it, how can a thread know the state of the ring buffer that is being
manipulated by another thread?
> That looks like a change in our public API behavior that might break many things.
> There are quite few places when caller expects enqueue/dequeue
> operation to always succeed (let say there always should be enough free space in the ring).
Single-threaded scenarios are not a problem. Do you have a multithreaded scenario where
the caller expects enqueue/dequeue to always succeed? How are the threads involved in such
a scenario synchronizing with each other?
> For example: rte_mempool works like that.
> I am pretty sure there are quite few other places like that inside DPDK,
> not to mention third-party code.
>
>
> Considering all of the above, I am actually more in favor
> to combine approaches #2 and #3 for the final patch:
> establish a safe partial order (#2) and keep the check from #3 (should it become an assert()/verify()?)
I agree that using acquire/release for all prod/cons_head accesses will make it easier to
reason about the ring buffer state. Sequential consistency (total order) is the easiest to
reason about and often seems to be desired and expected by programmers (e.g. "I'll just
add a barrier here to ensure A happens before B in this thread, now there is a total order...").
- Ola
>
>
> Another thing to note: whatever final approach we choose -
> we need to make sure that the problem is addressed across all other
> rte_ring flavors/modes too (generic implementation, rts/hts mode, soring).
>
>
> Konstantin
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
    
    
More information about the dev
mailing list