[dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Sun Mar 28 03:00:17 CEST 2021

Previous message (by thread): [dpdk-dev] [dpdk-stable] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
Next message (by thread): [dpdk-dev] [PATCH] eal/rwlocks: Try read/write and relock write to read locks added.
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

<snip>

> Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] rte_ring: fix racy
> dequeue/enqueue in ppc64
> 
> No reply after more than 2 years.
> Unfortunately it is probably outdated now.
> Classified as "Changes Requested".
Looking at the code, I think this patch does in fact fix a bug. I would appreciate it if the patch could be rebased.

The problem is already fixed in '__rte_ring_move_cons_head' but still needs to be fixed in '__rte_ring_move_prod_head'.
The C11 version does not have this problem because it uses acquire loads of cons.tail and prod.tail.
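
To illustrate the acquire/release pairing on the tail indexes, here is a minimal sketch. It is not the rte_ring code: toy_ring, toy_enqueue and toy_dequeue are hypothetical names, the ring is single-producer/single-consumer only, and it assumes the GCC/Clang __atomic builtins. The point is only that the acquire load of the other side's tail keeps the later slot access from being satisfied early.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u                 /* must be a power of two */
#define TOY_RING_MASK (TOY_RING_SIZE - 1u)

/* Hypothetical SP/SC ring for illustration; not struct rte_ring. */
struct toy_ring {
	uint32_t prod_tail;              /* written by producer only */
	uint32_t cons_tail;              /* written by consumer only */
	void *slots[TOY_RING_SIZE];
};

/* Producer side: returns 1 on success, 0 if the ring is full. */
static int toy_enqueue(struct toy_ring *r, void *obj)
{
	uint32_t tail, cons;

	assert(obj != NULL);             /* NULL is not a valid payload here */
	tail = __atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED);
	/* Acquire: pairs with the consumer's release store of cons_tail, so
	 * the consumer's read of a slot is complete before we overwrite it. */
	cons = __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);
	if (tail - cons == TOY_RING_SIZE)
		return 0;
	r->slots[tail & TOY_RING_MASK] = obj;
	/* Release: the slot store above is visible before the new tail. */
	__atomic_store_n(&r->prod_tail, tail + 1, __ATOMIC_RELEASE);
	return 1;
}

/* Consumer side: returns 1 and sets *obj on success, 0 if empty. */
static int toy_dequeue(struct toy_ring *r, void **obj)
{
	uint32_t head, prod;

	head = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);
	/* Acquire: the slot load below cannot be reordered before this load,
	 * which is the load-load ordering discussed in this thread. */
	prod = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);
	if (prod == head)
		return 0;
	*obj = r->slots[head & TOY_RING_MASK];
	/* Release: the slot read is done before the producer sees it free. */
	__atomic_store_n(&r->cons_tail, head + 1, __ATOMIC_RELEASE);
	return 1;
}
```

With a relaxed (plain) load of prod_tail in toy_dequeue(), a weakly ordered CPU would be allowed to read the slot before reading the tail, which matches the stale read described in the patch.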

> 
> 
> 17/07/2018 05:34, Jerin Jacob:
> > From: Takeshi Yoshimura <t.yoshimura8869 at gmail.com>
> >
> > Cc: olivier.matz at 6wind.com
> > Cc: chaozhu at linux.vnet.ibm.com
> > Cc: konstantin.ananyev at intel.com
> >
> > >
> > > > Adding rte_smp_rmb() causes a performance regression on non-x86
> > > > platforms.
> > > > Having said that, a load-load barrier can be expressed very well
> > > > with the C11 memory model. I guess ppc64 supports the C11 memory
> > > > model. If so, could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for
> > > > ppc64 and check the original issue?
> > >
> > > Yes, the performance regression happens on non-x86 with single
> > > producer/consumer.
> > > The average latency of an enqueue was increased from 21 nsec to 24
> > > nsec in my simple experiment. But, I think it is worth it.
> >
> > That varies from machine to machine. What is the burst size, etc.?
> >
> > >
> > >
> > > I also tested the C11 rte_ring; however, it caused the same race
> > > condition in ppc64.
> > > I tried to fix the C11 problem as well, but I also found the C11
> > > rte_ring had other potential incorrect choices of memory orders,
> > > which caused another race condition in ppc64.
> >
> > Does it happen on all ppc64 machines, or on a specific machine?
> > Are the following tests passing on your system without the patch?
> > test/test/test_ring_perf.c
> > test/test/test_ring.c
> >
> > >
> > > For example,
> > > __ATOMIC_ACQUIRE is passed to __atomic_compare_exchange_n(), but I
> > > am not sure why the load-acquire is used for the compare exchange.
> >
> > It is correct as per C11 acquire and release semantics.
> >
> > > Also in update_tail, the pause can be called before the data copy
> > > because of ht->tail load without atomic_load_n.
> > >
> > > The memory order is simply difficult, so it might take a bit longer
> > > time to check if the code is correct. I think I can fix the C11
> > > rte_ring as another patch.
> > >
> > > >>
> > > >> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
> > > >> It uses a single consumer and multiple producers for a rte_ring.
> > > >> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
> > > >
> > > > Adding rte_smp_rmb() causes a performance regression on non-x86
> > > > platforms.
> > > > Having said that, a load-load barrier can be expressed very well
> > > > with the C11 memory model. I guess ppc64 supports the C11 memory
> > > > model. If so, could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for
> > > > ppc64 and check the original issue?
> > > >
> > > >>
> > > >> The reordered loads happened on r->prod.tail in
> >
> > There is rte_smp_rmb() just before reading r->prod.tail in
> >         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > __rte_ring_move_cons_head(). Would that not satisfy the requirement?
> >
> > Can you try adding a compiler barrier and see whether the compiler is
> > reordering the accesses?
> >
> > DPDK's ring implementation is based on FreeBSD's ring implementation; I
> > don't see the need for such a barrier.
> >
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
> >
> > If it is something specific to ppc64 or a specific ppc64 machine, we
> > could add a compile option as it is arch specific to avoid performance
> > impact on other architectures.
> >
> > > >> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
> > > >> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
> > > >> dependency, but the code does not satisfy it. Note that they are
> > > >> not reordered if __rte_ring_move_cons_head() with is_sc != 1
> > > >> because cmpset invokes a read barrier.
> > > >>
> > > >> The paired stores on these loads are in ENQUEUE_PTRS() and
> > > >> update_tail(). Simplified code around the reorder is the following.
> > > >>
> > > >> Consumer             Producer
> > > >> load idx[ring]
> > > >>                      store idx[ring]
> > > >>                      store r->prod.tail
> > > >> load r->prod.tail
> > > >>
> > > >> In this case, the consumer loads old idx[ring] and confirms the
> > > >> load is valid with the new r->prod.tail.
> > > >>
> > > >> I added a read barrier in the case where __IS_SC is passed to
> > > >> __rte_ring_move_cons_head(). I also fixed
> > > >> __rte_ring_move_prod_head() to avoid similar problems with a single
> > > >> producer.
> > > >>
> > > >> Cc: stable at dpdk.org
> > > >>
> > > >> Signed-off-by: Takeshi Yoshimura <tyos at jp.ibm.com>
> > > >> ---
> > > >>  lib/librte_ring/rte_ring_generic.h | 10 ++++++----
> > > >>  1 file changed, 6 insertions(+), 4 deletions(-)
> > > >>
> > > >> diff --git a/lib/librte_ring/rte_ring_generic.h
> > > >> b/lib/librte_ring/rte_ring_generic.h
> > > >> index ea7dbe5b9..477326180 100644
> > > >> --- a/lib/librte_ring/rte_ring_generic.h
> > > >> +++ b/lib/librte_ring/rte_ring_generic.h
> > > >> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
> > > >>                         return 0;
> > > >>
> > > >>                 *new_head = *old_head + n;
> > > >> -               if (is_sp)
> > > >> +               if (is_sp) {
> > > >> +                       rte_smp_rmb();
> > > >>                         r->prod.head = *new_head, success = 1;
> > > >> -               else
> > > >> +               } else
> > > >>                         success = rte_atomic32_cmpset(&r->prod.head,
> > > >>                                         *old_head, *new_head);
> > > >>         } while (unlikely(success == 0));
> > > >> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
> > > >>                         return 0;
> > > >>
> > > >>                 *new_head = *old_head + n;
> > > >> -               if (is_sc)
> > > >> +               if (is_sc) {
> > > >> +                       rte_smp_rmb();
> > > >>                         r->cons.head = *new_head, success = 1;
> > > >> -               else
> > > >> +               } else
> > > >>                         success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> > > >>                                         *new_head);
> > > >>         } while (unlikely(success == 0));
> > > >> --
> > > >> 2.17.1
> 
>
