rte_ring move head question for machines with relaxed MO (arm/ppc)

Wathsala Wathawana Vithanage wathsala.vithanage at arm.com
Fri Oct 11 02:11:16 CEST 2024
Previous message (by thread): rte_ring move head question for machines with relaxed MO (arm/ppc)
Next message (by thread): rte_ring move head question for machines with relaxed MO (arm/ppc)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev at huawei.com>
> Sent: Thursday, October 10, 2024 11:54 AM
> To: Wathsala Wathawana Vithanage <wathsala.vithanage at arm.com>;
> dev at dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>;
> jerinj at marvell.com; drc at linux.ibm.com; nd <nd at arm.com>; nd
> <nd at arm.com>
> Subject: RE: rte_ring move head question for machines with relaxed MO
> (arm/ppc)
> 
> 
> 
> > > > > > 1. rte_ring_generic_pvt.h:
> > > > > > =====================
> > > > > >
> > > > > > pseudo-c-code                                      //        related armv8 instructions
> > > > > > --------------------                                                 ----------------------------------
> ----
> > > > > >  head.load()                                          //        ldr [head]
> > > > > >  rte_smp_rmb()                                    //        dmb ishld
> > > > > >  opposite_tail.load()                            //        ldr [opposite_tail]
> > > > > >  ...
> > > > > >  rte_atomic32_cmpset(head, ...)      //        ldrex[head];... stlex[head]
> > > > > >
> > > > > >
> > > > > > 2. rte_ring_c11_pvt.h
> > > > > > =====================
> > > > > >
> > > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > > --------------------                                                 ----------------------------------
> ----
> > > > > > head.atomic_load(relaxed)                 //        ldr[head]
> > > > > > atomic_thread_fence(acquire)           //        dmb ish
> > > > > > opposite_tail.atomic_load(acquire)   //        lda[opposite_tail]
> > > > > > ...
> > > > > > head.atomic_cas(..., relaxed)              //        ldrex[haed]; ... strex[head]
> > > > > >
> > > > > >
> > > > > > 3.   rte_ring_hts_elem_pvt.h
> > > > > > ==========================
> > > > > >
> > > > > > pseudo-c-code                                       //        related armv8 instructions
> > > > > > --------------------                                                 ----------------------------------
> ----
> > > > > > head.atomic_load(acquire)                //        lda [head]
> > > > > > opposite_tail.load()                             //        ldr [opposite_tail]
> > > > > > ...
> > > > > > head.atomic_cas(..., acquire)            //         ldaex[head]; ... strex[head]
> > > > > >
> > > > > > The questions that arose from these observations:
> > > > > > a) are all 3 approaches equivalent in terms of functionality?
> > > > > Different, lda (Load with acquire semantics) and ldr (load) are different.
> > > >
> > > > I understand that, my question was:
> > > > lda {head]; ldr[tail]
> > > > vs
> > > > ldr [head]; dmb ishld; ldr [tail];
> > > >
> > > > Is there any difference in terms of functionality (memory ops
> > > ordering/observability)?
> > >
> > > To be more precise:
> > >
> > > lda {head]; ldr[tail]
> > > vs
> > > ldr [head]; dmb ishld; ldr [tail];
> > > vs
> > > ldr [head]; dmb ishld; lda [tail];
> > >
> > > what would be the difference between these 3 cases?
> >
> > Case A: lda {head]; ldr[tail]
> > load of the head will be observed by the memory subsystem before the
> > load of the tail.
> >
> > Case B: ldr [head]; dmb ishld; ldr [tail]; load of the head will be
> > observed by the memory subsystem Before the load of the tail.
> >
> >
> > Essentially both cases A and B are the same.
> > They preserve following program orders.
> > LOAD-LOAD
> > LOAD-STORE
> 
> Ok, that is crystal clear, thanks for explanation.
> 
> 
> > Case C: ldr [head]; dmb ishld; lda [tail]; load of the head will be
> > observed by the memory subsystem before the load of the tail.
> 
> Ok.
> 
> > In addition, any load or store program order after lda[tail] will not
> > be observed by the memory subsystem before the load of the tail.
> 
> Ok... the question is why we need that extra hoisting barrier here?
> From what unwanted  re-orderings we are protecting here?
> Does it mean that without it, ldrex/strex (CAS) can be reordered with
> load[cons.tail]?
> 
> Actually, we probably need to look at whole picture:
> 
> in rte_ring_generic_pvt.h
> =====================
> 
> ldr [prod.head]
> dmb ishld
> ldr [cons.tail]
> ...
> /* cas */
> ldrex [prod.head]
> stlex [prod.head]   /* sink barrier */
> 
> in rte_ring_c11_pvt.h
> =====================
> 
> ldr [prod.head]
> dmb ishld
> lda [cons.tail]          /* exrea hoist */
> ...
> /* cas */
> ldrex [prod.head]
> strex [prod.head]

Minor thing, ldrex and strex is how Arm 32 way of doing CAS.
Aaarch64 has a cas instruction.
Same code in aarch64 armv9-a https://godbolt.org/z/TMvWx6v4n


> 
> So, in _genereic_ we don't have that extra hoist barrier after load[con.tail], but
> we have extra sink barrier at cas(prod.tail).
> 

You are right, technically, that lda[cons.tail] is not required because due to the 
dependency chain up until CAS a memory reordering is not possible.
For that reason, it has no issue synchronizing with the strl[prod.tail] (in tail-update).
C11 standard calls it consume-memory-order (__ATOMIC_CONSUME in GCC).

So, ideally one could have written something like...
__atomic_load_n(prod.head, __ATOMIC_CONSUME);
instead of 
__atomic_load_n(prod.head, __ATOMIC_ACQUIRE);

The compiler is then supposed to figure out if there are any dependencies in
the code path to ensure that load of the prod.head is observed before any
load/store that's program order after the load of the prod.head.
If not, the compiler is supposed to add required barrier to ensure that order is 
preserved.

However, it's easier said than done. No, compiler implements it and C11 standard
discourages use of memory-order-consume for that reason.

This brings us to the next caveat. As per C11 standard, there cannot be a free standing
Store with release semantics (stlr)  that isn't paired with a load with acquire or consume
semantics. Since we can't use __ATOMIC_CONSUME (which would have resulted in 
ldr[prod.head]), we are forced to use __ATOMIC_ACQUIRE (which results in lda[prod.head]).


> If that's correct observation, can we change _c11_ implementation to match
> _generic_ one by:
> 
>  atomic_load(prod.head, releaxed);
>  atomic_thread_fence(acquire);
>  atomic_load(cons.tail, releaxed);
> ....
> atomic_cas(prod.head, release, relaxed); ?
> 
> From my understanding that should help to make these 2 implantations
> Identical, and then hopefully we can get rid of rte_ring_generic_pvt.h.
> 

They both are functionally correct.
_generic is correct but does not comply with the C11 specification. 
_c11 is correct and fully compliant with the C11 specification.

Replacing atomic_load(cons.tail, acquire) with load(cons.tail, relaxed) in
_c11 would make it non-compliant with C11 the spec.
Previous message (by thread): rte_ring move head question for machines with relaxed MO (arm/ppc)
Next message (by thread): rte_ring move head question for machines with relaxed MO (arm/ppc)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list