[dpdk-dev] [PATCH v3 2/3] ring: synchronize the load and store of the tail

Justin He Justin.He at arm.com
Wed Sep 26 11:59:02 CEST 2018


Hi Gavin

> -----Original Message-----
> From: Gavin Hu (Arm Technology China)
> Sent: 2018年9月26日 17:30
> To: Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>; dev at dpdk.org
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>; Steve Capper
> <Steve.Capper at arm.com>; Ola Liljedahl <Ola.Liljedahl at arm.com>;
> jerin.jacob at caviumnetworks.com; nd <nd at arm.com>; stable at dpdk.org; Justin
> He <Justin.He at arm.com>
> Subject: RE: [PATCH v3 2/3] ring: synchronize the load and store of the tail
>
> +Justin He for review.
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu at arm.com>
> > Sent: Monday, September 17, 2018 4:17 PM
> > To: dev at dpdk.org
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli at arm.com>; Steve Capper
> > <Steve.Capper at arm.com>; Ola Liljedahl <Ola.Liljedahl at arm.com>;
> > jerin.jacob at caviumnetworks.com; nd <nd at arm.com>; stable at dpdk.org
> > Subject: [PATCH v3 2/3] ring: synchronize the load and store of the
> > tail
> >
> > Synchronize the load-acquire of the tail and the store-release within
> > update_tail, the store release ensures all the ring operations,
> > enqueue or dequeue, are seen by the observers on the other side as
> > soon as they see the updated tail. The load-acquire is needed here as
> > the data dependency is not a reliable way for ordering as the compiler
> > might break it by saving to temporary values to boost performance.
> > When computing the free_entries and avail_entries, use atomic
> > semantics to load the heads and tails instead.
> >
> > The patch was benchmarked with test/ring_perf_autotest and it
> > decreases the enqueue/dequeue latency by 5% ~ 27.6% with two lcores,
> > the real gains are dependent on the number of lcores, depth of the ring, SPSC
> or MPMC.
> > For 1 lcore, it also improves a little, about 3 ~ 4%.
> > It is a big improvement, in case of MPMC, with two lcores and ring
> > size of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.
> >
> > This patch is a bug fix, while the improvement is a bonus. In our
> > analysis the improvement comes from the cacheline pre-filling after
> > hoisting load- acquire from _atomic_compare_exchange_n up above.
> >
> > The test command:
> > $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4
> > --socket-mem=\
> > 1024 -- -i
> >
> > Test result with this patch(two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.86  MP/MC bulk enq/dequeue (size:
> > 8): 10.15  SP/SC bulk enq/dequeue (size:
> > 32): 1.94  MP/MC bulk enq/dequeue (size: 32): 2.36
> >
> > In comparison of the test result without this patch:
> >  SP/SC bulk enq/dequeue (size: 8): 6.67  MP/MC bulk enq/dequeue (size:
> > 8): 13.12  SP/SC bulk enq/dequeue (size:
> > 32): 2.04  MP/MC bulk enq/dequeue (size: 32): 3.26
> >
> > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > Cc: stable at dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu at arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> > Reviewed-by: Steve Capper <steve.capper at arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl at arm.com>
> > ---
> >  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> > b/lib/librte_ring/rte_ring_c11_mem.h
> > index 234fea0..0eae3b3 100644
> > --- a/lib/librte_ring/rte_ring_c11_mem.h
> > +++ b/lib/librte_ring/rte_ring_c11_mem.h
> > @@ -68,13 +68,18 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> > unsigned int is_sp,
> >  *old_head = __atomic_load_n(&r->prod.head,
> >  __ATOMIC_ACQUIRE);
> >
> > -/*
> > - *  The subtraction is done between two unsigned 32bits
> > value
> > +/* load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> > +
> > __ATOMIC_ACQUIRE);
I ever noticed that freebsd also used the double __atomic_load_n. And as we
discussed it several months ago [1], seems the second load_acquire is not
necessary. But as you verified, it looks good to me
[1] http://mails.dpdk.org/archives/dev/2017-November/080983.html
So,
Reviewed-by: Jia He <justin.he at arm.com>

Cheers,
Justin (Jia He)
> > +
> > +/* The subtraction is done between two unsigned 32bits
> > value
> >   * (the result is always modulo 32 bits even if we have
> >   * *old_head > cons_tail). So 'free_entries' is always between 0
> >   * and capacity (which is < size).
> >   */
> > -*free_entries = (capacity + r->cons.tail - *old_head);
> > +*free_entries = (capacity + cons_tail - *old_head);
> >
> >  /* check that we have enough room in ring */
> >  if (unlikely(n > *free_entries))
> > @@ -132,15 +137,22 @@ __rte_ring_move_cons_head(struct rte_ring *r,
> > int is_sc,
> >  do {
> >  /* Restore n as it may change every loop */
> >  n = max;
> > +
> >  *old_head = __atomic_load_n(&r->cons.head,
> >  __ATOMIC_ACQUIRE);
> >
> > +/* this load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> > +__ATOMIC_ACQUIRE);
> > +
> >  /* The subtraction is done between two unsigned 32bits value
> >   * (the result is always modulo 32 bits even if we have
> >   * cons_head > prod_tail). So 'entries' is always between 0
> >   * and size(ring)-1.
> >   */
> > -*entries = (r->prod.tail - *old_head);
> > +*entries = (prod_tail - *old_head);
> >
> >  /* Set the actual entries for dequeue */
> >  if (n > *entries)
> > --
> > 2.7.4

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


More information about the dev mailing list