[dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load

Gavin Hu (Arm Technology China) Gavin.Hu at arm.com
Mon Oct 8 12:41:27 CEST 2018


This is ThunderX2.

> -----Original Message-----
> From: Ola Liljedahl
> Sent: Monday, October 8, 2018 6:39 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>; Jerin Jacob
> <jerin.jacob at caviumnetworks.com>
> Cc: dev at dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli at arm.com>; Ananyev, Konstantin
> <konstantin.ananyev at intel.com>; Steve Capper <Steve.Capper at arm.com>;
> nd <nd at arm.com>; stable at dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> 
> On 08/10/2018, 12:33, "Gavin Hu (Arm Technology China)"
> <Gavin.Hu at arm.com> wrote:
> 
>     I did benchmarking w/o and w/ the patch; it did not show any noticeable
>     differences in terms of latency.
> Which platform is this?
> 
> 
>     Here is the full log (3 runs w/o the patch and 2 runs w/ the patch).
> 
>     sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024 -- -i
> 
>     RTE>>ring_perf_autotest (#1 run of test without the patch)
>     ### Testing single element and burst enq/deq ###
>     SP/SC single enq/dequeue: 8
>     MP/MC single enq/dequeue: 10
>     SP/SC burst enq/dequeue (size: 8): 1
>     MP/MC burst enq/dequeue (size: 8): 1
>     SP/SC burst enq/dequeue (size: 32): 0
>     MP/MC burst enq/dequeue (size: 32): 0
> 
>     ### Testing empty dequeue ###
>     SC empty dequeue: 0.24
>     MC empty dequeue: 0.24
> 
>     ### Testing using a single lcore ###
>     SP/SC bulk enq/dequeue (size: 8): 1.40
>     MP/MC bulk enq/dequeue (size: 8): 1.74
>     SP/SC bulk enq/dequeue (size: 32): 0.59
>     MP/MC bulk enq/dequeue (size: 32): 0.68
> 
>     ### Testing using two hyperthreads ###
>     SP/SC bulk enq/dequeue (size: 8): 2.49
>     MP/MC bulk enq/dequeue (size: 8): 3.09
>     SP/SC bulk enq/dequeue (size: 32): 1.07
>     MP/MC bulk enq/dequeue (size: 32): 1.13
> 
>     ### Testing using two physical cores ###
>     SP/SC bulk enq/dequeue (size: 8): 6.55
>     MP/MC bulk enq/dequeue (size: 8): 12.99
>     SP/SC bulk enq/dequeue (size: 32): 1.97
>     MP/MC bulk enq/dequeue (size: 32): 3.41
> 
>     RTE>>ring_perf_autotest(#2 run of test without the patch)
>     ### Testing single element and burst enq/deq ###
>     SP/SC single enq/dequeue: 8
>     MP/MC single enq/dequeue: 10
>     SP/SC burst enq/dequeue (size: 8): 1
>     MP/MC burst enq/dequeue (size: 8): 1
>     SP/SC burst enq/dequeue (size: 32): 0
>     MP/MC burst enq/dequeue (size: 32): 0
> 
>     ### Testing empty dequeue ###
>     SC empty dequeue: 0.24
>     MC empty dequeue: 0.24
> 
>     ### Testing using a single lcore ###
>     SP/SC bulk enq/dequeue (size: 8): 1.40
>     MP/MC bulk enq/dequeue (size: 8): 1.74
>     SP/SC bulk enq/dequeue (size: 32): 0.59
>     MP/MC bulk enq/dequeue (size: 32): 0.68
> 
>     ### Testing using two hyperthreads ###
>     SP/SC bulk enq/dequeue (size: 8): 2.50
>     MP/MC bulk enq/dequeue (size: 8): 3.08
>     SP/SC bulk enq/dequeue (size: 32): 1.07
>     MP/MC bulk enq/dequeue (size: 32): 1.13
> 
>     ### Testing using two physical cores ###
>     SP/SC bulk enq/dequeue (size: 8): 6.57
>     MP/MC bulk enq/dequeue (size: 8): 13.00
>     SP/SC bulk enq/dequeue (size: 32): 1.98
>     MP/MC bulk enq/dequeue (size: 32): 3.41
>     Test OK
> 
>     RTE>>ring_perf_autotest(#3 run of test without the patch)
>     ### Testing single element and burst enq/deq ###
>     SP/SC single enq/dequeue: 8
>     MP/MC single enq/dequeue: 10
>     SP/SC burst enq/dequeue (size: 8): 1
>     MP/MC burst enq/dequeue (size: 8): 1
>     SP/SC burst enq/dequeue (size: 32): 0
>     MP/MC burst enq/dequeue (size: 32): 0
> 
>     ### Testing empty dequeue ###
>     SC empty dequeue: 0.24
>     MC empty dequeue: 0.24
> 
>     ### Testing using a single lcore ###
>     SP/SC bulk enq/dequeue (size: 8): 1.40
>     MP/MC bulk enq/dequeue (size: 8): 1.74
>     SP/SC bulk enq/dequeue (size: 32): 0.59
>     MP/MC bulk enq/dequeue (size: 32): 0.68
> 
>     ### Testing using two hyperthreads ###
>     SP/SC bulk enq/dequeue (size: 8): 2.49
>     MP/MC bulk enq/dequeue (size: 8): 3.08
>     SP/SC bulk enq/dequeue (size: 32): 1.07
>     MP/MC bulk enq/dequeue (size: 32): 1.13
> 
>     ### Testing using two physical cores ###
>     SP/SC bulk enq/dequeue (size: 8): 6.55
>     MP/MC bulk enq/dequeue (size: 8): 12.96
>     SP/SC bulk enq/dequeue (size: 32): 1.99
>     MP/MC bulk enq/dequeue (size: 32): 3.37
> 
> 
>     RTE>>ring_perf_autotest(#1 run of the test with the patch)
>     ### Testing single element and burst enq/deq ###
>     SP/SC single enq/dequeue: 8
>     MP/MC single enq/dequeue: 10
>     SP/SC burst enq/dequeue (size: 8): 1
>     MP/MC burst enq/dequeue (size: 8): 1
>     SP/SC burst enq/dequeue (size: 32): 0
>     MP/MC burst enq/dequeue (size: 32): 0
> 
>     ### Testing empty dequeue ###
>     SC empty dequeue: 0.24
>     MC empty dequeue: 0.32
> 
>     ### Testing using a single lcore ###
>     SP/SC bulk enq/dequeue (size: 8): 1.43
>     MP/MC bulk enq/dequeue (size: 8): 1.74
>     SP/SC bulk enq/dequeue (size: 32): 0.60
>     MP/MC bulk enq/dequeue (size: 32): 0.68
> 
>     ### Testing using two hyperthreads ###
>     SP/SC bulk enq/dequeue (size: 8): 2.53
>     MP/MC bulk enq/dequeue (size: 8): 3.09
>     SP/SC bulk enq/dequeue (size: 32): 1.06
>     MP/MC bulk enq/dequeue (size: 32): 1.18
> 
>     ### Testing using two physical cores ###
>     SP/SC bulk enq/dequeue (size: 8): 6.57
>     MP/MC bulk enq/dequeue (size: 8): 12.93
>     SP/SC bulk enq/dequeue (size: 32): 1.98
>     MP/MC bulk enq/dequeue (size: 32): 3.44
>     Test OK
>     RTE>>ring_perf_autotest (#2 run of the test with the patch)
>     ### Testing single element and burst enq/deq ###
>     SP/SC single enq/dequeue: 8
>     MP/MC single enq/dequeue: 10
>     SP/SC burst enq/dequeue (size: 8): 1
>     MP/MC burst enq/dequeue (size: 8): 1
>     SP/SC burst enq/dequeue (size: 32): 0
>     MP/MC burst enq/dequeue (size: 32): 0
> 
>     ### Testing empty dequeue ###
>     SC empty dequeue: 0.24
>     MC empty dequeue: 0.32
> 
>     ### Testing using a single lcore ###
>     SP/SC bulk enq/dequeue (size: 8): 1.43
>     MP/MC bulk enq/dequeue (size: 8): 1.74
>     SP/SC bulk enq/dequeue (size: 32): 0.60
>     MP/MC bulk enq/dequeue (size: 32): 0.68
> 
>     ### Testing using two hyperthreads ###
>     SP/SC bulk enq/dequeue (size: 8): 2.53
>     MP/MC bulk enq/dequeue (size: 8): 3.09
>     SP/SC bulk enq/dequeue (size: 32): 1.06
>     MP/MC bulk enq/dequeue (size: 32): 1.18
> 
>     ### Testing using two physical cores ###
>     SP/SC bulk enq/dequeue (size: 8): 6.57
>     MP/MC bulk enq/dequeue (size: 8): 12.93
>     SP/SC bulk enq/dequeue (size: 32): 1.98
>     MP/MC bulk enq/dequeue (size: 32): 3.46
>     Test OK
> 
>     > -----Original Message-----
>     > From: Ola Liljedahl
>     > Sent: Monday, October 8, 2018 6:26 PM
>     > To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>     > Cc: dev at dpdk.org; Honnappa Nagarahalli
>     > <Honnappa.Nagarahalli at arm.com>; Ananyev, Konstantin
>     > <konstantin.ananyev at intel.com>; Gavin Hu (Arm Technology China)
>     > <Gavin.Hu at arm.com>; Steve Capper <Steve.Capper at arm.com>; nd
>     > <nd at arm.com>; stable at dpdk.org
>     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >
>     >
>     >
>     > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob at caviumnetworks.com>
>     > wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
>     >     > From: Ola Liljedahl <Ola.Liljedahl at arm.com>
>     >     > To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>     >     > CC: "dev at dpdk.org" <dev at dpdk.org>, Honnappa Nagarahalli
>     >     >  <Honnappa.Nagarahalli at arm.com>, "Ananyev, Konstantin"
>     >     >  <konstantin.ananyev at intel.com>, "Gavin Hu (Arm Technology
> China)"
>     >     >  <Gavin.Hu at arm.com>, Steve Capper <Steve.Capper at arm.com>, nd
>     > <nd at arm.com>,
>     >     >  "stable at dpdk.org" <stable at dpdk.org>
>     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >
>     >     > External Email
>     >     >
>     >     > On 08/10/2018, 08:06, "Jerin Jacob"
> <jerin.jacob at caviumnetworks.com>
>     > wrote:
>     >     >
>     >     >     -----Original Message-----
>     >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
>     >     >     > From: Ola Liljedahl <Ola.Liljedahl at arm.com>
>     >     >     > To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>     >     >     > CC: "dev at dpdk.org" <dev at dpdk.org>, Honnappa Nagarahalli
>     >     >     >  <Honnappa.Nagarahalli at arm.com>, "Ananyev, Konstantin"
>     >     >     >  <konstantin.ananyev at intel.com>, "Gavin Hu (Arm Technology
>     > China)"
>     >     >     >  <Gavin.Hu at arm.com>, Steve Capper <Steve.Capper at arm.com>,
> nd
>     > <nd at arm.com>,
>     >     >     >  "stable at dpdk.org" <stable at dpdk.org>
>     >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
>     >     >     >
>     >     >
>     >     >
>     >     >     Could you please fix the email client for inline reply.
>     >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or Office365.
>     >     > The official Office365/Outlook documentation doesn't match the actual user
>     >     > interface...
>     >     >
>     >     >
>     >     >
>     >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-clients.html
>     >     >
>     >     >
>     >     >     >
>     >     >     > On 07/10/2018, 06:03, "Jerin Jacob"
>     > <jerin.jacob at caviumnetworks.com> wrote:
>     >     >     >
>     >     >     >     In the arm64 case, it will have ATOMIC_RELAXED followed by the asm
>     >     >     >     volatile ("":::"memory") of rte_pause().
>     >     >     >     I wouldn't have any issue if the generated code were the same as or
>     >     >     >     better than the existing case, but that is not the case, right?
>     >     >     > The existing case is actually not interesting (IMO) as it exposes undefined
>     >     >     > behaviour which allows the compiler to do anything. But you seem to be
>     >     >     > satisfied with "works for me, right here right now". I think the cost of
>     >     >     > avoiding undefined behaviour is acceptable (actually I don't think it even
>     >     >     > will be noticeable).
>     >     >
>     >     >     I am not convinced, because of the use of volatile in the head and tail
>     >     >     indexes. For me that brings the defined behavior.
>     >     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses to
>     >     > volatile objects), it is AFAIK defined behaviour (but not necessarily using
>     >     > atomic loads and stores). But I quoted the C11 spec where it explicitly
>     >     > mentions that mixing atomic and non-atomic accesses to the same object is
>     >     > undefined behaviour. Don't argue with me, argue with the C11 spec.
>     >     > If you want to disobey the spec, this should at least be called out in the
>     >     > code with a comment.
>     >
>     >     That boils down to only one question: should we follow the C11 spec? Why not
>     >     just take load-acquire and store-release semantics, like the Linux kernel and
>     >     FreeBSD do?
>     > And introduce even more undefined behaviour?
>     >
>     >     The C11 memory model does not look super efficient in terms of the gcc
>     >     implementation.
>     > You are making a chicken out of a feather.
>     >
>     > I think this "problem" with one additional ADD instruction will only concern
>     > __atomic_load_n(__ATOMIC_RELAXED) and __atomic_store_n(__ATOMIC_RELAXED),
>     > because the compiler separates the address generation (adding the offset of the
>     > struct member) from the load or store itself. For other atomic operations and
>     > memory orderings (e.g. __atomic_load_n(__ATOMIC_ACQUIRE)), the extra ADD
>     > instruction will be included anyway (as long as we access a non-first struct
>     > member) because e.g. LDAR only accepts a base register with no offset.
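>     >
>     > To make this concrete, here is a minimal sketch (illustrative only, not the
>     > actual rte_ring code, and assuming GCC-style __atomic builtins) of the three
>     > ways of reading a non-first struct member discussed here:
>     >
>     >     #include <stdint.h>
>     >
>     >     struct headtail {          /* made-up type for illustration */
>     >         uint32_t head;
>     >         uint32_t tail;         /* non-first member: address is base + 4 */
>     >     };
>     >
>     >     static inline uint32_t load_tail_plain(const struct headtail *ht)
>     >     {
>     >         /* plain read: the offset folds into the load, e.g. LDR w0, [x0, #4] */
>     >         return ht->tail;
>     >     }
>     >
>     >     static inline uint32_t load_tail_relaxed(const struct headtail *ht)
>     >     {
>     >         /* the case being measured: some compiler versions emit a separate
>     >          * ADD for the member offset before the relaxed load */
>     >         return __atomic_load_n(&ht->tail, __ATOMIC_RELAXED);
>     >     }
>     >
>     >     static inline uint32_t load_tail_acquire(const struct headtail *ht)
>     >     {
>     >         /* LDAR only takes a base register, so ADD x0, x0, #4 ; LDAR w0, [x0]
>     >          * is needed no matter how the source is written */
>     >         return __atomic_load_n(&ht->tail, __ATOMIC_ACQUIRE);
>     >     }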
>     >
>     > I suggest that minimising the imposed memory orderings can have a much larger
>     > (positive) effect on performance than avoiding one ADD instruction (memory
>     > accesses are much slower than CPU ALU instructions).
>     > Using the C11 memory model and identifying exactly which objects are used for
>     > synchronisation, and whether (any) updates to shared memory are acquired or
>     > released (no updates to shared memory means relaxed order can be used), will
>     > provide maximum freedom to the compiler and hardware to get the best result.
>     >
>     > The FreeBSD and DPDK ring buffers show some fundamental
>     > misunderstandings here. Instead excessive orderings and explicit barriers
>     > have been used as band-aids, with unknown effects on performance.
>     >
>     >
>     >     >
>     >     >
>     >     >     That's the reason why I shared the generated assembly code. If you think
>     >     >     otherwise, pick any compiler and see the generated output.
>     >     > This is what one compiler for one architecture generates today. These things
>     >     > change. Other things that used to work, or worked for some specific
>     >     > architecture, have stopped working in newer versions of the compiler.
>     >     >
>     >     >
>     >     >     And the FreeBSD implementation of the ring buffer (which DPDK derived
>     >     >     from) doesn't have such logic, see
>     >     >     https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
>     >     > It looks like FreeBSD uses some kind of C11 atomic memory model-inspired API,
>     >     > although I don't see exactly how e.g. atomic_store_rel_int() is implemented.
>     >     > The code also mixes in explicit barriers, so definitely not pure C11 memory
>     >     > model usage. And finally, it doesn't establish the proper
>     >     > load-acquire/store-release relationships (e.g. a store-release of cons_tail
>     >     > requires a load-acquire of cons_tail; same for prod_tail).
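>     >     >
>     >     > For illustration, the pairing I mean looks roughly like this (a sketch with
>     >     > made-up names, not the buf_ring or rte_ring code, using GCC __atomic builtins):
>     >     >
>     >     >     #include <stdint.h>
>     >     >
>     >     >     struct toy_ring {
>     >     >         uint32_t prod_head, prod_tail;
>     >     >         uint32_t cons_head, cons_tail;
>     >     >     };
>     >     >
>     >     >     /* consumer: publish progress after the objects have been read out */
>     >     >     static inline void cons_publish(struct toy_ring *r, uint32_t new_tail)
>     >     >     {
>     >     >         __atomic_store_n(&r->cons_tail, new_tail, __ATOMIC_RELEASE);
>     >     >     }
>     >     >
>     >     >     /* producer: the release above only synchronises with something if the
>     >     >      * read of cons_tail used to compute free space is an acquire */
>     >     >     static inline uint32_t prod_read_cons_tail(const struct toy_ring *r)
>     >     >     {
>     >     >         return __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);
>     >     >     }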
>     >     >
>     >     > "* multi-producer safe lock-free ring buffer enqueue"
>     >     > The comment is also wrong. This design is not lock-free, how could it
> be
>     > when there is spinning
>     >     > (waiting) for other threads in the code? If a thread must wait for
> other
>     > threads, then by definition
>     >     > the design is blocking.
>     >     >
>     >     > So you are saying that because FreeBSD is doing it wrong, DPDK can
> also
>     > do it wrong?
>     >     >
>     >     >
>     >     >     See below too.
>     >     >
>     >     >     >
>     >     >     > Skipping the compiler memory barrier in rte_pause() potentially allows for
>     >     >     > optimisations that provide much more benefit, e.g. hiding some cache miss
>     >     >     > latency for later loads. The DPDK ring buffer implementation is defined so as
>     >     >     > to enable inlining of the enqueue/dequeue functions into the caller, so any
>     >     >     > code could immediately follow these calls.
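>     >     >     >
>     >     >     > For example (a sketch of the shape of the tail-update wait loop, assuming
>     >     >     > rte_pause() from rte_pause.h and <stdint.h>; not the exact DPDK code):
>     >     >     >
>     >     >     >     static inline void
>     >     >     >     wait_for_tail(const volatile uint32_t *tail, uint32_t old_val)
>     >     >     >     {
>     >     >     >         /* the relaxed atomic load itself forces a fresh read each
>     >     >     >          * iteration, so correctness no longer depends on the compiler
>     >     >     >          * barrier hidden inside rte_pause() */
>     >     >     >         while (__atomic_load_n(tail, __ATOMIC_RELAXED) != old_val)
>     >     >     >             rte_pause();
>     >     >     >     }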
>     >     >     >
>     >     >     > From INTERNATIONAL STANDARD ISO/IEC 9899:201x
>     >     >     > Programming languages — C
>     >     >     >
>     >     >     > 5.1.2.4
>     >     >     > 4 Two expression evaluations conflict if one of them modifies a memory
>     >     >     > location and the other one reads or modifies the same memory location.
>     >     >     >
>     >     >     > 25 The execution of a program contains a data race if it contains two
>     >     >     > conflicting actions in different threads, at least one of which is not
>     >     >     > atomic, and neither happens before the other. Any such data race results in
>     >     >     > undefined behavior.
>     >     >
>     >     >     IMO, both conditions will be satisfied if the variable is volatile and a
>     >     >     32-bit read is atomic on 32b and 64b machines. If not, the problem persists
>     >     >     for the generic case as well (lib/librte_ring/rte_ring_generic.h).
>     >     > The read from a volatile object is not an atomic access per the C11 spec. It
>     >     > just happens to be translated to an instruction (on x86-64 and AArch64/A64)
>     >     > that implements an atomic load. I don't think any compiler would change this
>     >     > code generation and suddenly generate some non-atomic load instruction for a
>     >     > program that *only* uses volatile to do "atomic" accesses.
>     >     > But a future compiler could detect the mix of atomic and non-atomic accesses
>     >     > and mark this expression as causing undefined behaviour and that would have
>     >     > consequences for code generation.
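>     >     >
>     >     > A tiny sketch of the mix in question (illustrative names, not the rte_ring
>     >     > declarations):
>     >     >
>     >     >     #include <stdint.h>
>     >     >
>     >     >     struct ht { volatile uint32_t tail; };
>     >     >
>     >     >     static inline uint32_t read_plain(const struct ht *r)
>     >     >     {
>     >     >         return r->tail;   /* plain access: not a C11 atomic access */
>     >     >     }
>     >     >
>     >     >     static inline uint32_t read_atomic(const struct ht *r)
>     >     >     {
>     >     >         return __atomic_load_n(&r->tail, __ATOMIC_RELAXED);
>     >     >     }
>     >     >
>     >     >     /* If another thread updates tail (even with an atomic store) while a
>     >     >      * reader uses read_plain(), the plain read is a non-atomic conflicting
>     >     >      * action, which 5.1.2.4 p25 above classifies as a data race. */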
>     >     >
>     >     >
>     >     >     I agree with you on the C11 memory model semantics usage. The reason why I
>     >     >     proposed naming the file rte_ring_c11_mem.h is that DPDK itself did not have
>     >     >     definitions for load-acquire and store-release semantics.
>     >     >     I was looking at taking load-acquire and store-release semantics from C11
>     >     >     instead of creating a new API like the Linux kernel or FreeBSD (APIs like
>     >     >     atomic_load_acq_32(), atomic_store_rel_32()). If the file name is your
>     >     >     concern then we could create new abstractions as well. That would help the
>     >     >     existing KNI problem as well.
>     >     > I appreciate your embrace of the C11 memory model. I think it is better for
>     >     > describing (both to the compiler and to humans) which objects are used for
>     >     > synchronisation and how.
>     >     >
>     >     > However, I don't think an API as you suggest (and others have suggested
>     >     > before, e.g. as done in ODP) is a good idea. There is an infinite amount of
>     >     > possible base types, an increasing number of operations and a bunch of
>     >     > different memory orderings; a "complete" API would be very large and difficult
>     >     > to test, and most members of the API would never be used.
>     >     > GCC and Clang both support the __atomic intrinsics. This API avoids the
>     >     > problems I described above. Or we could use the official C11 syntax
>     >     > (stdatomic.h). But then we have the problem of using pre-C11 compilers...
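>     >     >
>     >     > For comparison, the stdatomic.h spelling would look something like the sketch
>     >     > below (C11 compilers only, which is exactly the portability concern):
>     >     >
>     >     >     #include <stdatomic.h>
>     >     >     #include <stdint.h>
>     >     >
>     >     >     static inline uint32_t load_tail(_Atomic uint32_t *tail)
>     >     >     {
>     >     >         return atomic_load_explicit(tail, memory_order_acquire);
>     >     >     }
>     >     >
>     >     >     static inline void store_tail(_Atomic uint32_t *tail, uint32_t v)
>     >     >     {
>     >     >         atomic_store_explicit(tail, v, memory_order_release);
>     >     >     }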
>     >
>     >     I have no objection if everyone agrees to move to the C11 memory model with
>     >     __atomic intrinsics. But if we need to keep both, then an atomic_load_acq_32()
>     >     kind of API makes sense.
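>     >
>     >     Such wrappers could be thin shims over the __atomic builtins, e.g. (hypothetical
>     >     definitions mirroring the FreeBSD names; DPDK does not define these today,
>     >     assuming <stdint.h>):
>     >
>     >         static inline uint32_t atomic_load_acq_32(const volatile uint32_t *p)
>     >         {
>     >             return __atomic_load_n(p, __ATOMIC_ACQUIRE);
>     >         }
>     >
>     >         static inline void atomic_store_rel_32(volatile uint32_t *p, uint32_t v)
>     >         {
>     >             __atomic_store_n(p, v, __ATOMIC_RELEASE);
>     >         }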
>     >
>     >
>     >     >
>     >     >
>     >     >
>     >     >
>     >     >     I think it is currently mixed usage because the same variable declaration is
>     >     >     used for C11 vs non-C11 usage. Ideally we won't need "volatile" for the C11
>     >     >     case. Either we need to change only to C11 mode, OR have APIs for
>     >     >     atomic_load_acq_() and atomic_store_rel_() to allow both models, like the
>     >     >     Linux kernel and FreeBSD.
>     >     >
>     >     >     >
>     >     >     > -- Ola
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >
> 
> 


