[dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load

Ola Liljedahl Ola.Liljedahl at arm.com
Mon Oct 8 12:39:10 CEST 2018


On 08/10/2018, 12:33, "Gavin Hu (Arm Technology China)" <Gavin.Hu at arm.com> wrote:

    I did benchmarking w/o and w/ the patch; it did not show any noticeable differences in terms of latency.
Which platform is this?


    Here is the full log (3 runs w/o the patch and 2 runs w/ the patch).
    
    sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024  -- -i
    
    RTE>>ring_perf_autotest (#1 run of test without the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.24
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.40
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.59
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.49
    MP/MC bulk enq/dequeue (size: 8): 3.09
    SP/SC bulk enq/dequeue (size: 32): 1.07
    MP/MC bulk enq/dequeue (size: 32): 1.13
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.55
    MP/MC bulk enq/dequeue (size: 8): 12.99
    SP/SC bulk enq/dequeue (size: 32): 1.97
    MP/MC bulk enq/dequeue (size: 32): 3.41
    
    RTE>>ring_perf_autotest(#2 run of test without the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.24
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.40
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.59
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.50
    MP/MC bulk enq/dequeue (size: 8): 3.08
    SP/SC bulk enq/dequeue (size: 32): 1.07
    MP/MC bulk enq/dequeue (size: 32): 1.13
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.57
    MP/MC bulk enq/dequeue (size: 8): 13.00
    SP/SC bulk enq/dequeue (size: 32): 1.98
    MP/MC bulk enq/dequeue (size: 32): 3.41
    Test OK
    
    RTE>>ring_perf_autotest(#3 run of test without the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.24
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.40
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.59
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.49
    MP/MC bulk enq/dequeue (size: 8): 3.08
    SP/SC bulk enq/dequeue (size: 32): 1.07
    MP/MC bulk enq/dequeue (size: 32): 1.13
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.55
    MP/MC bulk enq/dequeue (size: 8): 12.96
    SP/SC bulk enq/dequeue (size: 32): 1.99
    MP/MC bulk enq/dequeue (size: 32): 3.37
    
    
    RTE>>ring_perf_autotest(#1 run of the test with the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.32
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.43
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.60
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.53
    MP/MC bulk enq/dequeue (size: 8): 3.09
    SP/SC bulk enq/dequeue (size: 32): 1.06
    MP/MC bulk enq/dequeue (size: 32): 1.18
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.57
    MP/MC bulk enq/dequeue (size: 8): 12.93
    SP/SC bulk enq/dequeue (size: 32): 1.98
    MP/MC bulk enq/dequeue (size: 32): 3.44
    Test OK
    RTE>>ring_perf_autotest (#2 run of the test with the patch)
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 8
    MP/MC single enq/dequeue: 10
    SP/SC burst enq/dequeue (size: 8): 1
    MP/MC burst enq/dequeue (size: 8): 1
    SP/SC burst enq/dequeue (size: 32): 0
    MP/MC burst enq/dequeue (size: 32): 0
    
    ### Testing empty dequeue ###
    SC empty dequeue: 0.24
    MC empty dequeue: 0.32
    
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 1.43
    MP/MC bulk enq/dequeue (size: 8): 1.74
    SP/SC bulk enq/dequeue (size: 32): 0.60
    MP/MC bulk enq/dequeue (size: 32): 0.68
    
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 2.53
    MP/MC bulk enq/dequeue (size: 8): 3.09
    SP/SC bulk enq/dequeue (size: 32): 1.06
    MP/MC bulk enq/dequeue (size: 32): 1.18
    
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 6.57
    MP/MC bulk enq/dequeue (size: 8): 12.93
    SP/SC bulk enq/dequeue (size: 32): 1.98
    MP/MC bulk enq/dequeue (size: 32): 3.46
    Test OK
    
    > -----Original Message-----
    > From: Ola Liljedahl
    > Sent: Monday, October 8, 2018 6:26 PM
    > To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
    > Cc: dev at dpdk.org; Honnappa Nagarahalli
    > <Honnappa.Nagarahalli at arm.com>; Ananyev, Konstantin
    > <konstantin.ananyev at intel.com>; Gavin Hu (Arm Technology China)
    > <Gavin.Hu at arm.com>; Steve Capper <Steve.Capper at arm.com>; nd
    > <nd at arm.com>; stable at dpdk.org
    > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    > 
    > 
    > 
    > On 08/10/2018, 12:00, "Jerin Jacob" <jerin.jacob at caviumnetworks.com>
    > wrote:
    > 
    >     -----Original Message-----
    >     > Date: Mon, 8 Oct 2018 09:22:05 +0000
    >     > From: Ola Liljedahl <Ola.Liljedahl at arm.com>
    >     > To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
    >     > CC: "dev at dpdk.org" <dev at dpdk.org>, Honnappa Nagarahalli
    >     >  <Honnappa.Nagarahalli at arm.com>, "Ananyev, Konstantin"
    >     >  <konstantin.ananyev at intel.com>, "Gavin Hu (Arm Technology China)"
    >     >  <Gavin.Hu at arm.com>, Steve Capper <Steve.Capper at arm.com>, nd
    > <nd at arm.com>,
    >     >  "stable at dpdk.org" <stable at dpdk.org>
    >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >
    >     > External Email
    >     >
    >     > On 08/10/2018, 08:06, "Jerin Jacob" <jerin.jacob at caviumnetworks.com>
    > wrote:
    >     >
    >     >     -----Original Message-----
    >     >     > Date: Sun, 7 Oct 2018 20:44:54 +0000
    >     >     > From: Ola Liljedahl <Ola.Liljedahl at arm.com>
    >     >     > To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
    >     >     > CC: "dev at dpdk.org" <dev at dpdk.org>, Honnappa Nagarahalli
    >     >     >  <Honnappa.Nagarahalli at arm.com>, "Ananyev, Konstantin"
    >     >     >  <konstantin.ananyev at intel.com>, "Gavin Hu (Arm Technology
    > China)"
    >     >     >  <Gavin.Hu at arm.com>, Steve Capper <Steve.Capper at arm.com>, nd
    > <nd at arm.com>,
    >     >     >  "stable at dpdk.org" <stable at dpdk.org>
    >     >     > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
    >     >     > user-agent: Microsoft-MacOutlook/10.11.0.180909
    >     >     >
    >     >
    >     >
    >     >     Could you please fix the email client for inline reply.
    >     > Sorry that doesn't seem to be possible with Outlook for Mac 16 or
    > Office365. The official Office365/Outlook
    >     > documentation doesn't match the actual user interface...
    >     >
    >     >
    >     >
    >     >     https://www.kernel.org/doc/html/v4.19-rc7/process/email-
    > clients.html
    >     >
    >     >
    >     >     >
    >     >     > On 07/10/2018, 06:03, "Jerin Jacob"
    > <jerin.jacob at caviumnetworks.com> wrote:
    >     >     >
    >     >     >     In the arm64 case, it will have ATOMIC_RELAXED followed by the
    >     >     >     asm volatile ("":::"memory") of rte_pause().
    >     >     >     I wouldn't have any issue if the generated code were the same as
    >     >     >     or better than in the existing case, but that is not the case, right?
    >     >     > The existing case is actually not interesting (IMO) as it exposes
    > undefined behaviour which allows the compiler to do anything. But you
    > seem to be satisfied with "works for me, right here right now". I think the
    > cost of avoiding undefined behaviour is acceptable (actually I don't think it
    > will even be noticeable).
    >     >
    >     I am not convinced, because of the use of volatile in the head and tail
    >     indexes. For me, that gives defined behavior.
    >     > As long as you don't mix in C11 atomic accesses (just use "plain" accesses
    > to volatile objects),
    >     > it is AFAIK defined behaviour (but not necessarily using atomic loads and
    > stores). But I quoted
    >     > the C11 spec where it explicitly mentions that mixing atomic and non-
    > atomic accesses to the same
    >     > object is undefined behaviour. Don't argue with me, argue with the C11
    > spec.
    >     > If you want to disobey the spec, this should at least be called out in
    >     > the code with a comment.
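    >     >
    >     > For illustration, a minimal sketch of the problematic mix (made-up names,
    >     > not the actual ring code):
    >     >
    >     >     #include <stdint.h>
    >     >
    >     >     static volatile uint32_t tail;   /* a plain volatile object */
    >     >
    >     >     void thread_a(void)              /* C11-style atomic store */
    >     >     {
    >     >         __atomic_store_n(&tail, 1u, __ATOMIC_RELEASE);
    >     >     }
    >     >
    >     >     uint32_t thread_b(void)          /* plain, non-atomic read */
    >     >     {
    >     >         /* Per 5.1.2.4 these are conflicting accesses, one of them
    >     >          * non-atomic, with no happens-before relation: a data race,
    >     >          * hence undefined behaviour. */
    >     >         return tail;
    >     >         /* Consistent alternative, no UB:
    >     >          * return __atomic_load_n(&tail, __ATOMIC_ACQUIRE); */
    >     >     }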
    > 
    >     That boils down to only one question: should we follow the C11 spec? Why
    >     not just take load-acquire and store-release semantics, like the Linux
    >     kernel and FreeBSD?
    > And introduce even more undefined behaviour?
    > 
    >     It does not look like the C11 memory model is super efficient in terms of
    >     the gcc implementation.
    > You are making a chicken out of a feather.
    > 
    > I think this "problem" with one additional ADD instruction will only concern
    > __atomic_load_n(__ATOMIC_RELAXED) and
    > __atomic_store_n(__ATOMIC_RELAXED) because the compiler separates
    > the address generation (add offset of struct member) from the load or store
    > itself. For other atomic operations and memory orderings (e.g.
    > __atomic_load_n(__ATOMIC_ACQUIRE)), the extra ADD instruction will be
    > included anyway (as long as we access a non-first struct member) because
    > e.g. LDAR only accepts a base register with no offset.
    > 
    > I suggest that minimising the imposed memory orderings can have a much larger
    > (positive) effect on performance compared to avoiding one ADD instruction
    > (memory accesses are much slower than CPU ALU instructions).
    > Using C11 memory model and identifying exactly which objects are used for
    > synchronisation and whether (any) updates to shared memory are acquired
    > or released (no updates to shared memory means relaxed order can be used)
    > will provide maximum freedom to the compiler and hardware to get the best
    > result.
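    >
    > As a rough sketch of the difference (simplified field names, not the exact
    > rte_ring layout; actual code generation depends on the compiler version):
    >
    >     #include <stdint.h>
    >
    >     struct ring {
    >         uint32_t flags;
    >         struct { volatile uint32_t head, tail; } prod, cons;
    >     };
    >
    >     /* Two ways to read a tail index; the memory order is the real knob. */
    >     uint32_t read_tails(const struct ring *r)
    >     {
    >         /* Relaxed load of a non-first member: today's GCC may emit
    >          * add + ldr here, i.e. the one extra ADD discussed above. */
    >         uint32_t ct = __atomic_load_n(&r->cons.tail, __ATOMIC_RELAXED);
    >
    >         /* Acquire load: on AArch64 this becomes LDAR, which only accepts
    >          * a base register (no offset), so the ADD is there anyway. */
    >         uint32_t pt = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);
    >
    >         return pt - ct;
    >     }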
    > 
    > The FreeBSD and DPDK ring buffers show some fundamental
    > misunderstandings here. Instead, excessive orderings and explicit barriers
    > have been used as band-aids, with unknown effects on performance.
    > 
    > 
    >     >
    >     >
    >     >     That is the reason why I shared the generated assembly code. If you
    >     >     think otherwise, pick any compiler and look at the generated output.
    >     > This is what one compiler for one architecture generates today. These
    > things change. Other things
    >     > that used to work or worked for some specific architecture have stopped
    > working in newer versions of
    >     > the compiler.
    >     >
    >     >
    >     >     And
    >     >
    >     >     The FreeBSD implementation of the ring buffer (which DPDK derived
    >     >     from) doesn't have such logic; see
    >     >     https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L108
    >     > It looks like FreeBSD uses some kind of C11 atomic memory model-
    > inspired API although I don't see
    >     > exactly how e.g. atomic_store_rel_int() is implemented. The code also
    > mixes in explicit barriers
    >     > so it is definitely not pure C11 memory model usage. And finally, it doesn't
    > establish the proper
    >     > load-acquire/store-release relationships (e.g. store-release cons_tail
    > requires a load-acquire cons_tail,
    >     > same for prod_tail).
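    >     >
    >     > Roughly the pairing I mean, as a fragment with simplified field names
    >     > (not the actual buf_ring/rte_ring code):
    >     >
    >     >     /* Producer side, after the ring slots have been written: */
    >     >     __atomic_store_n(&r->prod_tail, new_prod_tail, __ATOMIC_RELEASE);
    >     >
    >     >     /* Consumer side: the matching read must be an acquire, so that the
    >     >      * subsequent reads of the slots cannot be reordered before it. */
    >     >     uint32_t avail = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE)
    >     >                      - cons_head;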
    >     >
    >     > "* multi-producer safe lock-free ring buffer enqueue"
    >     > The comment is also wrong. This design is not lock-free; how could it be
    > when there is spinning
    >     > (waiting) for other threads in the code? If a thread must wait for other
    > threads, then by definition
    >     > the design is blocking.
    >     >
    >     > So you are saying that because FreeBSD is doing it wrong, DPDK can also
    > do it wrong?
    >     >
    >     >
    >     >     See below too.
    >     >
    >     >     >
    >     >     > Skipping the compiler memory barrier in rte_pause() potentially
    > allows for optimisations that provide much more benefit, e.g. hiding some
    > cache miss latency for later loads. The DPDK ring buffer implementation is
    > defined so as to enable inlining of the enqueue/dequeue functions into the
    > caller; any code could immediately follow these calls.
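    >     >     >
    >     >     > For reference, the kind of compiler barrier in question (a sketch; how
    >     >     > rte_pause() is defined varies per architecture):
    >     >     >
    >     >     >     /* No instruction is emitted, but the compiler must not cache
    >     >     >      * registers or move memory accesses across this point, which
    >     >     >      * e.g. prevents hoisting later loads above the call site. */
    >     >     >     #define compiler_barrier() asm volatile("" ::: "memory")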
    >     >     >
    >     >     > From INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:201x
    >     >     > Programming languages — C
    >     >     >
    >     >     > 5.1.2.4
    >     >     > 4 Two expression evaluations conflict if one of them modifies a
    > memory location and the other one reads or modifies the same memory
    > location.
    >     >     >
    >     >     > 25 The execution of a program contains a data race if it contains two
    > conflicting actions in different threads, at least one of which is not atomic,
    > and neither happens before the other. Any such data race results in
    > undefined behavior.
    >     >
    >     >     IMO, both conditions will be satisfied if the variable is volatile and a
    >     >     32-bit read will be atomic on 32b and 64b machines. If not, the problem
    >     >     persists for the generic case as well (lib/librte_ring/rte_ring_generic.h).
    >     > The read from a volatile object is not an atomic access per the C11 spec. It
    > just happens to
    >     > be translated to an instruction (on x86-64 and AArch64/A64) that
    > implements an atomic load.
    >     > I don't think any compiler would change this code generation and
    > suddenly generate some
    >     > non-atomic load instruction for a program that *only* uses volatile to do
    > "atomic" accesses.
    >     > But a future compiler could detect the mix of atomic and non-atomic
    > accesses and mark this
    >     > expression as causing undefined behaviour and that would have
    > consequences for code generation.
    >     >
    >     >
    >     >     I agree with you on the usage of C11 memory model semantics. The reason
    >     >     why I proposed the name rte_ring_c11_mem.h for the file is that DPDK
    >     >     itself did not have definitions for load-acquire and store-release
    >     >     semantics. I was looking at taking load-acquire and store-release
    >     >     semantics from C11 instead of creating a new API like the Linux kernel
    >     >     or FreeBSD (APIs like atomic_load_acq_32(), atomic_store_rel_32()).
    >     >     If the file name is your concern then we could create new abstractions
    >     >     as well. That would help the existing KNI problem as well.
    >     > I appreciate your embrace of the C11 memory model. I think it is better
    > for describing
    >     > (both to the compiler and to humans) which objects are used for
    >     > synchronisation and how.
    >     >
    >     > However, I don't think an API as you suggest (and others have suggested
    > before, e.g. as
    >     > done in ODP) is a good idea. There is an infinite amount of possible base
    > types, an
    >     > increasing number of operations and a bunch of different memory
    > orderings, a "complete"
    >     > API would be very large and difficult to test, and most members of the
    > API would never be used.
    >     > GCC and Clang both support the __atomic intrinsics. This API avoids the
    > problems I
    >     > described above. Or we could use the official C11 syntax (stdatomic.h).
    > But then we
    >     > have the problem with using pre-C11 compilers...
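    >     >
    >     > Side by side, the three styles being discussed, as a fragment (the wrapper
    >     > name below is only a hypothetical illustration, borrowed from FreeBSD
    >     > naming):
    >     >
    >     >     /* GCC/Clang __atomic intrinsics, usable with pre-C11 compilers: */
    >     >     tail = __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);
    >     >
    >     >     /* Official C11, requires <stdatomic.h> and an _Atomic object: */
    >     >     tail = atomic_load_explicit(&r->cons_tail, memory_order_acquire);
    >     >
    >     >     /* A FreeBSD/Linux-style wrapper API (hypothetical for DPDK): */
    >     >     tail = atomic_load_acq_32(&r->cons_tail);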
    > 
    >     I have no objection if everyone agrees to move to the C11 memory model
    >     with __atomic intrinsics. But if we need to keep both, then an
    >     atomic_load_acq_32() kind of API makes sense.
    > 
    > 
    >     >
    >     >
    >     >
    >     >
    >     >     I think it is currently mixed usage because the same variable declaration
    >     >     is used for both C11 and non-C11 usage. Ideally we won't need "volatile"
    >     >     for the C11 case. Either we need to change to C11 mode only, OR have APIs
    >     >     for atomic_load_acq_() and atomic_store_rel_() to allow both models, like
    >     >     the Linux kernel and FreeBSD.
    >     >
    >     >     >
    >     >     > -- Ola
    >     >     >
    >     >     >
    >     >     >
    >     >
    >     >
    > 
    
    


