[dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Fri Oct 5 22:29:33 CEST 2018
Previous message: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
Next message: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> 
> I doubt it is possible to benchmark with such a precision so to see the
> potential difference of one ADD instruction.
> Just changes in function alignment can affect performance by percents. And
> the natural variation when not using a 100% deterministic system is going to
> be a lot larger than one cycle per ring buffer operation.
> 
> Some of the other patches are also for correctness (e.g. load-acquire of tail)
The discussion is about this patch alone. Other patches are already Acked.

> so while performance measurements may be interesting, we can't skip a bug
> fix just because it proves to decrease performance.
IMO, this patch is not a bug fix - in terms of it fixing any failures with the current code.

> 
> -- Ola
> 
> On 05/10/2018, 22:06, "Honnappa Nagarahalli"
> <Honnappa.Nagarahalli at arm.com> wrote:
> 
>     Hi Jerin,
>     	Thank you for generating the disassembly, that is really helpful. I
> agree with you that we have the option of moving parts 2 and 3 forward. I
> will let Gavin take a decision.
> 
>     I suggest that we run benchmarks on this patch alone and in combination
> with other patches in the series. We have few Arm machines and we will run
> on all of them along with x86. We take a decision based on that.
> 
>     Would that be a way to move forward? I think this should address both
> your and Ola's concerns.
> 
>     I am open for other suggestions as well.
> 
>     Thank you,
>     Honnappa
> 
>     >
>     > So you don't want to write the proper C11 code because the compiler
>     > generates one extra instruction that way?
>     > You don't even know if that one extra instruction has any measurable
>     > impact on performance. E.g. it could be issued the cycle before together
>     > with other instructions.
>     >
>     > We can complain to the compiler writers that the code generation for
>     > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
>     > ARM/A64). I think the problem is that the __atomic builtins only accept
> a
>     > base address without any offset and this is possibly because e.g.
> load/store
>     > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR)
> only
>     > accept a base register with no offset. So any offset has to be added
> before
>     > the actual "atomic" instruction, LDR in this case.
>     >
>     >
>     > -- Ola
>     >
>     >
>     > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob at caviumnetworks.com>
>     > wrote:
>     >
>     >     -----Original Message-----
>     >     > Date: Fri, 5 Oct 2018 15:11:44 +0000
>     >     > From: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
>     >     > To: "Ananyev, Konstantin" <konstantin.ananyev at intel.com>, Ola
>     > Liljedahl
>     >     >  <Ola.Liljedahl at arm.com>, "Gavin Hu (Arm Technology China)"
>     >     >  <Gavin.Hu at arm.com>, Jerin Jacob
> <jerin.jacob at caviumnetworks.com>
>     >     > CC: "dev at dpdk.org" <dev at dpdk.org>, Steve Capper
>     > <Steve.Capper at arm.com>, nd
>     >     >  <nd at arm.com>, "stable at dpdk.org" <stable at dpdk.org>
>     >     > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
>     >     >
>     >     > > >         > Hi Jerin,
>     >     > > >         >
>     >     > > >         > Thanks for your review, inline comments from our
> internal
>     >     > > discussions.
>     >     > > >         >
>     >     > > >         > BR. Gavin
>     >     > > >         >
>     >     > > >         > > -----Original Message-----
>     >     > > >         > > From: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>     >     > > >         > > Sent: Saturday, September 29, 2018 6:49 PM
>     >     > > >         > > To: Gavin Hu (Arm Technology China)
> <Gavin.Hu at arm.com>
>     >     > > >         > > Cc: dev at dpdk.org; Honnappa Nagarahalli
>     >     > > >         > > <Honnappa.Nagarahalli at arm.com>; Steve Capper
>     >     > > >         > > <Steve.Capper at arm.com>; Ola Liljedahl
>     > <Ola.Liljedahl at arm.com>;
>     >     > > nd
>     >     > > >         > > <nd at arm.com>; stable at dpdk.org
>     >     > > >         > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic
> load
>     >     > > >         > >
>     >     > > >         > > -----Original Message-----
>     >     > > >         > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
>     >     > > >         > > > From: Gavin Hu <gavin.hu at arm.com>
>     >     > > >         > > > To: dev at dpdk.org
>     >     > > >         > > > CC: gavin.hu at arm.com,
> Honnappa.Nagarahalli at arm.com,
>     >     > > >         > > > steve.capper at arm.com,  Ola.Liljedahl at arm.com,
>     >     > > >         > > > jerin.jacob at caviumnetworks.com, nd at arm.com,
>     >     > > stable at dpdk.org
>     >     > > >         > > > Subject: [PATCH v3 1/3] ring: read tail using atomic
> load
>     >     > > >         > > > X-Mailer: git-send-email 2.7.4
>     >     > > >         > > >
>     >     > > >         > > > External Email
>     >     > > >         > > >
>     >     > > >         > > > In update_tail, read ht->tail using
>     > __atomic_load.Although the
>     >     > > >         > > > compiler currently seems to be doing the right thing
> even
>     > without
>     >     > > >         > > > _atomic_load, we don't want to give the compiler
>     > freedom to
>     >     > > optimise
>     >     > > >         > > > what should be an atomic load, it should not be
> arbitarily
>     > moved
>     >     > > >         > > > around.
>     >     > > >         > > >
>     >     > > >         > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model
>     > barrier
>     >     > > option")
>     >     > > >         > > > Cc: stable at dpdk.org
>     >     > > >         > > >
>     >     > > >         > > > Signed-off-by: Gavin Hu <gavin.hu at arm.com>
>     >     > > >         > > > Reviewed-by: Honnappa Nagarahalli
>     >     > > <Honnappa.Nagarahalli at arm.com>
>     >     > > >         > > > Reviewed-by: Steve Capper <steve.capper at arm.com>
>     >     > > >         > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl at arm.com>
>     >     > > >         > > > ---
>     >     > > >         > > >  lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
>     >     > > >         > > >  1 file changed, 2 insertions(+), 1 deletion(-)
>     >     > > >         > > >
>     >     > > >         > The read of ht->tail needs to be atomic, a non-atomic
> read
>     > would not
>     >     > > be correct.
>     >     > > >
>     >     > > >         That's a 32bit value load.
>     >     > > >         AFAIK on all CPUs that we support it is an atomic operation.
>     >     > > >     [Ola] But that the ordinary C load is translated to an atomic
> load
>     > for the
>     >     > > target architecture is incidental.
>     >     > > >
>     >     > > >     If the design requires an atomic load (which is the case here),
> we
>     >     > > > should use an atomic load on the language level. Then we can
> be
>     > sure it will
>     >     > > always be translated to an atomic load for the target in question
> or
>     >     > > compilation will fail. We don't have to depend on assumptions.
>     >     > >
>     >     > > We all know that 32bit load/store on cpu we support - are atomic.
>     >     > > If it wouldn't be the case - DPDK would be broken in dozen places.
>     >     > > So what the point to pretend that "it might be not atomic" if we
> do
>     > know for
>     >     > > sure that it is?
>     >     > > I do understand that you want to use atomic_load(relaxed) here
> for
>     >     > > consistency, and to conform with C11 mem-model and I don't see
> any
>     > harm in
>     >     > > that.
>     >     > We can continue to discuss the topic, it is a good discussion. But, as
> far
>     > this patch is concerned, can I consider this as us having a consensus?
> The
>     > file rte_ring_c11_mem.h is specifically for C11 memory model and I also
> do
>     > not see any harm in having code that completely conforms to C11
> memory
>     > model.
>     >
>     >     Have you guys checked the output assembly with and without atomic
>     > load?
>     >     There is an extra "add" instruction with at least the code I have
> checked.
>     >     I think, compiler is not smart enough to understand it is a dead code
> for
>     >     arm64.
>     >
>     >     ➜ [~] $ aarch64-linux-gnu-gcc -v
>     >     Using built-in specs.
>     >     COLLECT_GCC=aarch64-linux-gnu-gcc
>     >     COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-
>     > wrapper
>     >     Target: aarch64-linux-gnu
>     >     Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
>     >     --prefix=/usr --program-prefix=aarch64-linux-gnu-
>     >     --with-local-prefix=/usr/aarch64-linux-gnu
>     >     --with-sysroot=/usr/aarch64-linux-gnu
>     >     --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
>     >     --libexecdir=/usr/lib --target=aarch64-linux-gnu
>     >     --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
>     >     --enable-languages=c,c++ --enable-shared --enable-threads=posix
>     >     --with-system-zlib --with-isl --enable-__cxa_atexit
>     >     --disable-libunwind-exceptions --enable-clocale=gnu
>     >     --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
>     >     --enable-linker-build-id --enable-lto --enable-plugin
>     >     --enable-install-libiberty --with-linker-hash-style=gnu
>     >     --enable-gnu-indirect-function --disable-multilib --disable-werror
>     >     --enable-checking=release
>     >     Thread model: posix
>     >     gcc version 8.2.0 (GCC)
>     >
>     >
>     >     # build setup
>     >     make -j 8 config T=arm64-armv8a-linuxapp-gcc  CROSS=aarch64-linux-
> gnu-
>     >     make -j 8 test-build CROSS=aarch64-linux-gnu-
>     >
>     >     # generate asm
>     >     aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex
> 'disassemble /rs
>     > bucket_enqueue_single'
>     >
>     >     I have uploaded generated file for your convenience
>     >     with_atomic_load.txt(includes patch 1,2,3)
>     >     -----------------------
>     >     https://pastebin.com/SQ6w1yRu
>     >
>     >     without_atomic_load.txt(includes patch 2,3)
>     >     -----------------------
>     >     https://pastebin.com/BpvnD0CA
>     >
>     >
>     >     without_atomic
>     >     -------------
>     >     23              if (!single)
>     >        0x000000000068d290 <+240>:   85 00 00 35     cbnz    w5, 0x68d2a0
>     > <bucket_enqueue_single+256>
>     >        0x000000000068d294 <+244>:   82 04 40 b9     ldr     w2, [x4, #4]
>     >        0x000000000068d298 <+248>:   5f 00 01 6b     cmp     w2, w1
>     >        0x000000000068d29c <+252>:   21 01 00 54     b.ne    0x68d2c0
>     > <bucket_enqueue_single+288>  // b.any
>     >
>     >     24                      while (unlikely(ht->tail != old_val))
>     >     25                              rte_pause();
>     >
>     >
>     >     with_atomic
>     >     -----------
>     >     23              if (!single)
>     >        0x000000000068ceb0 <+240>:   00 10 04 91     add     x0, x0, #0x104
>     >        0x000000000068ceb4 <+244>:   84 00 00 35     cbnz    w4, 0x68cec4
>     > <bucket_enqueue_single+260>
>     >        0x000000000068ceb8 <+248>:   02 00 40 b9     ldr     w2, [x0]
>     >        0x000000000068cebc <+252>:   3f 00 02 6b     cmp     w1, w2
>     >        0x000000000068cec0 <+256>:   01 09 00 54     b.ne    0x68cfe0
>     > <bucket_enqueue_single+544>  // b.any
>     >
>     >     24                      while (unlikely(old_val != __atomic_load_n(&ht->tail,
>     > __ATOMIC_RELAXED)))
>     >
>     >
>     >     I don't want to block this series of patches due this patch. Can we
> make
>     >     re spin one series with 2 and 3 patches. And Wait for patch 1 to
> conclude?
>     >
>     >     Thoughts?
>     >
>     >
>     >
>     >
>     >     >
>     >     > > But argument that we shouldn't assume 32bit load/store ops as
>     > atomic
>     >     > > sounds a bit flaky to me.
>     >     > > Konstantin
>     >     > >
>     >     > >
>     >     > > >
>     >     > > >
>     >     > > >
>     >     > > >         > But there are no memory ordering requirements (with
>     >     > > >         > regards to other loads and/or stores by this thread) so
>     > relaxed
>     >     > > memory order is sufficient.
>     >     > > >         > Another aspect of using __atomic_load_n() is that the
>     >     > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it
> has
>     > to be
>     >     > > done as
>     >     > > >         > specified in the source code which is also what we need
> here.
>     >     > > >
>     >     > > >         I think Jerin points that rte_pause() acts here as compiler
>     > barrier too,
>     >     > > >         so no need to worry that compiler would optimize out the
> loop.
>     >     > > >     [Ola] Sorry missed that. But the barrier behaviour of
> rte_pause()
>     >     > > > is not part of C11, is it essentially a hand-made feature to
> support
>     >     > > > the legacy multithreaded memory model (which uses explicit
> HW
>     > and
>     >     > > compiler barriers). I'd prefer code using the C11 memory model
> not to
>     >     > > depend on such legacy features.
>     >     > > >
>     >     > > >
>     >     > > >
>     >     > > >         Konstantin
>     >     > > >
>     >     > > >         >
>     >     > > >         > One point worth mentioning though is that this change is
> for
>     >     > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be
> worth
>     > persisting
>     >     > > >         > with getting the C11 code right when people are less
> excited
>     > about
>     >     > > sending a release out?
>     >     > > >         >
>     >     > > >         > We can explain that for C11 we would prefer to do loads
> and
>     > stores
>     >     > > as per the C11 memory model. In the case of rte_ring, the code is
>     >     > > >         > separated cleanly into C11 specific files anyway.
>     >     > > >         >
>     >     > > >         > I think reading ht->tail using __atomic_load_n() is the
> most
>     >     > > appropriate way. We show that ht->tail is used for
> synchronization,
>     > we
>     >     > > >         > acknowledge that ht->tail may be written by other
> threads
>     >     > > > without any other kind of synchronization (e.g. no lock involved)
>     > and we
>     >     > > require
>     >     > > >         > an atomic load (any write to ht->tail must also be atomic).
>     >     > > >         >
>     >     > > >         > Using volatile and explicit compiler (or processor)
> memory
>     > barriers
>     >     > > (fences) is the legacy pre-C11 way of accomplishing these things.
>     >     > > > There's
>     >     > > >         > a reason why C11/C++11 moved away from the old ways.
>     >     > > >         > > >
>     >     > > >         > > >         __atomic_store_n(&ht->tail, new_val,
>     > __ATOMIC_RELEASE);
>     >     > > >         > > > --
>     >     > > >         > > > 2.7.4
>     >     > > >         > > >
>     >     > > >
>     >     > > >
>     >     > > >
>     >     >
>     >
> 
>
Previous message: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
Next message: [dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list