[dpdk-dev] [PATCH v3 1/3] ring: read tail using atomic load
Honnappa Nagarahalli
Honnappa.Nagarahalli at arm.com
Fri Oct 5 22:29:33 CEST 2018
>
> I doubt it is possible to benchmark with such precision as to see the
> potential difference of one ADD instruction.
> Just changes in function alignment can affect performance by percents. And
> the natural variation when not using a 100% deterministic system is going to
> be a lot larger than one cycle per ring buffer operation.
>
> Some of the other patches are also for correctness (e.g. load-acquire of tail)
The discussion is about this patch alone. Other patches are already Acked.
> so while performance measurements may be interesting, we can't skip a bug
> fix just because it proves to decrease performance.
IMO, this patch is not a bug fix, in the sense that it does not fix any failure observed with the current code.
>
> -- Ola
>
> On 05/10/2018, 22:06, "Honnappa Nagarahalli"
> <Honnappa.Nagarahalli at arm.com> wrote:
>
> Hi Jerin,
> Thank you for generating the disassembly, that is really helpful. I
> agree with you that we have the option of moving parts 2 and 3 forward. I
> will let Gavin take a decision.
>
> I suggest that we run benchmarks on this patch alone and in combination
> with other patches in the series. We have a few Arm machines and we will run
> on all of them along with x86. We can take a decision based on that.
>
> Would that be a way to move forward? I think this should address both
> your and Ola's concerns.
>
> I am open to other suggestions as well.
>
> Thank you,
> Honnappa
>
> >
> > So you don't want to write the proper C11 code because the compiler
> > generates one extra instruction that way?
> > You don't even know if that one extra instruction has any measurable
> > impact on performance. E.g. it could be issued the cycle before together
> > with other instructions.
> >
> > We can complain to the compiler writers that the code generation for
> > __atomic_load_n(, __ATOMIC_RELAXED) is not optimal (at least on
> > ARM/A64). I think the problem is that the __atomic builtins only accept a
> > base address without any offset and this is possibly because e.g. load/store
> > exclusive (LDX/STX) and load-acquire (LDAR) and store-release (STLR) only
> > accept a base register with no offset. So any offset has to be added before
> > the actual "atomic" instruction, LDR in this case.
> >
> >
> > -- Ola
> >
> >
> > On 05/10/2018, 19:07, "Jerin Jacob" <jerin.jacob at caviumnetworks.com>
> > wrote:
> >
> > -----Original Message-----
> > > Date: Fri, 5 Oct 2018 15:11:44 +0000
> > > From: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> > > To: "Ananyev, Konstantin" <konstantin.ananyev at intel.com>, Ola Liljedahl
> > > <Ola.Liljedahl at arm.com>, "Gavin Hu (Arm Technology China)"
> > > <Gavin.Hu at arm.com>, Jerin Jacob <jerin.jacob at caviumnetworks.com>
> > > CC: "dev at dpdk.org" <dev at dpdk.org>, Steve Capper <Steve.Capper at arm.com>, nd
> > > <nd at arm.com>, "stable at dpdk.org" <stable at dpdk.org>
> > > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> > >
> > > > > > Hi Jerin,
> > > > > >
> > > > > > > Thanks for your review, inline comments from our internal
> > > > > > > discussions.
> > > > > >
> > > > > > BR. Gavin
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerin.jacob at caviumnetworks.com>
> > > > > > > Sent: Saturday, September 29, 2018 6:49 PM
> > > > > > > > To: Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>
> > > > > > > > Cc: dev at dpdk.org; Honnappa Nagarahalli
> > > > > > > > <Honnappa.Nagarahalli at arm.com>; Steve Capper
> > > > > > > > <Steve.Capper at arm.com>; Ola Liljedahl <Ola.Liljedahl at arm.com>; nd
> > > > > > > > <nd at arm.com>; stable at dpdk.org
> > > > > > > > Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > > Date: Mon, 17 Sep 2018 16:17:22 +0800
> > > > > > > > From: Gavin Hu <gavin.hu at arm.com>
> > > > > > > > To: dev at dpdk.org
> > > > > > > > > CC: gavin.hu at arm.com, Honnappa.Nagarahalli at arm.com,
> > > > > > > > > steve.capper at arm.com, Ola.Liljedahl at arm.com,
> > > > > > > > > jerin.jacob at caviumnetworks.com, nd at arm.com, stable at dpdk.org
> > > > > > > > > Subject: [PATCH v3 1/3] ring: read tail using atomic load
> > > > > > > > X-Mailer: git-send-email 2.7.4
> > > > > > > >
> > > > > > > > > In update_tail, read ht->tail using __atomic_load. Although the
> > > > > > > > > compiler currently seems to be doing the right thing even without
> > > > > > > > > __atomic_load, we don't want to give the compiler freedom to optimise
> > > > > > > > > what should be an atomic load; it should not be arbitrarily moved
> > > > > > > > > around.
> > > > > > > >
> > > > > > > > > Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> > > > > > > > Cc: stable at dpdk.org
> > > > > > > >
> > > > > > > > Signed-off-by: Gavin Hu <gavin.hu at arm.com>
> > > > > > > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> > > > > > > > Reviewed-by: Steve Capper <steve.capper at arm.com>
> > > > > > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl at arm.com>
> > > > > > > > ---
> > > > > > > > lib/librte_ring/rte_ring_c11_mem.h | 3 ++-
> > > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > The read of ht->tail needs to be atomic, a non-atomic read would not
> > > > > > > be correct.
> > > > >
> > > > > That's a 32bit value load.
> > > > > AFAIK on all CPUs that we support it is an atomic operation.
> > > > > > [Ola] But that the ordinary C load is translated to an atomic load for the
> > > > > > target architecture is incidental.
> > > > > >
> > > > > > If the design requires an atomic load (which is the case here), we
> > > > > > should use an atomic load on the language level. Then we can be sure it will
> > > > > > always be translated to an atomic load for the target in question or
> > > > > > compilation will fail. We don't have to depend on assumptions.
> > > >
> > > > > We all know that 32bit load/store on the CPUs we support are atomic.
> > > > > If that weren't the case, DPDK would be broken in a dozen places.
> > > > > So what is the point of pretending that "it might be not atomic" if we do
> > > > > know for sure that it is?
> > > > > I do understand that you want to use atomic_load(relaxed) here for
> > > > > consistency, and to conform with C11 mem-model and I don't see any harm in
> > > > > that.
> > > > We can continue to discuss the topic, it is a good discussion. But, as far
> > > > as this patch is concerned, can I consider this as us having a consensus? The
> > > > file rte_ring_c11_mem.h is specifically for the C11 memory model and I also do
> > > > not see any harm in having code that completely conforms to the C11 memory
> > > > model.
> >
> > Have you guys checked the output assembly with and without atomic
> > load?
> > There is an extra "add" instruction, at least with the code I have checked.
> > I think the compiler is not smart enough to understand it is dead code for
> > arm64.
> >
> > ➜ [~] $ aarch64-linux-gnu-gcc -v
> > Using built-in specs.
> > COLLECT_GCC=aarch64-linux-gnu-gcc
> > COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8.2.0/lto-wrapper
> > Target: aarch64-linux-gnu
> > Configured with: /build/aarch64-linux-gnu-gcc/src/gcc-8.2.0/configure
> > --prefix=/usr --program-prefix=aarch64-linux-gnu-
> > --with-local-prefix=/usr/aarch64-linux-gnu
> > --with-sysroot=/usr/aarch64-linux-gnu
> > --with-build-sysroot=/usr/aarch64-linux-gnu --libdir=/usr/lib
> > --libexecdir=/usr/lib --target=aarch64-linux-gnu
> > --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-nls
> > --enable-languages=c,c++ --enable-shared --enable-threads=posix
> > --with-system-zlib --with-isl --enable-__cxa_atexit
> > --disable-libunwind-exceptions --enable-clocale=gnu
> > --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
> > --enable-linker-build-id --enable-lto --enable-plugin
> > --enable-install-libiberty --with-linker-hash-style=gnu
> > --enable-gnu-indirect-function --disable-multilib --disable-werror
> > --enable-checking=release
> > Thread model: posix
> > gcc version 8.2.0 (GCC)
> >
> >
> > # build setup
> > make -j 8 config T=arm64-armv8a-linuxapp-gcc CROSS=aarch64-linux-gnu-
> > make -j 8 test-build CROSS=aarch64-linux-gnu-
> >
> > # generate asm
> > aarch64-linux-gnu-gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rs bucket_enqueue_single'
> >
> > I have uploaded generated file for your convenience
> > with_atomic_load.txt(includes patch 1,2,3)
> > -----------------------
> > https://pastebin.com/SQ6w1yRu
> >
> > without_atomic_load.txt(includes patch 2,3)
> > -----------------------
> > https://pastebin.com/BpvnD0CA
> >
> >
> > without_atomic
> > -------------
> > 23 if (!single)
> > 0x000000000068d290 <+240>: 85 00 00 35 cbnz w5, 0x68d2a0 <bucket_enqueue_single+256>
> > 0x000000000068d294 <+244>: 82 04 40 b9 ldr w2, [x4, #4]
> > 0x000000000068d298 <+248>: 5f 00 01 6b cmp w2, w1
> > 0x000000000068d29c <+252>: 21 01 00 54 b.ne 0x68d2c0 <bucket_enqueue_single+288> // b.any
> >
> > 24 while (unlikely(ht->tail != old_val))
> > 25 rte_pause();
> >
> >
> > with_atomic
> > -----------
> > 23 if (!single)
> > 0x000000000068ceb0 <+240>: 00 10 04 91 add x0, x0, #0x104
> > 0x000000000068ceb4 <+244>: 84 00 00 35 cbnz w4, 0x68cec4 <bucket_enqueue_single+260>
> > 0x000000000068ceb8 <+248>: 02 00 40 b9 ldr w2, [x0]
> > 0x000000000068cebc <+252>: 3f 00 02 6b cmp w1, w2
> > 0x000000000068cec0 <+256>: 01 09 00 54 b.ne 0x68cfe0 <bucket_enqueue_single+544> // b.any
> >
> > 24 while (unlikely(old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED)))
> >
> >
> > I don't want to block this series of patches due to this patch. Can we
> > respin one series with patches 2 and 3, and wait for patch 1 to conclude?
> >
> > Thoughts?
> >
> >
> >
> >
> > >
> > > > > But the argument that we shouldn't assume 32bit load/store ops as atomic
> > > > > sounds a bit flaky to me.
> > > > Konstantin
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > > > But there are no memory ordering requirements (with
> > > > > > > regards to other loads and/or stores by this thread) so relaxed
> > > > > > > memory order is sufficient.
> > > > > > > Another aspect of using __atomic_load_n() is that the
> > > > > > > compiler cannot "optimise" this load (e.g. combine, hoist etc), it has to be
> > > > > > > done as specified in the source code which is also what we need here.
> > > > >
> > > > > > I think Jerin's point is that rte_pause() acts here as a compiler barrier too,
> > > > > > so no need to worry that the compiler would optimize out the loop.
> > > > > > [Ola] Sorry missed that. But the barrier behaviour of rte_pause()
> > > > > > is not part of C11, it is essentially a hand-made feature to support
> > > > > > the legacy multithreaded memory model (which uses explicit HW and
> > > > > > compiler barriers). I'd prefer code using the C11 memory model not to
> > > > > > depend on such legacy features.
> > > > >
> > > > >
> > > > >
> > > > > Konstantin
> > > > >
> > > > > >
> > > > > > > One point worth mentioning though is that this change is for
> > > > > > > the rte_ring_c11_mem.h file, not the legacy ring. It may be worth persisting
> > > > > > > with getting the C11 code right when people are less excited about
> > > > > > > sending a release out?
> > > > > > >
> > > > > > > We can explain that for C11 we would prefer to do loads and stores
> > > > > > > as per the C11 memory model. In the case of rte_ring, the code is
> > > > > > > separated cleanly into C11 specific files anyway.
> > > > > > >
> > > > > > > I think reading ht->tail using __atomic_load_n() is the most
> > > > > > > appropriate way. We show that ht->tail is used for synchronization, we
> > > > > > > acknowledge that ht->tail may be written by other threads
> > > > > > > without any other kind of synchronization (e.g. no lock involved) and we
> > > > > > > require an atomic load (any write to ht->tail must also be atomic).
> > > > > > >
> > > > > > > Using volatile and explicit compiler (or processor) memory barriers
> > > > > > > (fences) is the legacy pre-C11 way of accomplishing these things. There's
> > > > > > > a reason why C11/C++11 moved away from the old ways.
> > > > > > > >
> > > > > > > > > __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
> > > > > > > > --
> > > > > > > > 2.7.4
> > > > > > > >
> > > > >
> > > > >
> > > > >
> > >
> >
>
>
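
For reference, the wait loop discussed in this thread can be sketched in C roughly as below. This is only a minimal, self-contained sketch and not the actual rte_ring_c11_mem.h code: struct headtail, cpu_pause() and update_tail_sketch() are hypothetical stand-ins (for the ring's head/tail pair, rte_pause() and update_tail respectively), and it assumes a compiler that provides the GCC/Clang __atomic builtins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ring's producer/consumer head and tail. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

/* Hypothetical placeholder for rte_pause(); deliberately NOT a compiler barrier. */
static inline void cpu_pause(void)
{
}

/*
 * Sketch of the tail update: in the multi-producer/consumer case, wait
 * until the preceding enqueue/dequeue has published its tail, then
 * publish ours.
 */
static inline void
update_tail_sketch(struct headtail *ht, uint32_t old_val, uint32_t new_val,
		   uint32_t single)
{
	assert(ht != NULL);

	if (!single) {
		/*
		 * Relaxed atomic load: no ordering against other accesses is
		 * requested, but the compiler must perform the load on every
		 * iteration and cannot hoist or combine it, even though
		 * cpu_pause() gives it no barrier.
		 */
		while (old_val != __atomic_load_n(&ht->tail, __ATOMIC_RELAXED))
			cpu_pause();
	}

	/* Release store: earlier ring slot accesses become visible before the new tail. */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```

Swapping the __atomic_load_n() call for a plain read of ht->tail and disassembling both builds is one way to reproduce the kind of comparison Jerin posted; whether the extra ADD appears will depend on the compiler version and target.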