[PATCH v9 1/8] eal: generic 64 bit counter
Stephen Hemminger
stephen at networkplumber.org
Wed May 22 21:51:53 CEST 2024
On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff <roretzla at linux.microsoft.com> wrote:
> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> > > Sent: Wednesday, 22 May 2024 17.38
> > >
> > > On Wed, 22 May 2024 10:31:39 +0200
> > > Morten Brørup <mb at smartsharesystems.com> wrote:
> > >
> > > > > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> > > >
> > > > As shown by Godbolt experiments discussed in a previous thread [2], non-tearing 64 bit counters can be implemented without using atomic instructions on all 32 bit architectures supported by DPDK. So we should use the counter/offset design pattern for RTE_ARCH_32 too.
> > > >
> > > > [2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> > >
> > >
> > > This code built with -O3 and -m32 on godbolt shows split problem.
> > >
> > > #include <stdint.h>
> > >
> > > typedef uint64_t rte_counter64_t;
> > >
> > > void
> > > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > > {
> > > *counter += val;
> > > }
> > > … *counter = val;
> > > }
> > >
> > > rte_counter64_add:
> > > push ebx
> > > mov eax, DWORD PTR [esp+8]
> > > xor ebx, ebx
> > > mov ecx, DWORD PTR [esp+12]
> > > add DWORD PTR [eax], ecx
> > > adc DWORD PTR [eax+4], ebx
> > > pop ebx
> > > ret
> > >
> > > rte_counter64_read:
> > > mov eax, DWORD PTR [esp+4]
> > > mov edx, DWORD PTR [eax+4]
> > > mov eax, DWORD PTR [eax]
> > > ret
> > > rte_counter64_set:
> > > movq xmm0, QWORD PTR [esp+8]
> > > mov eax, DWORD PTR [esp+4]
> > > movq QWORD PTR [eax], xmm0
> > > ret
> >
> > Sure, atomic might be required on some 32 bit architectures and/or with some compilers.
>
> in theory i think you should be able to use generic atomics and
> depending on the target you get codegen that works. it might be
> something more expensive on 32-bit and nothing on 64-bit etc..
>
> what's the damage if we just use atomic generic and relaxed ordering? is
> the codegen not optimal?
If we use atomics with relaxed memory order, the compiler for x86 still generates
a locked increment in the fast path. This costs about 100 extra cycles due
to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.
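For illustration, a minimal sketch of the two options using plain C11 atomics
rather than the DPDK wrappers. All names here are hypothetical and not taken
from the patch, and the second variant assumes each counter has exactly one
writer thread.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical type and function names, not the ones from the patch. */
typedef _Atomic uint64_t sketch_counter64_t;

/* Atomic read-modify-write. Even with relaxed ordering, x86 compilers
 * emit a lock-prefixed instruction for this. */
static inline void
sketch_counter64_add_rmw(sketch_counter64_t *c, uint32_t val)
{
	atomic_fetch_add_explicit(c, val, memory_order_relaxed);
}

/* Assumes a single writer: relaxed load, ordinary add, relaxed store.
 * The load and the store are each atomic, so a reader cannot observe a
 * torn value, but the add as a whole is not atomic. Concurrent writers
 * to the same counter would lose updates. */
static inline void
sketch_counter64_add_single_writer(sketch_counter64_t *c, uint32_t val)
{
	uint64_t cur = atomic_load_explicit(c, memory_order_relaxed);

	atomic_store_explicit(c, cur + val, memory_order_relaxed);
}

/* Reader side, usable from any thread. */
static inline uint64_t
sketch_counter64_read(sketch_counter64_t *c)
{
	return atomic_load_explicit(c, memory_order_relaxed);
}
```

On a 64 bit target the single-writer variant should compile to ordinary loads
and stores, which is the point of avoiding the read-modify-write form.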
PS: On 32 bit, the locked increment is implemented with a locked compare
exchange and a potential retry. We probably don't care about performance on that
platform anymore.
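As a rough sketch of what that retry loop looks like when written out by hand
in C11 (the function name is hypothetical, and the code a compiler actually
generates for a 32 bit target may differ in detail):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: 64 bit add expressed as an explicit
 * compare-exchange loop, similar in shape to what a 32 bit x86
 * compiler typically emits for a 64 bit atomic add. */
static inline void
sketch_counter64_add_cas(_Atomic uint64_t *c, uint32_t val)
{
	uint64_t old = atomic_load_explicit(c, memory_order_relaxed);

	/* On failure, 'old' is refreshed with the current value and the
	 * exchange is retried with a recomputed sum. */
	while (!atomic_compare_exchange_weak_explicit(c, &old, old + val,
			memory_order_relaxed, memory_order_relaxed))
		;
}
```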