[PATCH v9 1/8] eal: generic 64 bit counter

Mattias Rönnblom hofors at lysator.liu.se
Sun May 26 16:46:47 CEST 2024


On 2024-05-22 21:51, Stephen Hemminger wrote:
> On Wed, 22 May 2024 12:01:12 -0700
> Tyler Retzlaff <roretzla at linux.microsoft.com> wrote:
> 
>> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
>>>> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
>>>> Sent: Wednesday, 22 May 2024 17.38
>>>>
>>>> On Wed, 22 May 2024 10:31:39 +0200
>>>> Morten Brørup <mb at smartsharesystems.com> wrote:
>>>>    
>>>>>> +/* On 32 bit platform, need to use atomic to avoid load/store
>>>> tearing */
>>>>>> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
>>>>>
>>>>> As shown by Godbolt experiments discussed in a previous thread [2],
>>>> non-tearing 64 bit counters can be implemented without using atomic
>>>> instructions on all 32 bit architectures supported by DPDK. So we should
>>>> use the counter/offset design pattern for RTE_ARCH_32 too.
>>>>>
>>>>> [2]:
>>>> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smarts
>>>> erver.smartshare.dk/
>>>>
>>>>
>>>> This code built with -O3 and -m32 on godbolt shows split problem.
>>>>
>>>> #include <stdint.h>
>>>>
>>>> typedef uint64_t rte_counter64_t;
>>>>
>>>> void
>>>> rte_counter64_add(rte_counter64_t *counter, uint32_t val)
>>>> {
>>>> 	*counter += val;
>>>> }
>>>> …	*counter = val;
>>>> }
>>>>
>>>> rte_counter64_add:
>>>>          push    ebx
>>>>          mov     eax, DWORD PTR [esp+8]
>>>>          xor     ebx, ebx
>>>>          mov     ecx, DWORD PTR [esp+12]
>>>>          add     DWORD PTR [eax], ecx
>>>>          adc     DWORD PTR [eax+4], ebx
>>>>          pop     ebx
>>>>          ret
>>>>
>>>> rte_counter64_read:
>>>>          mov     eax, DWORD PTR [esp+4]
>>>>          mov     edx, DWORD PTR [eax+4]
>>>>          mov     eax, DWORD PTR [eax]
>>>>          ret
>>>> rte_counter64_set:
>>>>          movq    xmm0, QWORD PTR [esp+8]
>>>>          mov     eax, DWORD PTR [esp+4]
>>>>          movq    QWORD PTR [eax], xmm0
>>>>          ret
>>>
>>> Sure, atomic might be required on some 32 bit architectures and/or with some compilers.
>>
>> in theory i think you should be able to use generic atomics and
>> depending on the target you get codegen that works. it might be
>> something more expensive on 32-bit and nothing on 64-bit etc..
>>
>> what's the damage if we just use atomic generic and relaxed ordering? is
>> the codegen not optimal?
> 
> If we use atomic with relaxed memory order, then compiler for x86 still generates
> a locked increment in the fast path. This costs about 100 extra cycles due
> to cache and prefetch stall. This whole endeavor is an attempt to avoid that.
> 

It's because the code is overly restrictive (e.g., needlessly forcing 
the whole read-modify-read being atomic), in that case, and no fault of 
the compiler.

void add(uint64_t *addr, uint64_t operand)
{
     uint64_t value = __atomic_load_n(addr, __ATOMIC_RELAXED);
     value += operand;
     __atomic_store_n(addr, value, __ATOMIC_RELAXED);
}

->

x86_64

add:
         mov     rax, QWORD PTR [rdi]
         add     rax, rsi
         mov     QWORD PTR [rdi], rax
         ret


x86

add:
         sub     esp, 12
         mov     ecx, DWORD PTR [esp+16]
         movq    xmm0, QWORD PTR [ecx]
         movq    QWORD PTR [esp], xmm0
         mov     eax, DWORD PTR [esp]
         mov     edx, DWORD PTR [esp+4]
         add     eax, DWORD PTR [esp+20]
         adc     edx, DWORD PTR [esp+24]
         mov     DWORD PTR [esp], eax
         mov     DWORD PTR [esp+4], edx
         movq    xmm1, QWORD PTR [esp]
         movq    QWORD PTR [ecx], xmm1
         add     esp, 12
         ret

No locked instructions.

> PS: looking at the locked increment code for 32 bit involves locked compare
> exchange and potential retry. Probably don't care about performance on that platform
> anymore.
> 
> 


More information about the dev mailing list