[RFC v3] net/af_packet: make stats reset reliable
Mattias Rönnblom
hofors at lysator.liu.se
Sun May 26 09:07:00 CEST 2024
On 2024-05-09 16:19, Morten Brørup wrote:
>> From: Morten Brørup [mailto:mb at smartsharesystems.com]
>> Sent: Thursday, 9 May 2024 13.37
>>
>>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
>>> Sent: Thursday, 9 May 2024 11.30
>>>
>>> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
>>>>> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
>>>>> Sent: Wednesday, 8 May 2024 22.54
>>>>>
>>>>> On Wed, 8 May 2024 20:48:06 +0100
>>>>> Ferruh Yigit <ferruh.yigit at amd.com> wrote:
>>>>>
>>>>>>>
>>>>>>> The idea of load tearing is crazy talk of integral types. It
>>> would
>>>>> break so many things.
>>>>>>> It is the kind of stupid compiler thing that would send Linus
>> on
>>> a
>>>>> rant and get
>>>>>>> the GCC compiler writers in trouble.
>>>>>>>
>>>>>>> The DPDK has always favored performance over strict safety
>> guard
>>>>> rails everywhere.
>>>>>>> Switching to making every statistic an atomic operation is not
>>> in
>>>>> the spirit of
>>>>>>> what is required. There is no strict guarantee necessary here.
>>>>>>>
>>>>>>
>>>>>> I kind of agree with Stephen.
>>>>>>
>>>>>> Thanks Mattias, Morten & Stephen, it was informative discussion.
>>> But
>>>>> for
>>>>>> *SW drivers* stats update and reset is not core functionality
>> and
>>> I
>>>>>> think we can be OK to get hit on corner cases, instead of
>>>>>> over-engineering or making code more complex.
>>>>>
>>>>>
>>>>> I forgot the case of 64 bit values on 32 bit platforms!
>>>>> Mostly because haven't cared about 32 bit for years...
>>>>>
>>>>> The Linux kernel uses some wrappers to handle this.
>>>>> On 64 bit platforms they become noop.
>>>>> On 32 bit platform, they are protected by a seqlock and updates
>> are
>>>>> wrapped by the sequence count.
>>>>>
>>>>> If we go this way, then doing similar Noop on 64 bit and atomic or
>>>>> seqlock
>>>>> on 32 bit should be done, but in common helper.
>>>>>
>>>>> Looking inside FreeBSD, it looks like that has changed over the
>>> years as
>>>>> well.
>>>>>
>>>>> if_inc_counter
>>>>> counter_u64_add
>>>>> atomic_add_64
>>>>> But the counters are always per-cpu in this case. So although it
>>> does
>>>>> use
>>>>> locked operation, will always be uncontended.
>>>>>
>>>>>
>>>>> PS: Does DPDK still actually support 32 bit on x86? Can it be
>>> dropped
>>>>> this cycle?
>>>>
>>>> We cannot drop 32 bit architecture support altogether.
>>>>
>>>> But, unlike the Linux kernel, DPDK doesn't need to support ancient
>> 32
>>> bit architectures.
>>>> If the few 32 bit architectures supported by DPDK provide non-
>> tearing
>>> 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
>>> counters.
>>>>
>>>> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
>>> architecture) and 32 bit ARMv8.
>>>> I don't think DPDK support any other 32 bit architectures.
>>>>
>>>>
>>>> As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64
>>> bit non-tearing load/store.
>>>>
>>>
>>> Testing this a little in godbolt, I see gcc using xmm registers on 32-
>>> bit
>>> when updating 64-bit counters, but clang doesn't seem to do so, but
>>> instead
>>> does 2 stores when writing back the 64 value. (I tried with both
>>> volatile
>>> and non-volatile 64-bit values, just to see if volatile would
>> encourage
>>> clang to do a single store).
>>>
>>> GCC: https://godbolt.org/z/9eqKfT3hz
>>> Clang: https://godbolt.org/z/PT5EqKn4c
>>
>> Interesting.
>> I guess this can be fixed by manually implementing what GCC does.
>>
>> I'm more concerned about finding a high-performance (in the fast path)
>> 64 bit counter solution for 32 bit ARM.
>
> Reading up on the topic, and continuing Bruce's experiment on Godbolt, it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store Register Dual) instructions, which load/store 64 bit from memory into two registers at once.
>
> Clang is emits more efficient code without volatile.
> GCC requires volatile to use STRD.
>
> Clang: https://godbolt.org/z/WjdTq6EKh
> GCC: https://godbolt.org/z/qq9j7d4Ea
>
> Summing it up, it is possible to implement non-tearing 64 bit high-performance (lockless, barrier-free) counters on the 32 bit architectures supported by DPDK.
>
> But the implementation is both architecture and compiler specific.
> So it seems a "64 bit counters" library would be handy. (Or a "non-tearing 64 bit integers" library, for support of the signed variant too; but I don't think we need that.)
> We can use uint64_t as the underlying type and type cast in the library (where needed by the specific compiler/architecture), or introduce a new rte_ctr64_t type to ensure that accessor functions are always used and the developer doesn't have to worry about tearing on 32 bit architectures.
>
> The most simple variant of such a library only provides load and store functions. The API would look like this:
>
> uint64_t inline
> rte_ctr64_get(const rte_ctr64_t *const ctr);
>
> void inline
> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
>
> And if some CPU offers special instructions for increment or addition, faster (regarding performance) and/or more compact (regarding instruction memory) than a sequence of load-add-store instructions:
>
> void inline
> rte_ctr64_inc(rte_ctr64_t *const ctr);
>
> void inline
> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
>
> <feature creep>
> And perhaps atomic variants of all these functions, with explicit and/or relaxed memory ordering, for counters shared by multiple threads.
Isn't rte_ctr64_get() already an atomic load with relaxed memory ordering?
> </feature creep>
>
More information about the dev
mailing list