[RFC v3] net/af_packet: make stats reset reliable
Mattias Rönnblom
hofors at lysator.liu.se
Sun May 26 09:10:12 CEST 2024
On 2024-05-10 11:14, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
>> Sent: Friday, 10 May 2024 06.56
>>
>> On Thu, 9 May 2024 16:19:08 +0200
>> Morten Brørup <mb at smartsharesystems.com> wrote:
>>
>>>> From: Morten Brørup [mailto:mb at smartsharesystems.com]
>>>> Sent: Thursday, 9 May 2024 13.37
>>>>
>>>>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
>>>>> Sent: Thursday, 9 May 2024 11.30
>>>>>
>>>>> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
>>>>>>> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
>>>>>>> Sent: Wednesday, 8 May 2024 22.54
>>>>>>>
>>>>>>> On Wed, 8 May 2024 20:48:06 +0100
>>>>>>> Ferruh Yigit <ferruh.yigit at amd.com> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> The idea of load tearing is crazy talk of integral types.
>> It
>>>>> would
>>>>>>> break so many things.
>>>>>>>>> It is the kind of stupid compiler thing that would send
>> Linus
>>>> on
>>>>> a
>>>>>>> rant and get
>>>>>>>>> the GCC compiler writers in trouble.
>>>>>>>>>
>>>>>>>>> The DPDK has always favored performance over strict safety
>>>> guard
>>>>>>> rails everywhere.
>>>>>>>>> Switching to making every statistic an atomic operation is
>> not
>>>>> in
>>>>>>> the spirit of
>>>>>>>>> what is required. There is no strict guarantee necessary
>> here.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I kind of agree with Stephen.
>>>>>>>>
>>>>>>>> Thanks Mattias, Morten & Stephen, it was informative
>> discussion.
>>>>> But
>>>>>>> for
>>>>>>>> *SW drivers* stats update and reset is not core
>> functionality
>>>> and
>>>>> I
>>>>>>>> think we can be OK to get hit on corner cases, instead of
>>>>>>>> over-engineering or making code more complex.
>>>>>>>
>>>>>>>
>>>>>>> I forgot the case of 64 bit values on 32 bit platforms!
>>>>>>> Mostly because haven't cared about 32 bit for years...
>>>>>>>
>>>>>>> The Linux kernel uses some wrappers to handle this.
>>>>>>> On 64 bit platforms they become noop.
>>>>>>> On 32 bit platform, they are protected by a seqlock and
>> updates
>>>> are
>>>>>>> wrapped by the sequence count.
>>>>>>>
>>>>>>> If we go this way, then doing similar Noop on 64 bit and
>> atomic or
>>>>>>> seqlock
>>>>>>> on 32 bit should be done, but in common helper.
>>>>>>>
>>>>>>> Looking inside FreeBSD, it looks like that has changed over
>> the
>>>>> years as
>>>>>>> well.
>>>>>>>
>>>>>>> if_inc_counter
>>>>>>> counter_u64_add
>>>>>>> atomic_add_64
>>>>>>> But the counters are always per-cpu in this case. So although
>> it
>>>>> does
>>>>>>> use
>>>>>>> locked operation, will always be uncontended.
>>>>>>>
>>>>>>>
>>>>>>> PS: Does DPDK still actually support 32 bit on x86? Can it be
>>>>> dropped
>>>>>>> this cycle?
>>>>>>
>>>>>> We cannot drop 32 bit architecture support altogether.
>>>>>>
>>>>>> But, unlike the Linux kernel, DPDK doesn't need to support
>> ancient
>>>> 32
>>>>> bit architectures.
>>>>>> If the few 32 bit architectures supported by DPDK provide non-
>>>> tearing
>>>>> 64 bit loads/stores, we don't need locks (in the fast path) for 64
>> bit
>>>>> counters.
>>>>>>
>>>>>> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
>>>>> architecture) and 32 bit ARMv8.
>>>>>> I don't think DPDK support any other 32 bit architectures.
>>>>>>
>>>>>>
>>>>>> As Mattias mentioned, 32 bit x86 can use xmm registers to
>> provide 64
>>>>> bit non-tearing load/store.
>>>>>>
>>>>>
>>>>> Testing this a little in godbolt, I see gcc using xmm registers on
>> 32-
>>>>> bit
>>>>> when updating 64-bit counters, but clang doesn't seem to do so,
>> but
>>>>> instead
>>>>> does 2 stores when writing back the 64 value. (I tried with both
>>>>> volatile
>>>>> and non-volatile 64-bit values, just to see if volatile would
>>>> encourage
>>>>> clang to do a single store).
>>>>>
>>>>> GCC: https://godbolt.org/z/9eqKfT3hz
>>>>> Clang: https://godbolt.org/z/PT5EqKn4c
>>>>
>>>> Interesting.
>>>> I guess this can be fixed by manually implementing what GCC does.
>>>>
>>>> I'm more concerned about finding a high-performance (in the fast
>> path)
>>>> 64 bit counter solution for 32 bit ARM.
>>>
>>> Reading up on the topic, and continuing Bruce's experiment on Godbolt,
>> it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store
>> Register Dual) instructions, which load/store 64 bit from memory into
>> two registers at once.
>>>
>>> Clang is emits more efficient code without volatile.
>>> GCC requires volatile to use STRD.
>>>
>>> Clang: https://godbolt.org/z/WjdTq6EKh
>>> GCC: https://godbolt.org/z/qq9j7d4Ea
>>>
>>> Summing it up, it is possible to implement non-tearing 64 bit high-
>> performance (lockless, barrier-free) counters on the 32 bit
>> architectures supported by DPDK.
>>>
>>> But the implementation is both architecture and compiler specific.
>>> So it seems a "64 bit counters" library would be handy. (Or a "non-
>> tearing 64 bit integers" library, for support of the signed variant too;
>> but I don't think we need that.)
>>> We can use uint64_t as the underlying type and type cast in the
>> library (where needed by the specific compiler/architecture), or
>> introduce a new rte_ctr64_t type to ensure that accessor functions are
>> always used and the developer doesn't have to worry about tearing on 32
>> bit architectures.
>>>
>>> The most simple variant of such a library only provides load and store
>> functions. The API would look like this:
>>>
>>> uint64_t inline
>>> rte_ctr64_get(const rte_ctr64_t *const ctr);
>>>
>>> void inline
>>> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
>>>
>>> And if some CPU offers special instructions for increment or addition,
>> faster (regarding performance) and/or more compact (regarding
>> instruction memory) than a sequence of load-add-store instructions:
>>>
>>> void inline
>>> rte_ctr64_inc(rte_ctr64_t *const ctr);
>>>
>>> void inline
>>> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
>
> Note: 32 bit architectures might achieve higher performance if the "value" parameter to rte_ctr64_add() is unsigned long (or unsigned int) instead of uint64_t.
>
If the rte_ctr64_add() implementation is kept in the header file (or,
always in case of LTO), I don't see how that could make a difference,
assuming the operand is of a smaller (<64 bits) type.
>>>
>>> <feature creep>
>>> And perhaps atomic variants of all these functions, with explicit
>> and/or relaxed memory ordering, for counters shared by multiple threads.
>>> </feature creep>
>>>
>>
>>
>> This kind of what I am experimenting with but...
>
> Excellent!
>
>> Intend to keep the details of the counters read and update in one file
>> and not as inlines.
>
> Please note that traffic management applications maintain many counters (e.g. per-flow, per-subscriber and per-link packet and byte counters, some also per QoS class), so rte_ctr64_add() must have the highest performance technically possible.
>
> For reference, the packet scheduling code in our application updates 28 statistics counters per burst of packets. (In addition to internal state variables for e.g. queue lenghts.)
>
> Furthermore, our application processes and displays live statistics with one second granularity, using a separate thread. Although statistics processing is not part of the fast path, the sheer number of counters processed requires high performance read access to those counters.
>
>
> <more feature creep>
> Some groups of counters are maintained locally in the inner loop, and then added in bulk to the "public" statistics afterwards. Conceptually:
>
> struct stats_per_prio {
> uint64_t forwarded_bytes;
> uint64_t forwarded_packets;
> uint64_t marked_bytes;
> uint64_t marked_packets;
> uint64_t dropped_bytes;
> uint64_t dropped_packets;
> };
>
> If this is a common design pattern in DPDK (drivers, libraries and/or applications), perhaps also provide a performance optimized function for architectures offering vector instructions:
>
> void inline
> rte_ctr64_add_bulk(
> rte_ctr64_t *const ctrs,
> const unsigned long *const values,
> const unsigned int n /* compile time constant */);
>
> This slightly resembles the Linux kernel's design pattern, where counters are updated in bulk, protected by a common lock for the bulk update. (However, DPDK has no lock, so the motivation for optimizing for this design pattern is only "nice to have".)
>
> PS:
> Many DPDK applications are 64 bit only, including the SmartShare appliances, and can easily manage 64 bit counters without all this.
> However, if the DPDK project still considers 32 bit architectures first class citizens, 64 bit counters should have high performance on 32 bit architectures too.
>
More information about the dev
mailing list