[RFC v2] net/af_packet: make stats reset reliable
Mattias Rönnblom
hofors at lysator.liu.se
Thu May 2 19:37:28 CEST 2024
On 2024-05-02 16:22, Ferruh Yigit wrote:
> On 5/2/2024 6:51 AM, Mattias Rönnblom wrote:
>> On 2024-05-01 18:19, Ferruh Yigit wrote:
>>> On 4/28/2024 4:11 PM, Mattias Rönnblom wrote:
>>>> On 2024-04-26 16:38, Ferruh Yigit wrote:
>>>>> For stats reset, use an offset instead of zeroing out the actual
>>>>> stats values; get_stats() displays the diff between the stats and
>>>>> the offset.
>>>>> This way the stats are only updated in the datapath and the offset
>>>>> is only updated in the stats reset function. This makes the stats
>>>>> reset function more reliable.
>>>>>
>>>>> As the stats are only written by a single thread, we can remove the
>>>>> 'volatile' qualifier, which should improve performance in the
>>>>> datapath.
>>>>>
>>>>
>>>> volatile wouldn't help you if you had multiple writers, so that can't
>>>> be the reason for its removal. It would be more accurate to say it
>>>> should be replaced with atomic updates. If you don't use volatile and
>>>> don't use atomics, you have to consider whether the compiler can reach
>>>> the conclusion that it does not need to store the counter value for
>>>> future use *for that thread*, since otherwise I don't think the store
>>>> actually needs to occur. Since DPDK statistics tend to work, it's
>>>> pretty obvious that current compilers tend not to reach this
>>>> conclusion.
>>>>
>>>
>>> Thanks Mattias for clarifying why we need volatile or atomics even with
>>> single writer.
>>>
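To make the failure mode concrete, here is a minimal sketch of a counter
without volatile or atomics (the function and counter names are made up
for illustration, not taken from the driver). The compiler is allowed to
keep the running count in a register and write it back only once, or, in
principle, even later than that:

static uint64_t rx_pkts;

void rx_burst(int nb_pkts)
{
	int i;

	/* may legally be compiled as a single "rx_pkts += nb_pkts",
	 * with one load and one store, instead of nb_pkts stores */
	for (i = 0; i < nb_pkts; i++)
		rx_pkts++;
}
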
>>>> If this should be done 100% properly, the update operation should be a
>>>> non-atomic load, non-atomic add, and an atomic store. Similarly, for the
>>>> reset, the offset store should be atomic.
>>>>
>>>
>>> ack
>>>
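As a sketch, the "100% proper" scheme described above could look like the
following with GCC built-ins (assuming plain uint64_t counter and offset
variables; the names are illustrative):

static void
counter_add(uint64_t *counter, uint64_t operand)
{
	uint64_t count = *counter;	/* non-atomic load */

	count += operand;		/* non-atomic add */

	/* atomic store, so a concurrent reader can never observe
	 * a torn counter value */
	__atomic_store_n(counter, count, __ATOMIC_RELAXED);
}

static void
counter_reset(uint64_t *counter, uint64_t *offset)
{
	/* the offset store is also atomic */
	__atomic_store_n(offset, *counter, __ATOMIC_RELAXED);
}
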
>>>> Considering the state of the rest of the DPDK code base, I think a
>>>> non-atomic, non-volatile solution is also fine.
>>>>
>>>
>>> Yes, this seems to work in practice, but I guess it's better to follow
>>> the above suggestion.
>>>
>>>> (That said, I think we're better off just deprecating stats reset
>>>> altogether, and returning -ENOTSUP here meanwhile.)
>>>>
>>>
>>> As long as reset is reliable (here I mean it resets the stats on every
>>> call) and doesn't impact datapath performance, I am for continuing
>>> with it. Returning "not supported" won't bring any benefit to users.
>>>
>>>>> Signed-off-by: Ferruh Yigit <ferruh.yigit at amd.com>
>>>>> ---
>>>>> Cc: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
>>>>> Cc: Stephen Hemminger <stephen at networkplumber.org>
>>>>> Cc: Morten Brørup <mb at smartsharesystems.com>
>>>>>
>>>>> This update was triggered by the mailing list discussion [1].
>>>>>
>>>>> [1]
>>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>>>>
>>>>> v2:
>>>>> * Remove wrapping check for stats
>>>>> ---
>>>>> drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>>>>> 1 file changed, 41 insertions(+), 25 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db5886..10c8e1e50139 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>>>> uint16_t in_port;
>>>>> uint8_t vlan_strip;
>>>>> - volatile unsigned long rx_pkts;
>>>>> - volatile unsigned long rx_bytes;
>>>>> + uint64_t rx_pkts;
>>>>> + uint64_t rx_bytes;
>>>>> + uint64_t rx_pkts_offset;
>>>>> + uint64_t rx_bytes_offset;
>>>>
>>>> I suggest you introduce a separate struct for reset-able counters. It'll
>>>> make things cleaner, and you can sneak in atomics without too much
>>>> atomics-related bloat.
>>>>
>>>> struct counter
>>>> {
>>>>         uint64_t count;
>>>>         uint64_t offset;
>>>> };
>>>>
>>>> /../
>>>>         struct counter rx_pkts;
>>>>         struct counter rx_bytes;
>>>> /../
>>>>
>>>> static uint64_t
>>>> counter_value(struct counter *counter)
>>>> {
>>>>         uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>>>>         uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);
>>>>
>>>>         return count - offset;
>>>> }
>>>>
>>>> static void
>>>> counter_reset(struct counter *counter)
>>>> {
>>>>         uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>>>>
>>>>         __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
>>>> }
>>>>
>>>> static void
>>>> counter_add(struct counter *counter, uint64_t operand)
>>>> {
>>>>         __atomic_store_n(&counter->count, counter->count + operand, __ATOMIC_RELAXED);
>>>> }
>>>
>>> Ack for separate struct for reset-able counters.
>>>
>>>> You'd have to port this to <rte_stdatomic.h> calls, which prevents
>>>> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must
>>>> be replaced with explicit relaxed atomic loads. Otherwise, if you just
>>>> use "counter->count", that would be an atomic load with sequential
>>>> consistency memory order on C11 atomics-based builds, which would
>>>> result in a barrier, at least on weakly ordered machines (e.g., ARM).
>>>>
>>>
>>> I am not sure I understand the above.
>>> As the load and add will be non-atomic, why not access them directly,
>>> like:
>>> `uint64_t count = counter->count;`
>>>
>>
>> In case count is _Atomic (i.e., on enable_stdatomic=true builds), "count
>> = counter->count" will imply a memory barrier. On x86_64, I think it
>> will "only" be a compiler barrier (i.e., preventing optimization). On
>> weakly ordered machines, it will result in a barrier-instruction (or an
>> instruction-which-is-also-a-barrier, like in the example below).
>>
>> #include <stdatomic.h>
>>
>> int relaxed_load(_Atomic int *p)
>> {
>>         return atomic_load_explicit(p, memory_order_relaxed);
>> }
>>
>> int direct_load(_Atomic int *p)
>> {
>>         return *p;
>> }
>>
>> GCC 13.2 ARM64 ->
>>
>> relaxed_load:
>>         ldr w0, [x0]
>>         ret
>> direct_load:
>>         ldar w0, [x0]
>>         ret
>>
>>
>
>
> Do we need to declare count as '_Atomic'? I wasn't planning to make the
> variable _Atomic; this way the assignment won't introduce any memory
> barrier.
>
To use atomics in DPDK, the current requirement seems to be to use
RTE_ATOMIC(). That macro expands to _Atomic in enable_stdatomic=true
builds, and nothing otherwise.
Carefully crafted code using atomics will achieve the same performance
and be more correct than the non-atomic variant. However, in practice, I
think the non-atomic variant is very likely to produce the desired
results.
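
For reference, this is roughly what RTE_ATOMIC() amounts to (simplified
from memory; the exact guard macro in <rte_stdatomic.h> may differ):

#ifdef RTE_ENABLE_STDATOMIC
#define RTE_ATOMIC(type) _Atomic(type)
#else
#define RTE_ATOMIC(type) type
#endif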
>
>>> So my understanding is: remove `volatile`, load and add without
>>> atomics, and only use a relaxed ordered atomic for the store (to
>>> ensure the value in the register is stored to memory).
>>>
>>
>> Yes, that would be the best option, if the DPDK atomics API allowed its
>> implementation - but it doesn't. At least not if you care about what
>> happens in enable_stdatomic=true builds.
>>
>> The second-best option is to use a rte_memory_order_relaxed atomic load,
>> a regular non-atomic add, and a relaxed atomic store.
>>
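A sketch of that second-best option in <rte_stdatomic.h> terms, assuming
the struct counter fields are declared as RTE_ATOMIC(uint64_t):

static void
counter_add(struct counter *counter, uint64_t operand)
{
	uint64_t count = rte_atomic_load_explicit(&counter->count,
						  rte_memory_order_relaxed);

	/* the add happens on the plain temporary; only the store
	 * back to the counter is atomic */
	rte_atomic_store_explicit(&counter->count, count + operand,
				  rte_memory_order_relaxed);
}
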
>>> I will send a new version of the RFC with the above understanding.
>>>
>>>> I would still use a struct and some helper functions even for the
>>>> less ambitious, non-atomic variant.
>>>>
>>>> The only drawback of using GCC built-in atomics here, versus an
>>>> atomic- and volatile-free approach, is that current compilers seem to
>>>> refuse to merge atomic stores. It's beyond me why this is the case. If
>>>> you store to a variable twice in quick succession, it'll be two store
>>>> machine instructions, even in cases where the compiler *knows* the
>>>> value is identical. So you get volatile semantics, even though you
>>>> didn't ask for them. Weird.
>>>>
>>>> So if you have a loop, you may want to do a single counter_add() at
>>>> the end from a temporary, to get the final 0.001% of performance.
>>>>
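The non-merging of atomic stores is easy to see with a sketch like the
one below; current compilers emit two store instructions, even though the
first store is dead:

#include <stdatomic.h>

void two_stores(_Atomic long *p)
{
	atomic_store_explicit(p, 1, memory_order_relaxed);
	atomic_store_explicit(p, 1, memory_order_relaxed);
}
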
>>>
>>> ack
>>>
>>> I can't really say which one of the following is better (because of
>>> the store in an empty poll), but I will keep it as it is (b.):
>>>
>>> a.
>>> for (i = 0; i < nb_pkt; i++) {
>>>         stats += 1;
>>> }
>>>
>>>
>>> b.
>>> for (i = 0; i < nb_pkt; i++) {
>>>         tmp += 1;
>>> }
>>> stats += tmp;
>>>
>>>
>>> c.
>>> for (i = 0; i < nb_pkt; i++) {
>>>         tmp += 1;
>>> }
>>> if (tmp)
>>>         stats += tmp;
>>>
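A sketch of what b. amounts to in the rx path; receive_one_frame() is a
made-up placeholder for the real frame handling, and counter_add() is the
helper discussed above:

static uint16_t
eth_af_packet_rx_sketch(struct pkt_rx_queue *pkt_q, struct rte_mbuf **bufs,
			uint16_t nb_pkts)
{
	uint16_t num_rx = 0;
	uint64_t num_rx_bytes = 0;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++) {
		struct rte_mbuf *mbuf = receive_one_frame(pkt_q);

		if (mbuf == NULL)
			break;

		bufs[num_rx] = mbuf;
		num_rx_bytes += rte_pktmbuf_pkt_len(mbuf);
		num_rx++;
	}

	/* one counter update per burst, from the temporaries */
	counter_add(&pkt_q->rx_pkts, num_rx);
	counter_add(&pkt_q->rx_bytes, num_rx_bytes);

	return num_rx;
}
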
>>>
>>>
>>>> If the tech board thinks MT-safe, reset-able, software-managed
>>>> statistics are the future (as opposed to dropping reset support, for
>>>> example), I think this stuff should go into a separate header file, so
>>>> other PMDs can reuse it. Maybe out of scope for this patch.
>>>>
>>>
>>> I don't think we need an MT-safe reset; the patch documenting the
>>> current status is already out.
>>
>> Well, what you are working on is an MT-safe reset, in the sense that it
>> allows one (1) resetting thread to properly synchronize with multiple
>> concurrent counter-updating threads.
>>
>> It's not going to be completely MT safe, since you can't have two
>> threads calling the reset function in parallel.
>>
>
> This is what I meant by "MT-safe reset": multiple threads are not
> allowed to call stats reset in parallel.
>
> And for the case of multiple concurrent counter-updating threads, the
> suggestion is to stop forwarding.
>
> The above two are the updates I added to the 'rte_eth_stats_reset()'
> API, and I believe we can continue with this restriction for the API.
>
>> Any change to the API should make this clear.
>>
>>> For HW, stats reset is already reliable, and for SW drivers the
>>> offset-based approach can make it reliable.
>>>
>>> Unless you explicitly asked for it, I don't think this is on the
>>> agenda of the techboard.
>>>
>>>
>