[PATCH v7 1/4] eal: add lcore poll busyness telemetry
    Konstantin Ananyev 
    konstantin.v.ananyev at yandex.ru
       
    Sat Oct  1 16:17:26 CEST 2022
    
    
  
>> Hi Kevin,
>>
>>>>> Currently, there is no way to measure lcore poll busyness in a 
>>>>> passive way,
>>>>> without any modifications to the application. This patch adds a new 
>>>>> EAL API
>>>>> that will be able to passively track core polling busyness.
>>>>>
>>>>> The poll busyness is calculated by relying on the fact that most 
>>>>> DPDK API's
>>>>> will poll for work (packets, completions, eventdev events, etc). Empty
>>>>> polls can be counted as "idle", while non-empty polls can be 
>>>>> counted as
>>>>> busy. To measure lcore poll busyness, we simply call the telemetry
>>>>> timestamping function with the number of polls a particular code 
>>>>> section
>>>>> has processed, and count the number of cycles we've spent 
>>>>> processing empty
>>>>> bursts. The more empty bursts we encounter, the less cycles we 
>>>>> spend in
>>>>> "busy" state, and the less core poll busyness will be reported.
>>>>>
>>>>> In order for all of the above to work without modifications to the
>>>>> application, the library code needs to be instrumented with calls 
>>>>> to the
>>>>> lcore telemetry busyness timestamping function. The following parts 
>>>>> of DPDK
>>>>> are instrumented with lcore poll busyness timestamping calls:
>>>>>
>>>>> - All major driver API's:
>>>>>     - ethdev
>>>>>     - cryptodev
>>>>>     - compressdev
>>>>>     - regexdev
>>>>>     - bbdev
>>>>>     - rawdev
>>>>>     - eventdev
>>>>>     - dmadev
>>>>> - Some additional libraries:
>>>>>     - ring
>>>>>     - distributor
>>>>>
>>>>> To avoid performance impact from having lcore telemetry support, a 
>>>>> global
>>>>> variable is exported by EAL, and a call to timestamping function is 
>>>>> wrapped
>>>>> into a macro, so that whenever telemetry is disabled, it only takes 
>>>>> one
>>>>> additional branch and no function calls are performed. It is 
>>>>> disabled at
>>>>> compile time by default.
>>>>>
>>>>> This patch also adds a telemetry endpoint to report lcore poll 
>>>>> busyness, as
>>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>>> documentation entry has been added to the howto guides to explain 
>>>>> the usage
>>>>> of the new telemetry endpoints and API.
>>>> As was already mentioned  by other reviewers, it would be much better
>>>> to let application itself decide when it is idle and when it is busy.
>>>> With current approach even for constant polling run-to-completion 
>>>> model there
>>>> are plenty of opportunities to get things wrong and provide 
>>>> misleading statistics.
>>>> My special concern - inserting it into ring dequeue code.
>>>> Ring is used for various different things, not only pass packets 
>>>> between threads (mempool, etc.).
>>>> Blindly assuming that ring dequeue returns empty means idle cycles 
>>>> seams wrong to me.
>>>> Which make me wonder should we really hard-code these calls into 
>>>> DPDK core functions?
>>>> If you still like to introduce such stats, might be better to 
>>>> implement it via callback mechanism.
>>>> As I remember nearly all our drivers (net, crypto, etc.) do support it.
>>>> That way our generic code   will remain unaffected, plus user will 
>>>> have ability to enable/disable
>>>> it on a per device basis.
>>> Thanks for your feedback, Konstantin.
>>>
>>> You are right in saying that this approach won't be 100% suitable for
>>> all use-cases, but should be suitable for the majority of applications.
>> First of all - could you explain how did you measure what is the 
>> 'majority' of DPDK applications?
>> And how did you conclude that it definitely work for all the apps in 
>> that 'majority'?
>> Second what bother me with that approach - I don't see s clear and 
>> deterministic way
>> for the user to understand would that stats work properly for his app 
>> or not.
>> (except manually ananlyzing his app code).
> 
> All of the DPDK example applications we've tested with (l2fwd, l3fwd + 
> friends, testpmd, distributor, dmafwd) report lcore poll busyness and 
> respond to changing traffic rates etc. We've also compared the reported 
> busyness to similar metrics reported by other projects such as VPP and 
> OvS, and found the reported busyness matches with a difference of +/- 
> 1%. In addition to the DPDK example applications, we've have shared our 
> plans with end customers and they have confirmed that the design should 
> work with their applications.
I am sure l3fwd and testpmd should be ok, I am talking about
something more complicated/unusual.
Below are few examples on top of my head when I think your approach
will generate invalid stats, feel free to correct me, if I am wrong.
1) App doing some sort of bonding itslef, i.e:
struct rte_mbuf pkts[N*2];
k = rte_eth_rx_burst(p0, q0, pkts, N);
n = rte_eth_rx_burst(p1, q1, pkts + k, N);
/*process all packets from both ports at once */
if (n + k != 0)
    process_pkts(pkts, n + k);
Now, as I understand, if n==0, then all cycles spent
in process_pkts() will be accounted as idle.
2) App doing something similar to what pdump library does
(creates a copy of a packet and sends it somewhere).
n =rte_eth_rx_burst(p0, q0, &pkt, 1);
if (n != 0) {
   dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
   if (dup_pkt != NULL)
      process_dup_pkt(dup_pkt);
   process_pkt(pkt);
}
that relates to ring discussion below:
if there are no mbufs in dup_mp, then ring_deque() will fail
and process_pkt() will be accounted as idle.
3) App dequeues from ring in a bit of unusual way:
/* idle spin loop */
while ((n = rte_ring_count(ring)) == 0)
   ret_pause();
n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
if (n != 0)
   process_pkts(pkts, n);
here, we can end-up accounting cycles spent in
idle spin loop as busy.
4) Any thread that generates TX traffic on it's own
(something like testpmd tx_only fwd mode)
5) Any thread that depends on both dequeue and enqueue:
n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
...
/* loop till all packets are sent out successfully */
while(rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
    rte_pause();
Now, if n > 0, all cycles spent in enqueue() will be accounted
as 'busy', though from my perspective they probably should
be considered as 'idle'.
Also I expect some problems when packet processing is done inside
rx callbacks, but that probably can be easily fixed.
> 
>>> It's worth keeping in mind that this feature is compile-time disabled by
>>> default, so there is no impact to any application/user that does not
>>> wish to use this, for example applications where this type of busyness
>>> is not useful, or for applications that already use other mechanisms to
>>> report similar telemetry.
>> Not sure that adding in new compile-time option disabled by default is 
>> a good thing...
>> For me it would be much more preferable if we'll go through a more 
>> 'standard' way here:
>> a) define clear API to enable/disable/collect/report such type of stats.
>> b) use some of our sample apps to demonstrate how to use it properly 
>> with user-specific code.
>> c) if needed implement some 'silent' stats collection for limited 
>> scope of apps via callbacks -
>> let say for run-to-completion apps that do use ether and crypto devs 
>> only.
> 
> With the compile-time option, its just one build flag for lots of 
> applications to silently benefit from this.
There could be a lot of useful and helpfull stats
that user would like to collect (idle/busy, processing latency, etc.).
But, if for each such case we will hard-code new stats collection
into our fast data-path code, then very soon it will become
completely bloated and unmaintainable.
I think we need some generic approach for such extra stats collection.
Callbacks could be one way, Jerin in another mail suggested using 
existing trace-point hooks, might be it worth to explore it further.
> 
>>   However, the upside for applications that do
>>> wish to use this is that there are no code changes required (for the
>>> most part), the feature simply needs to be enabled at compile-time via
>>> the meson option.
>>>
>>> In scenarios where contextual awareness of the application is needed in
>>> order to report more accurate "busyness", the
>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>> sections of code as "busy" or "idle". This way, the application can
>>> assume control of determining the poll busyness of its lcores while
>>> leveraging the telemetry hooks adding in this patchset.
>>>
>>> We did initially consider implementing this via callbacks, however we
>>> found this approach to have 2 main drawbacks:
>>> 1. Application changes are required for all applications wanting to
>>> report this telemetry - rather than the majority getting it for free.
>> Didn't get it - why callbacks approach would require user-app changes?
>> In other situations - rte_power callbacks, pdump, etc. it works 
>> transparent to
>> user-leve code.
>> Why it can't be done here in a similar way?
> 
>  From my understanding, the callbacks would need to be registered by the 
> application at the very least (and the callback would have to be 
> registered per device/pmd/lib).
Callbacks can be registered by library itself.
AFAIK, latenc-ystats, power and pdump libraries - all use similar 
approach. user calls something like xxx_stats_enable() and then library 
can iterate over all available devices and setup necessary callbacks.
same for xxx_stats_disable().
>>
>>> 2. Ring does not have callback support, meaning pipelined applications
>>> could not report lcore poll busyness telemetry with this approach.
>> That's another big concern that I have:
>> Why you consider that all rings will be used for a pipilines between 
>> threads and should
>> always be accounted by your stats?
>> They could be used for dozens different purposes.
>> What if that ring is used for mempool, and ring_dequeue() just means 
>> we try to allocate
>> an object from the pool? In such case, why failing to allocate an 
>> object should mean
>> start of new 'idle cycle'?
> 
> Another approach could be taken here if the mempool interactions are of 
> concern.
> 
>  From our understanding, mempool operations use the "_bulk" APIs, 
> whereas polling operations use the "_burst" APIs. Would only 
> timestamping on the "_burst" APIs be better here? That way the mempool 
> interactions won't be counted towards the busyness.
Well, it would help to solve one particular case,
but in general I still think it is incomplete and error-prone.
What if pipeline app will use ring_count/ring_dequeue_bulk(),
or even ZC ring API?
What if app will use something different from rte_ring to pass
packets between threads/processes?
As I said before, without some clues from the app, it is probably
not possible to collect such stats in a proper way.
> Including support for pipelined applications using rings is key for a 
> number of usecases, this was highlighted as part of the customer 
> feedback when we shared the design.
> 
>>
>>> Eventdev is another driver which would be completely missed with this
>>> approach.
>> Ok, I see two ways here:
>> - implement CB support for eventdev.
>> -meanwhile clearly document that this stats are not supported for 
>> eventdev  scenarios (yet).
    
    
More information about the dev
mailing list