[PATCH v7 1/4] eal: add lcore poll busyness telemetry
Konstantin Ananyev
konstantin.v.ananyev at yandex.ru
Sat Oct 1 16:17:26 CEST 2022
>> Hi Kevin,
>>
>>>>> Currently, there is no way to measure lcore poll busyness in a
>>>>> passive way, without any modifications to the application. This
>>>>> patch adds a new EAL API that will be able to passively track
>>>>> core polling busyness.
>>>>>
>>>>> The poll busyness is calculated by relying on the fact that most
>>>>> DPDK APIs will poll for work (packets, completions, eventdev
>>>>> events, etc.). Empty polls can be counted as "idle", while
>>>>> non-empty polls can be counted as busy. To measure lcore poll
>>>>> busyness, we simply call the telemetry timestamping function with
>>>>> the number of polls a particular code section has processed, and
>>>>> count the number of cycles we've spent processing empty bursts.
>>>>> The more empty bursts we encounter, the fewer cycles we spend in
>>>>> the "busy" state, and the lower the reported core poll busyness.
>>>>>
>>>>> In order for all of the above to work without modifications to
>>>>> the application, the library code needs to be instrumented with
>>>>> calls to the lcore telemetry busyness timestamping function. The
>>>>> following parts of DPDK are instrumented with lcore poll busyness
>>>>> timestamping calls:
>>>>>
>>>>> - All major driver APIs:
>>>>> - ethdev
>>>>> - cryptodev
>>>>> - compressdev
>>>>> - regexdev
>>>>> - bbdev
>>>>> - rawdev
>>>>> - eventdev
>>>>> - dmadev
>>>>> - Some additional libraries:
>>>>> - ring
>>>>> - distributor
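For illustration, an instrumented poll path could look roughly like the
sketch below. Only the RTE_LCORE_POLL_BUSYNESS_TIMESTAMP() macro name is
taken from this patchset; the driver function and its helper are invented
for the example:

#include <rte_mbuf.h>

/* sketch of a generic dequeue-style driver function; xyz_hw_dequeue()
 * is an invented stand-in for the real driver-specific work
 */
static uint16_t
xyz_dequeue_burst(void *queue, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        uint16_t n = xyz_hw_dequeue(queue, pkts, nb_pkts);

        /* n == 0: cycles since the last poll are recorded as idle;
         * n > 0: they are recorded as busy
         */
        RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
        return n;
}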
>>>>>
>>>>> To avoid any performance impact from having lcore telemetry
>>>>> support, a global variable is exported by EAL, and the call to
>>>>> the timestamping function is wrapped in a macro, so that whenever
>>>>> telemetry is disabled, it only costs one additional branch and no
>>>>> function calls are performed. It is disabled at compile time by
>>>>> default.
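A minimal sketch of how such a guard could be structured; all identifiers
below are illustrative, not the actual names used in the patch:

/* illustrative names only */
extern int lcore_telemetry_active;            /* global exported by EAL */
void lcore_poll_timestamp(uint16_t nb_items); /* does the real work */

#ifdef LCORE_POLL_BUSYNESS                    /* compile-time switch */
#define LCORE_POLL_TIMESTAMP(n) do {                         \
        if (lcore_telemetry_active) /* one extra branch */   \
                lcore_poll_timestamp(n);                     \
} while (0)
#else
#define LCORE_POLL_TIMESTAMP(n) do { } while (0) /* compiles to nothing */
#endif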
>>>>>
>>>>> This patch also adds a telemetry endpoint to report lcore poll
>>>>> busyness, as well as telemetry endpoints to enable/disable lcore
>>>>> telemetry. A documentation entry has been added to the howto
>>>>> guides to explain the usage of the new telemetry endpoints and
>>>>> API.
>>>> As was already mentioned by other reviewers, it would be much
>>>> better to let the application itself decide when it is idle and
>>>> when it is busy.
>>>> With the current approach, even for a constant-polling
>>>> run-to-completion model there are plenty of opportunities to get
>>>> things wrong and provide misleading statistics.
>>>> My particular concern is inserting it into the ring dequeue code.
>>>> Rings are used for various different things, not only to pass
>>>> packets between threads (mempool, etc.).
>>>> Blindly assuming that an empty ring dequeue means idle cycles
>>>> seems wrong to me.
>>>> Which makes me wonder: should we really hard-code these calls into
>>>> DPDK core functions?
>>>> If you still want to introduce such stats, it might be better to
>>>> implement them via a callback mechanism.
>>>> As I remember, nearly all our drivers (net, crypto, etc.) do
>>>> support it. That way our generic code will remain unaffected, plus
>>>> the user will have the ability to enable/disable it on a
>>>> per-device basis.
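For reference, a callback-based variant could attribute cycles to
busy/idle from a standard ethdev RX callback, along the lines of the
sketch below (the per-lcore counters and their names are invented for
the example):

#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_memory.h>

/* invented per-lcore state for the example */
struct lcore_poll_stats {
        uint64_t busy_cycles;
        uint64_t idle_cycles;
        uint64_t last_tsc;
} __rte_cache_aligned;

static struct lcore_poll_stats poll_stats[RTE_MAX_LCORE];

static uint16_t
poll_busyness_rx_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
        struct rte_mbuf **pkts __rte_unused, uint16_t nb_pkts,
        uint16_t max_pkts __rte_unused, void *arg __rte_unused)
{
        struct lcore_poll_stats *st = &poll_stats[rte_lcore_id()];
        uint64_t now = rte_get_tsc_cycles();

        /* attribute the cycles since the previous poll to busy or idle */
        if (nb_pkts != 0)
                st->busy_cycles += now - st->last_tsc;
        else
                st->idle_cycles += now - st->last_tsc;
        st->last_tsc = now;
        return nb_pkts;
}

Such a callback can be attached per port/queue with
rte_eth_add_rx_callback(), which is what a library-side enable function
could do on the application's behalf (see the sketch further down).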
>>> Thanks for your feedback, Konstantin.
>>>
>>> You are right in saying that this approach won't be 100% suitable for
>>> all use-cases, but should be suitable for the majority of applications.
>> First of all - could you explain how you measured what the
>> 'majority' of DPDK applications is?
>> And how did you conclude that it will definitely work for all the
>> apps in that 'majority'?
>> Second, what bothers me with that approach - I don't see a clear and
>> deterministic way for the user to understand whether these stats
>> would work properly for his app or not
>> (except manually analyzing his app code).
>
> All of the DPDK example applications we've tested with (l2fwd, l3fwd +
> friends, testpmd, distributor, dmafwd) report lcore poll busyness and
> respond to changing traffic rates etc. We've also compared the reported
> busyness to similar metrics reported by other projects such as VPP and
> OvS, and found the reported busyness matches with a difference of +/-
> 1%. In addition to the DPDK example applications, we have shared our
> plans with end customers and they have confirmed that the design should
> work with their applications.
I am sure l3fwd and testpmd should be OK; I am talking about
something more complicated/unusual.
Below are a few examples off the top of my head where I think your
approach will generate invalid stats; feel free to correct me if I am
wrong.
1) App doing some sort of bonding itself, i.e.:
struct rte_mbuf *pkts[N * 2];
k = rte_eth_rx_burst(p0, q0, pkts, N);
n = rte_eth_rx_burst(p1, q1, pkts + k, N);
/* process all packets from both ports at once */
if (n + k != 0)
        process_pkts(pkts, n + k);
Now, as I understand, if n==0, then all cycles spent
in process_pkts() will be accounted as idle.
2) App doing something similar to what the pdump library does
(creates a copy of a packet and sends it somewhere).
n = rte_eth_rx_burst(p0, q0, &pkt, 1);
if (n != 0) {
        dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
        if (dup_pkt != NULL)
                process_dup_pkt(dup_pkt);
        process_pkt(pkt);
}
That relates to the ring discussion below:
if there are no mbufs in dup_mp, then ring_dequeue() will fail
and process_pkt() will be accounted as idle.
3) App dequeues from a ring in a somewhat unusual way:
/* idle spin loop */
while ((n = rte_ring_count(ring)) == 0)
        rte_pause();
n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
if (n != 0)
        process_pkts(pkts, n);
Here, we can end up accounting the cycles spent in the
idle spin loop as busy.
4) Any thread that generates TX traffic on its own
(something like testpmd tx_only fwd mode) - see the sketch after this list.
5) Any thread that depends on both dequeue and enqueue:
n = rte_ring_dequeue_burst(in_ring, pkts, n, NULL);
...
/* loop till all packets are sent out successfully */
while (rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
        rte_pause();
Now, if n > 0, all cycles spent in enqueue() will be accounted
as 'busy', though from my perspective they probably should
be considered as 'idle'.
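To illustrate case 4: a tx-only worker does no rx/dequeue polling of its
own, so the only instrumented calls it hits are incidental ones (e.g. the
mempool allocation), which do not reflect how busy the thread actually
is. A rough sketch, with fill_test_packets() being an invented helper:

/* tx-only worker, roughly what testpmd's tx_only fwd mode does */
for (;;) {
        struct rte_mbuf *pkts[BURST];
        uint16_t nb_tx;

        /* packets are generated locally - no rx/dequeue poll happens */
        if (rte_pktmbuf_alloc_bulk(tx_mp, pkts, BURST) != 0)
                continue;
        fill_test_packets(pkts, BURST);
        nb_tx = rte_eth_tx_burst(port, queue, pkts, BURST);
        /* free whatever the driver did not accept */
        while (nb_tx < BURST)
                rte_pktmbuf_free(pkts[nb_tx++]);
}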
Also I expect some problems when packet processing is done inside
rx callbacks, but that probably can be easily fixed.
>
>>> It's worth keeping in mind that this feature is compile-time disabled by
>>> default, so there is no impact on any application/user that does not
>>> wish to use this, for example applications where this type of busyness
>>> is not useful, or for applications that already use other mechanisms to
>>> report similar telemetry.
>> Not sure that adding a new compile-time option disabled by default
>> is a good thing...
>> For me it would be much more preferable if we went a more 'standard'
>> way here:
>> a) define a clear API to enable/disable/collect/report such type of
>> stats (a sketch follows after this list).
>> b) use some of our sample apps to demonstrate how to use it properly
>> with user-specific code.
>> c) if needed, implement some 'silent' stats collection for a limited
>> scope of apps via callbacks -
>> let's say for run-to-completion apps that use ether and crypto devs
>> only.
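To illustrate point (a): an explicit, application-driven API could look
something like the sketch below. Every rte_lcore_poll_stats_*() name
here is invented purely for illustration - no such API exists today:

/* hypothetical, application-controlled accounting (invented names) */
rte_lcore_poll_stats_enable();

for (;;) {
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);

        if (n == 0) {
                /* the app itself decides that this was an idle poll */
                rte_lcore_poll_stats_report_idle();
                continue;
        }
        process_pkts(pkts, n);
        /* ...and that this stretch of work was busy time */
        rte_lcore_poll_stats_report_busy(n);
}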
>
> With the compile-time option, it's just one build flag for lots of
> applications to silently benefit from this.
There could be a lot of useful and helpful stats
that a user would like to collect (idle/busy, processing latency, etc.).
But if for each such case we hard-code new stats collection
into our fast data-path code, then very soon it will become
completely bloated and unmaintainable.
I think we need some generic approach for such extra stats collection.
Callbacks could be one way; Jerin in another mail suggested using the
existing trace-point hooks - it might be worth exploring that further.
>
>>> However, the upside for applications that do
>>> wish to use this is that there are no code changes required (for the
>>> most part), the feature simply needs to be enabled at compile-time via
>>> the meson option.
>>>
>>> In scenarios where contextual awareness of the application is needed in
>>> order to report more accurate "busyness", the
>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>> sections of code as "busy" or "idle". This way, the application can
>>> assume control of determining the poll busyness of its lcores while
>>> leveraging the telemetry hooks added in this patchset.
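For example, an application stage that gets its work through an
uninstrumented, app-specific path could mark its own busyness roughly
like this (the macro name comes from this patchset; app_fetch_work() and
process_items() are invented):

for (;;) {
        /* work obtained outside of the instrumented DPDK poll APIs */
        uint16_t n = app_fetch_work(items, MAX_ITEMS);

        if (n != 0)
                process_items(items, n);

        /* n == 0 marks the elapsed cycles as idle, n > 0 as busy */
        RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
}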
>>>
>>> We did initially consider implementing this via callbacks; however, we
>>> found this approach to have 2 main drawbacks:
>>> 1. Application changes are required for all applications wanting to
>>> report this telemetry - rather than the majority getting it for free.
>> Didn't get it - why would the callback approach require user-app
>> changes?
>> In other situations - rte_power callbacks, pdump, etc. - it works
>> transparently to user-level code.
>> Why can't it be done here in a similar way?
>
> From my understanding, the callbacks would need to be registered by the
> application at the very least (and the callback would have to be
> registered per device/pmd/lib).
Callbacks can be registered by the library itself.
AFAIK, the latencystats, power and pdump libraries all use a similar
approach: the user calls something like xxx_stats_enable() and then the
library can iterate over all available devices and set up the necessary
callbacks. Same for xxx_stats_disable().
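For reference, the library-side enable path could follow the same
pattern as latencystats/pdump and walk the ports itself, hooking the RX
callback from the earlier sketch; the function name below is invented:

#include <rte_ethdev.h>

/* invented name; registers poll_busyness_rx_cb from the sketch above */
static int
lcore_poll_stats_enable(void)
{
        uint16_t port, q;
        struct rte_eth_dev_info dev_info;

        RTE_ETH_FOREACH_DEV(port) {
                if (rte_eth_dev_info_get(port, &dev_info) != 0)
                        continue;
                for (q = 0; q < dev_info.nb_rx_queues; q++)
                        rte_eth_add_rx_callback(port, q,
                                        poll_busyness_rx_cb, NULL);
        }
        return 0;
}

The disable path would walk the same set of ports and call
rte_eth_remove_rx_callback() using the handles returned by
rte_eth_add_rx_callback().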
>>
>>> 2. Ring does not have callback support, meaning pipelined applications
>>> could not report lcore poll busyness telemetry with this approach.
>> That's another big concern that I have:
>> Why do you consider that all rings will be used for pipelines
>> between threads and should always be accounted by your stats?
>> They could be used for dozens of different purposes.
>> What if that ring is used for a mempool, and ring_dequeue() just
>> means we try to allocate an object from the pool?
>> In such a case, why should failing to allocate an object mean the
>> start of a new 'idle cycle'?
>
> Another approach could be taken here if the mempool interactions are of
> concern.
>
> From our understanding, mempool operations use the "_bulk" APIs,
> whereas polling operations use the "_burst" APIs. Would timestamping
> only on the "_burst" APIs be better here? That way the mempool
> interactions won't be counted towards the busyness.
Well, it would help to solve one particular case,
but in general I still think it is incomplete and error-prone.
What if a pipeline app uses ring_count()/ring_dequeue_bulk(),
or even the ZC ring API?
What if an app uses something different from rte_ring to pass
packets between threads/processes?
As I said before, without some clues from the app, it is probably
not possible to collect such stats in a proper way.
> Including support for pipelined applications using rings is key for a
> number of use cases; this was highlighted as part of the customer
> feedback when we shared the design.
>
>>
>>> Eventdev is another driver which would be completely missed with this
>>> approach.
>> Ok, I see two ways here:
>> - implement CB support for eventdev.
>> - meanwhile, clearly document that these stats are not supported for
>> eventdev scenarios (yet).