[RFC 0/2] Add high-performance timer facility
    Morten Brørup 
    mb at smartsharesystems.com
       
    Wed Mar  1 18:06:27 CET 2023
    
    
  
> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
> Sent: Wednesday, 1 March 2023 16.50
> 
> On 2023-03-01 14:31, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
> >> Sent: Wednesday, 1 March 2023 12.18
> >>
> >> On 2023-02-28 17:01, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
> >>>> Sent: Tuesday, 28 February 2023 10.39
> >>>
> >>> I have been looking for a high performance timer library (for use in
> a fast
> >> path TCP stack), and this looks very useful, Mattias.
> >>>
> >>> My initial feedback is based on quickly skimming the patch source
> code, and
> >> reading this cover letter.
> >>>
> >>>>
> >>>> This patchset is an attempt to introduce a high-performance, highly
> >>>> scalable timer facility into DPDK.
> >>>>
> >>>> More specifically, the goals for the htimer library are:
> >>>>
> >>>> * Efficient handling of a handful up to hundreds of thousands of
> >>>>     concurrent timers.
> >>>> * Reduced overhead of adding and canceling timers.
> >>>> * Provide a service functionally equivalent to that of
> >>>>     <rte_timer.h>. API/ABI backward compatibility is secondary.
> >>>>
> >>>> In the author's opinion, there are two main shortcomings with the
> >>>> current DPDK timer library (i.e., rte_timer.[ch]).
> >>>>
> >>>> One is the synchronization overhead, where heavy-weight full-
> barrier
> >>>> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
> >>>> lists, but any thread may add or cancel (or otherwise access)
> timers
> >>>> managed by another lcore (and thus resides in its timer skip list).
> >>>>
> >>>> The other is an algorithmic shortcoming, with rte_timer.c's
> reliance
> >>>> on a skip list, which, seemingly, is less efficient than certain
> >>>> alternatives.
> >>>>
> >>>> This patchset implements a hierarchical timer wheel (HWT, in
> >>>
> >>> Typo: HWT or HTW?
> >>
> >> Yes. I don't understand how I could managed to make so many such HTW
> ->
> >> HWT typos. At least I got the filenames (rte_htw.[ch]) correct.
> >>
> >>>
> >>>> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
> >>>> Hierarchical Timing Wheels: Data Structures for the Efficient
> >>>> Implementation of a Timer Facility". A HWT is a data structure
> >>>> purposely design for this task, and used by many operating system
> >>>> kernel timer facilities.
> >>>>
> >>>> To further improve the solution described by Varghese and Lauck, a
> >>>> bitset is placed in front of each of the timer wheel in the HWT,
> >>>> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing
> time
> >>>> and expiry processing).
> >>>>
> >>>> Cycle-efficient scanning and manipulation of these bitsets are
> crucial
> >>>> for the HWT's performance.
> >>>>
> >>>> The htimer module keeps a per-lcore (or per-registered EAL thread)
> HWT
> >>>> instance, much like rte_timer.c keeps a per-lcore skip list.
> >>>>
> >>>> To avoid expensive synchronization overhead for thread-local timer
> >>>> management, the HWTs are accessed only from the "owning" thread.
> Any
> >>>> interaction any other thread has with a particular lcore's timer
> >>>> wheel goes over a set of DPDK rings. A side-effect of this design
> is
> >>>> that all operations working toward a "remote" HWT must be
> >>>> asynchronous.
> >>>>
> >>>> The <rte_htimer.h> API is available only to EAL threads and
> registered
> >>>> non-EAL threads.
> >>>>
> >>>> The htimer API allows the application to supply the current time,
> >>>> useful in case it already has retrieved this for other purposes,
> >>>> saving the cost of a rdtsc instruction (or its equivalent).
> >>>>
> >>>> Relative htimer does not retrieve a new time, but reuse the current
> >>>> time (as known via/at-the-time of the manage-call), again to shave
> off
> >>>> some cycles of overhead.
> >>>
> >>> I have a comment to the two points above.
> >>>
> >>> I agree that the application should supply the current time.
> >>>
> >>> This should be the concept throughout the library. I don't
> understand why
> >> TSC is used in the library at all?
> >>>
> >>> Please use a unit-less tick, and let the application decide what one
> tick
> >> means.
> >>>
> >>
> >> I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more
> >> sense if you think of the user of the API as not just a "monolithic"
> >> application, but rather a set of different modules, developed by
> >> different organizations, and reused across a set of applications. The
> >> idea behind the API design is they should all be able to share one
> timer
> >> service instance.
> >>
> >> The different parts of the application and any future DPDK platform
> >> modules that use the htimer service needs to agree what a tick means
> in
> >> terms of actual wall-time, if it's not mandated by the API.
> >
> > I see. Then those non-monolithic applications can agree that the unit
> of time is nanoseconds, or whatever makes sense for those applications.
> And then they can instantiate one shared HTW for that purpose.
> >
> 
> <rte_htimer_mgr.h> contains nothing but shared HTWs.
> 
> > There is no need to impose such an API limit on other users of the
> library.
> >
> >>
> >> There might be room for module-specific timer wheels as well, with
> >> different resolution or other characteristics. The event timer
> adapter's
> >> use of a timer wheel could be one example (although I'm not sure it
> is).
> >
> > We are not using the event device, and I have not looked into it, so I
> have no qualified comments to this.
> >
> >>
> >> If timer-wheel-as-a-private-lego-piece is also a valid use case, then
> >> one could consider make the <rte_htw.h> API public as well. That is
> what
> >> I think you as asking for here: a generic timer wheel that doesn't
> know
> >> anything about time sources, time source time -> tick conversion, or
> >> timer source time -> monotonic wall time conversion, and maybe is
> also
> >> not bound to a particular thread.
> >
> > Yes, that is what I had been searching the Internet for.
> >
> > (I'm not sure what you mean by "not bound to a particular thread".
> Your per-thread design seems good to me.)
> >
> > I don't want more stuff in the EAL. What I want is high-performance
> DPDK libraries we can use in our applications.
> >
> >>
> >> I picked TSC because it seemed like a good "universal time unit" for
> >> DPDK. rdtsc (and its equivalent) is also a very precise (especially
> on
> >> x86) and cheap-to-retrieve (especially on ARM, from what I
> understand).
> >
> > The TSC does have excellent performance, but on all other parameters
> it is a horrible time keeper: The measurement unit depends on the
> underlying hardware, the TSC drifts depending on temperature, it cannot
> be PTP synchronized, the list is endless!
> >
> >>
> >> That said, at the moment, I'm leaning toward nanoseconds (uint64_t
> >> format) should be the default for timer expiration time instead of
> TSC.
> >> TSC could still be an option for passing the current time, since TSC
> >> will be a common time source, and it shaves off one conversion.
> >
> > There are many reasons why nanoseconds is a much better choice than
> TSC.
> >
> >>
> >>> A unit-less tick will also let the application instantiate a HTW
> with higher
> >> resolution than the TSC. (E.g. think about oversampling in audio
> processing,
> >> or Brezenham's line drawing algorithm for 2D visuals - oversampling
> can sound
> >> and look better.)
> >
> > Some of the timing data in our application have a resolution orders of
> magnitude higher than one nanosecond. If we combined that with a HTW
> library with nanosecond resolution, we would need to keep these timer
> values in two locations: The original high-res timer in our data
> structure, and the shadow low-res (nanosecond) timer in the HTW.
> >
> 
> There is no way you will meet timers with anything approaching
> pico-second-level precision.
Correct. Our sub-nanosecond timers don't need to meet the exact time, but the higher resolution prevents loss of accuracy when a number has been added to it many times. Think of it like a special fixed-point number, where the least significant part is included to ensure accuracy in calculations, while the actual timer only considers the most significant part of the number.
> You will also get into a value range issue,
> since you will wrap around a 64-bit integer in a matter of days.
Yes. We use timers with different scales for individual purposes. Our highest resolution are sub-nanosecond.
> 
> The HTW only stores the timeout in ticks, not TSC, nanoseconds or
> picoseconds.
Excellent. Then I'm happy.
> Generally, you don't want pico-second-level tick
> granularity, since it increases the overhead of advancing the wheel(s).
We currently use proprietary algorithms for our bandwidth scheduling. It seems that a HTW is not a good fit for this purpose. Perhaps you are offering a hammer, and it's not a good replacement for my screwdriver.
I suppose that nanosecond resolution suffices for a TCP stack, which is the use case I have been on the lookout for a timer library for. :-)
> The first (lowest-significance) few wheels will pretty much always be
> empty.
> 
> > We might also need to frequently update the HTW timers to prevent
> drifting away from the high-res timers. E.g. 1.2 + 1.2 is still 2 when
> rounded, but + 1.2 becomes 3 when it should have been 4 (3 * 1.2 = 3.6)
> rounded. This level of drifting would also make periodic timers in the
> HTW useless.
> >
> 
> Useless, for a certain class of applications. What application would
> that be?
Sorry about being unclear there. Yes, I only meant the specific application I was talking about, i.e. our application for high precision bandwidth management. For reference, 1 bit at 100 Gbit/s is 10 picoseconds.
> 
> > Please note: I haven't really considered merging the high-res timing
> in our application with this HTW, and I'm also not saying that PERIODIC
> timers in the HTW are required or even useful for our application. I'm
> only providing arguments for a unit-less time!
> >
> >>>
> >>> For reference (supporting my suggestion), the dynamic timestamp
> field in the
> >> rte_mbuf structure is also defined as being unit-less. (I think
> NVIDIA
> >> implements it as nanoseconds, but that's an implementation specific
> choice.)
> >>>
> >>>>
> >>>> A semantic improvement compared to the <rte_timer.h> API is that
> the
> >>>> htimer library can give a definite answer on the question if the
> timer
> >>>> expiry callback was called, after a timer has been canceled.
> >>>>
> >>>> Below is a performance data from DPDK's 'app/test' micro
> benchmarks,
> >>>> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
> >>>> test_htimer_mgr_perf.c) aren't identical in their structure, but
> the
> >>>> numbers give some indication of the difference.
> >>>>
> >>>> Use case               htimer  timer
> >>>> ------------------------------------
> >>>> Add timer                 28    253
> >>>> Cancel timer              10    412
> >>>> Async add (source lcore)  64
> >>>> Async add (target lcore)  13
> >>>>
> >>>> (AMD 5900X CPU. Time in TSC.)
> >>>>
> >>>> Prototype integration of the htimer library into real, timer-heavy,
> >>>> applications indicates that htimer may result in significant
> >>>> application-level performance gains.
> >>>>
> >>>> The bitset implementation which the HWT implementation depends upon
> >>>> seemed generic-enough and potentially useful outside the world of
> >>>> HWTs, to justify being located in the EAL.
> >>>>
> >>>> This patchset is very much an RFC, and the author is yet to form an
> >>>> opinion on many important issues.
> >>>>
> >>>> * If deemed a suitable replacement, should the htimer replace the
> >>>>     current DPDK timer library in some particular (ABI-breaking)
> >>>>     release, or should it live side-by-side with the then-legacy
> >>>>     <rte_timer.h> API? A lot of things in and outside DPDK depend
> on
> >>>>     <rte_timer.h>, so coexistence may be required to facilitate a
> smooth
> >>>>     transition.
> >>>
> >>> It's my immediate impression that they are totally different in both
> design
> >> philosophy and API.
> >>>
> >>> Personal opinion: I would call it an entirely different library.
> >>>
> >>>>
> >>>> * Should the htimer and htw-related files be colocated with
> rte_timer.c
> >>>>     in the timer library?
> >>>
> >>> Personal opinion: No. This is an entirely different library, and
> should live
> >> for itself in a directory of its own.
> >>>
> >>>>
> >>>> * Would it be useful for applications using asynchronous cancel to
> >>>>     have the option of having the timer callback run not only in
> case of
> >>>>     timer expiration, but also cancellation (on the target lcore)?
> The
> >>>>     timer cb signature would need to include an additional
> parameter in
> >>>>     that case.
> >>>
> >>> If one thread cancels something in another thread, some
> synchronization
> >> between the threads is going to be required anyway. So we could
> reprase your
> >> question: Will the burden of the otherwise required synchronization
> between
> >> the two threads be significantly reduced if the library has the
> ability to run
> >> the callback on asynchronous cancel?
> >>>
> >>
> >> Yes.
> >>
> >> Intuitively, it seems convenient that if you hand off a timer to a
> >> different lcore, the timer callback will be called exactly once,
> >> regardless if the timer was canceled or expired.
> >>
> >> But, as you indicate, you may still need synchronization to solve the
> >> resource reclamation issue.
> >>
> >>> Is such a feature mostly "Must have" or "Nice to have"?
> >>>
> >>> More thoughts in this area...
> >>>
> >>> If adding and additional callback parameter, it could be an enum, so
> the
> >> callback could be expanded to support "timeout (a.k.a. timer fired)",
> "cancel"
> >> and more events we have not yet come up with, e.g. "early kick".
> >>>
> >>
> >> Yes, or an int.
> >>
> >>> Here's an idea off the top of my head: An additional callback
> parameter has
> >> a (small) performance cost incurred with every timer fired (which is
> a very
> >> large multiplier). It might not be required. As an alternative to an
> "what
> >> happened" parameter to the callback, the callback could investigate
> the state
> >> of the object for which the timer fired, and draw its own conclusion
> on how to
> >> proceed. Obviously, this also has a performance cost, but perhaps the
> callback
> >> works on the object's state anyway, making this cost insignificant.
> >>>
> >>
> >> It's not obvious to me that you, in the timer callback, can determine
> >> what happened, if the same callback is called both in the cancel and
> the
> >> expired case.
> >>
> >> The cost of an extra integer passed in a register (or checking a
> flag,
> >> if the timer callback should be called at all at cancellation) that
> is
> >> the concern for me; it's extra bit of API complexity.
> >
> > Then introduce the library without this feature. More features can be
> added later.
> >
> > The library will be introduced as "experimental", so we are free to
> improve it and modify the ABI along the way.
> >
> >>
> >>> Here's another alternative to adding a "what happened" parameter to
> the
> >> callback:
> >>>
> >>> The rte_htimer could have one more callback pointer, which (if set)
> will be
> >> called on cancellation of the timer.
> >>>
> >>
> >> This will grow the timer struct with 16 bytes.
> >
> > If the rte_htimer struct stays within one cache line, it should be
> acceptable.
> >
> 
> Timer structs are often embedded in other structures, and need not
> themselves be cache line aligned (although the "parent" struct may need
> to be, e.g. if it's dynamically allocated).
> 
> So smaller is better. Just consider if you want your attosecond-level
> time stamp in a struct:
> 
> struct my_timer {
>      uint64_t high_precision_time_high_bits;
>      uint64_t high_precision_time_low_bits;
>      struct rte_htimer timer;
> };
> 
> ...and you allocate those structs from a mempool. If rte_htimer is small
> enough, you will fit on one cache line.
Ahh... I somehow assumed they only existed as stand-alone elements inside the HTW.
Then I obviously agree that shorter is better.
> 
> > On the other hand, this approach is less generic than passing an
> additional parameter. (E.g. add yet another callback pointer for "early
> kick"?)
> >
> > BTW, async cancel is a form of inter-thread communication. Does this
> library really need to provide any inter-thread communication
> mechanisms? Doesn't an inter-thread communication mechanism belong in a
> separate library?
> >
> 
> Yes, <rte_htimer_mgr.h> needs this because:
> 1) Being able to schedule timers on a remote lcore is a useful feature
> (especially since we don't have much else in terms of deferred work
> mechanisms in DPDK).
Although remote procedures is a useful feature, providing such a feature doesn't necessarily belong in a library that uses remote procedures.
> 2) htimer aspires to be a plug-in replacement for <rte_timer.h> (albeit
> an ABI-breaking one).
This is a good argument.
But I would much rather have a highly tuned stand-alone HTW library than a plug-in replacement of the old <rte_timer.h>.
> 
> The pure HTW is in rte_htw.[ch].
> 
> Plus, with the current design, async operations basically come for free
> (if you don't use them), from a performance perspective. The extra
> overhead boils down to occasionally polling an empty ring, which is an
> inexpensive operation.
OK. Then no worries.
> 
> >>
> >>>>
> >>>> * Should the rte_htimer be a nested struct, so the htw parts be
> separated
> >>>>     from the htimer parts?
> >>>>
> >>>> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
> >>>>     <rte_htw.h> may avoid a depedency to <rte_htimer_mgr.h>. Should
> it
> >>>>     be so?
> >>>>
> >>>> * rte_htimer struct is only supposed to be used by the application
> to
> >>>>     give an indication of how much memory it needs to allocate, and
> is
> >>>>     its member are not supposed to be directly accessed (w/ the
> possible
> >>>>     exception of the owner_lcore_id field). Should there be a dummy
> >>>>     struct, or a #define RTE_HTIMER_MEMSIZE or a
> rte_htimer_get_memsize()
> >>>>     function instead, serving the same purpose? Better
> encapsulation,
> >>>>     but more inconvenient for applications. Run-time dynamic sizing
> >>>>     would force application-level dynamic allocations.
> >>>>
> >>>> * Asynchronous cancellation is a little tricky to use for the
> >>>>     application (primarily due to timer memory reclamation/race
> >>>>     issues). Should this functionality be removed?
> >>>>
> >>>> * Should rte_htimer_mgr_init() also retrieve the current time? If
> so,
> >>>>     there should to be a variant which allows the user to specify
> the
> >>>>     time (to match rte_htimer_mgr_manage_time()). One pitfall with
> the
> >>>>     current proposed API is an application calling
> rte_htimer_mgr_init()
> >>>>     and then immediately adding a timer with a relative timeout, in
> >>>>     which case the current absolute time used is 0, which might be
> a
> >>>>     surprise.
> >>>>
> >>>> * Should libdivide (optionally) be used to avoid the div in the TSC
> ->
> >>>>     tick conversion? (Doesn't improve performance on Zen 3, but may
> >>>>     do on other CPUs.) Consider <rte_reciprocal.h> as well.
> >>>>
> >>>> * Should the TSC-per-tick be rounded up to a power of 2, so shifts
> can be
> >>>>     used for conversion? Very minor performance gains to be found
> there,
> >>>>     at least on Zen 3 cores.
> >>>>
> >>>> * Should it be possible to supply the time in rte_htimer_mgr_add()
> >>>>     and/or rte_htimer_mgr_manage_time() functions as ticks, rather
> than
> >>>>     as TSC? Should it be possible to also use nanoseconds?
> >>>>     rte_htimer_mgr_manage_time() would need a flags parameter in
> that
> >>>>     case.
> >>>
> >>> Do not use TSC anywhere in this library. Let the application decide
> the
> >> meaning of a tick.
> >>>
> >>>>
> >>>> * Would the event timer adapter be best off using <rte_htw.h>
> >>>>     directly, or <rte_htimer.h>? In the latter case, there needs to
> be a
> >>>>     way to instantiate more HWTs (similar to the "alt" functions of
> >>>>     <rte_timer.h>)?
> >>>>
> >>>> * Should the PERIODICAL flag (and the complexity it brings) be
> >>>>     removed? And leave the application with only single-shot
> timers, and
> >>>>     the option to re-add them in the timer callback.
> >>>
> >>> First thought: Yes, keep it lean and remove the periodical stuff.
> >>>
> >>> Second thought: This needs a more detailed analysis.
> >>>
> >>>   From one angle:
> >>>
> >>> How many PERIODICAL versus ONESHOT timers do we expect?
> >>>
> >>
> >> I suspect you should be prepared for the ratio being anything.
> >
> > In theory, anything is possible. But I'm asking that we consider
> realistic use cases.
> >
> >>
> >>> Intuitively, I would use this library for ONESHOT timers, and
> perhaps
> >> implement my periodical timers by other means.
> >>>
> >>> If the PERIODICAL:ONESHOT ratio is low, we can probably live with
> the extra
> >> cost of cancel+add for a few periodical timers.
> >>>
> >>>   From another angle:
> >>>
> >>> What is the performance gain with the PERIODICAL flag?
> >>>
> >>
> >> None, pretty much. It's just there for convenience.
> >
> > OK, then I suggest that you remove it, unless you get objections.
> >
> > The library can be expanded with useful features at any time later.
> Useless features are (nearly) impossible to remove, once they are in
> there - they are just "technical debt" with associated maintenance
> costs, added complexity weaving into other features, etc..
> >
> >>
> >>> Without a periodical timer, cancel+add costs 10+28 cycles. How many
> cycles
> >> would a "move" function, performing both cancel and add, use?
> >>>
> >>> And then compare that to the cost (in cycles) of repeating a timer
> with
> >> PERIODICAL?
> >>>
> >>> Furthermore, not having the PERIODICAL flag probably improves the
> >> performance for non-periodical timers. How many cycles could we gain
> here?
> >>>
> >>>
> >>> Another, vaguely related, idea:
> >>>
> >>> The callback pointer might not need to be stored per rte_htimer, but
> could
> >> instead be common for the rte_htw.
> >>>
> >>
> >> Do you mean rte_htw, or rte_htimer_mgr?
> >>
> >> If you make one common callback, all the different parts of the
> >> application needs to be coordinated (in a big switch-statement, or
> >> something of that sort), or have some convention for using an
> >> application-specific wrapper structure (accessed via container_of()).
> >>
> >> This is a problem if the timer service API consumer is a set of
> largely
> >> uncoordinated software modules.
> >>
> >> Btw, the eventdev API has the same issue, and the proposed event
> >> dispatcher is one way to help facilitate application-internal
> decoupling.
> >>
> >> For a module-private rte_htw instance your suggestion may work, but
> not
> >> for <rte_htimer_mgr.h>.
> >
> > I was speculating that a common callback pointer might provide a
> performance benefit for single-purpose HTW instances. (The same concept
> applies if there are multiple callbacks, e.g. a "Timer Fired", a "Timer
> Cancelled", and an "Early Kick" callback pointer - i.e. having the
> callback pointers per HTW instance, instead of per timer.)
> >
> >>
> >>> When a timer fires, the callback probably needs to check/update the
> state of
> >> the object for which the timer fired anyway, so why not just let the
> >> application use that state to determine the appropriate action. This
> might
> >> provide some performance benefit.
> >>>
> >>> It might complicate using one HTW for multiple different purposes,
> though.
> >> Probably a useless idea, but I wanted to share the idea anyway. It
> might
> >> trigger other, better ideas in the community.
> >>>
> >>>>
> >>>> * Should the async result codes and the sync cancel error codes be
> merged
> >>>>     into one set of result codes?
> >>>>
> >>>> * Should the rte_htimer_mgr_async_add() have a flag which allow
> >>>>     buffering add request messages until rte_htimer_mgr_process()
> is
> >>>>     called? Or any manage function. Would reduce ring signaling
> overhead
> >>>>     (i.e., burst enqueue operations instead of single-element
> >>>>     enqueue). Could also be a rte_htimer_mgr_async_add_burst()
> function,
> >>>>     solving the same "problem" a different way. (The signature of
> such
> >>>>     a function would not be pretty.)
> >>>>
> >>>> * Does the functionality provided by the rte_htimer_mgr_process()
> >>>>     function match its the use cases? Should there me a more clear
> >>>>     separation between expiry processing and asynchronous operation
> >>>>     processing?
> >>>>
> >>>> * Should the patchset be split into more commits? If so, how?
> >>>>
> >>>> Thanks to Erik Carrillo for his assistance.
> >>>>
> >>>> Mattias Rönnblom (2):
> >>>>     eal: add bitset type
> >>>>     eal: add high-performance timer facility
> >
    
    
More information about the dev
mailing list