[dpdk-dev] [RFC 0/3] tqs: add thread quiescent state library

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Sat Dec 1 19:37:02 CET 2018

> On Fri, 30 Nov 2018 21:56:30 +0100
> Mattias Rönnblom <mattias.ronnblom at ericsson.com> wrote:
> > On 2018-11-30 03:13, Honnappa Nagarahalli wrote:
> > >>
> > >> Reinventing RCU is not helping anyone.
> > > IMO, this depends on what the rte_tqs has to offer and what the
> requirements are. Before starting this patch, I looked at the liburcu APIs. I
> have to say, fairly quickly (no offense) I concluded that this does not address
> DPDK's needs. I took a deeper look at the APIs/code in the past day and I still
> concluded the same. My partial analysis (analysis of more APIs can be done, I
> do not have cycles at this point) is as follows:
> > >
> > > The reader threads' information is maintained in a linked list[1]. This
> linked list is protected by a mutex lock[2]. Any additions/deletions/traversals
> of this list are blocking and cannot happen in parallel.
> > >
> > > The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call)
> is a blocking call. There is no option provided to make it non-blocking. The
> writer spins cycles while waiting for the grace period to get over.
> > >
> >
> > Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
> > which never?
call_rcu (I do not know about defer_rcu; have you looked at the implementation to verify your claim?) requires a separate thread that does garbage collection (this forces a programming model; the thread is even launched by the library). call_rcu() allows you to batch the work and defer it to the garbage collector thread. In the garbage collector thread, when 'synchronize_rcu' is called, it spins for at least one grace period. Deferring and batching also have the side effect that memory is held up for a longer time.

> > Why would the average application want to wait for the
> > grace period to be over anyway?
I assume when you say 'average application', you mean the writer(s) are on the control plane.
It has been agreed (in the context of rte_hash) that writer(s) can be on the data plane. In this case, 'synchronize_rcu' cannot be called from the data plane. If call_rcu has to be called instead, it adds additional cycles on the data plane to push the pointers (or any data) to the garbage collector thread. I kindly suggest you take a look at the liburcu code and the rte_tqs code for more details.
Additionally, the call_rcu function is more than 10 lines.
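To make that cost concrete, here is a minimal, self-contained sketch in plain C11 (this is not liburcu's call_rcu and not the rte_tqs code; all names are made up for illustration) of what deferring frees to a garbage-collector thread implies: every deferred free costs the data-plane writer a ring enqueue, and the retired memory stays allocated until the collector drains the ring.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define RETIRE_RING 256  /* power of two */

/* Illustrative sketch: a data-plane writer defers frees by pushing
 * retired pointers to a garbage-collector thread through a
 * single-producer/single-consumer ring. The pointers remain allocated
 * until the collector runs, which is the "memory held up for a longer
 * time" side effect described above. */
struct retire_ring {
    void *slot[RETIRE_RING];
    atomic_size_t head;  /* written only by the producer (writer) */
    atomic_size_t tail;  /* written only by the consumer (GC thread) */
};

/* Writer side: extra per-delete work on the data plane. */
static int retire(struct retire_ring *r, void *p)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (h - t == RETIRE_RING)
        return -1;  /* ring full: the writer has to stall or drop */
    r->slot[h & (RETIRE_RING - 1)] = p;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 0;
}

/* GC thread: free everything retired so far (after a grace period
 * has elapsed). Returns the number of pointers reclaimed. */
static size_t reclaim(struct retire_ring *r)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t n = 0;

    for (; t < h; t++, n++)
        free(r->slot[t & (RETIRE_RING - 1)]);
    atomic_store_explicit(&r->tail, t, memory_order_release);
    return n;
}
```

Note that even in this stripped-down form the writer pays an enqueue per retired pointer, and nothing is freed until the collector thread gets scheduled.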

> >
> > > 'synchronize_rcu' also has grace period lock [4]. If I have multiple writers
> running on data plane threads, I cannot call this API to reclaim the memory in
> the worker threads as it will block other worker threads. This means, there is
> an extra thread required (on the control plane?) which does garbage
> collection and a method to push the pointers from worker threads to the
> garbage collection thread. This also means the time duration from delete to
> free increases putting pressure on amount of memory held up.
> > > Since this API cannot be called concurrently by multiple writers, each
> writer has to wait for other writer's grace period to get over (i.e. multiple
> writer threads cannot overlap their grace periods).
> >
> > "Real" DPDK applications typically have to interact with the outside
> > world using interfaces beyond DPDK packet I/O, and this is best done
> > via an intermediate "control plane" thread running in the DPDK application.
> > Typically, this thread would also be the RCU writer and "garbage
> > collector", I would say.
> >
Agreed, that is one way to do it, but it comes with its own issues, as I described above.

> > >
> > > This API also has to traverse the linked list which is not very well suited for
> calling on data plane.
> > >
> > > I have not gone too much into rcu_thread_offline[5] API. This again needs
> to be used in worker cores and does not look to be very optimal.
> > >
> > > I have glanced at rcu_quiescent_state [6], it wakes up the thread calling
> 'synchronize_rcu', which seems like a good amount of code for the data plane.
> > >
> >
> > Wouldn't the typical DPDK lcore worker call rcu_quiescent_state()
> > after processing a burst of packets? If so, I would more lean toward
> > "negligible overhead", than "a good amount of code".
DPDK is also used in embedded and real-time applications. There, processing a burst of packets is not possible due to low-latency requirements, hence the cost cannot be amortized over a burst.

> >
> > I must admit I didn't look at your library in detail, but I must still
> > ask: if TQS is basically RCU, why isn't it called RCU? And why isn't
> > the API calls named in a similar manner?
I kindly request you to take a look at the patch. More than that, if you have not done so already, please take a look at the liburcu implementation as well.
TQS is not RCU (Read-Copy-Update). TQS helps implement RCU: it helps determine when threads have passed through a quiescent state.
I am also not sure why the name liburcu has RCU in it. It does not do any Read-Copy-Update.
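For illustration, here is a minimal sketch of the quiescent-state idea in plain C11 atomics (this is not the rte_tqs patch; the names tqs_start/tqs_update/tqs_check are made up here): readers periodically publish a counter, and the writer compares it against a token to detect, without blocking, that every reader has passed through a quiescent state since the token was issued.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 8

/* Per-reader counter, published when the reader passes a quiescent
 * state; global token incremented by the writer per grace period. */
static atomic_uint_fast64_t qs_cnt[MAX_THREADS];
static atomic_uint_fast64_t token;

/* Writer: start a new grace period; returns the token to wait on. */
static uint64_t tqs_start(void)
{
    return atomic_fetch_add(&token, 1) + 1;
}

/* Reader: report a quiescent state, e.g. once per poll-loop
 * iteration. This is the entire per-iteration cost: one load and
 * one release store. */
static void tqs_update(unsigned tid)
{
    atomic_store_explicit(&qs_cnt[tid], atomic_load(&token),
                          memory_order_release);
}

/* Writer: non-blocking check whether all readers have passed a
 * quiescent state since tqs_start() returned 't'. The writer can do
 * other work between calls instead of spinning. */
static bool tqs_check(uint64_t t, unsigned nthreads)
{
    for (unsigned i = 0; i < nthreads; i++)
        if (atomic_load_explicit(&qs_cnt[i], memory_order_acquire) < t)
            return false;
    return true;
}
```

The key property this sketch tries to show is that the check is non-blocking: a data-plane writer can call tqs_check() opportunistically and only free memory once it returns true, with no grace-period lock and no dedicated collector thread.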

> We used liburcu at Brocade with DPDK. It was just a case of putting
> rcu_quiescent_state in the packet handling
> loop. There were a bunch more cases where control thread needed to
> register/unregister as part of RCU.
I assume that the packet handling loop was a polling loop (correct me if I am wrong). With event dev support, we have the rte_event_dequeue_burst API, which can block until packets are available (or block for an extended period of time). This means that, before calling this API, the thread needs to announce "don't worry about me", and once the API returns, it needs to announce "worry about me" again. So, these two APIs need to be efficient. Please look at the rte_tqs_register/unregister APIs.
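The shape of those two calls can be sketched as follows (plain C11, illustrative only; rte_tqs_register/unregister in the patch, or liburcu's rcu_thread_online/offline, play this role): the reader clears a flag before blocking so the writer's check skips it, and sets the flag again when it resumes polling.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 8

static atomic_bool rd_online[MAX_READERS];

/* "Don't worry about me": called before a potentially blocking call
 * such as a dequeue with a timeout, so the writer does not wait for
 * this thread. A single release store. */
static void reader_offline(unsigned tid)
{
    atomic_store_explicit(&rd_online[tid], false, memory_order_release);
}

/* "Worry about me": called when the blocking call returns; the fence
 * orders the flag update before any subsequent reads of shared data. */
static void reader_online(unsigned tid)
{
    atomic_store_explicit(&rd_online[tid], true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* Writer-side check: skip offline readers, so a reader blocked in a
 * dequeue call cannot stall grace-period detection. 'passed[i]' is
 * whether reader i has reported a quiescent state. */
static bool all_quiescent_or_offline(const bool passed[], unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        if (atomic_load_explicit(&rd_online[i], memory_order_acquire)
            && !passed[i])
            return false;
    return true;
}
```

Since these calls sit directly around the dequeue in the fast path, they have to be this cheap; anything heavier gets paid on every blocking wait.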

> I think any library would have that issue with user supplied threads.  You need
> a "worry about me" and
> a "don't worry about me" API in the library.
> There is also a tradeoff between call_rcu and defer_rcu about what context
> the RCU callback happens.
> You really need a control thread to handle the RCU cleanup.
That is if you choose to use liburcu. rte_tqs provides the ability to do cleanup efficiently without the need for a control plane thread in DPDK use cases.

> The point is that RCU steps into the application design, and liburcu seems to
> be flexible enough
> and well documented enough to allow for more options.
Agreed that RCU steps into application design. That is the reason rte_tqs does just enough and leaves the application the flexibility to implement RCU however it sees fit. DPDK has also stepped into application design by providing libraries like hash, LPM, etc.

I do not understand why you think liburcu is flexible enough for DPDK use cases. I mentioned the specific use cases where liburcu is not useful, and I did not find anything in its documentation that addresses them. I would appreciate it if you could help me understand how I can use liburcu to solve these use cases.
