[dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

Ananyev, Konstantin konstantin.ananyev at intel.com
Mon Dec 17 14:14:38 CET 2018



> >
> > > > >
> > > > > > > +
> > > > > > > +/* Add a reader thread, running on an lcore, to the list of
> > > > > > > +threads
> > > > > > > + * reporting their quiescent state on a TQS variable.
> > > > > > > + */
> > > > > > > +int __rte_experimental
> > > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > > > > > > +				-EINVAL);
> > > > > >
> > > > > > It is not very good practice to make function return different
> > > > > > values and behave in a different way in debug/non-debug mode.
> > > > > > I'd say that for slow-path (functions in .c) it is always good
> > > > > > to check input parameters.
> > > > > > For fast-path (functions in .h) we sometimes skip such checking,
> > > > > > but debug mode can probably use RTE_ASSERT() or so.
> > > > > Makes sense, I will change this in the next version.
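
Just to illustrate what I meant above - a rough sketch of the split I have in mind
(RTE_ASSERT() is the existing macro from rte_debug.h, active only when asserts are
compiled in; the rest of the bodies is elided):

/* slow-path (.c): always validate input and return an error code */
int __rte_experimental
rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
{
	if (v == NULL || lcore_id >= RTE_TQS_MAX_LCORE)
		return -EINVAL;
	...
}

/* fast-path (.h): no cost in release builds, catches misuse in debug builds */
static __rte_always_inline void
rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
{
	RTE_ASSERT(v != NULL && lcore_id < RTE_TQS_MAX_LCORE);
	...
}
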
> > > > >
> > > > > >
> > > > > >
> > > > > > lcore_id >= RTE_TQS_MAX_LCORE
> > > > > >
> > > > > > Is this limitation really necessary?
> > > > > I added this limitation because currently a DPDK application cannot
> > > > > take a mask more than 64 bits wide. Otherwise, this should be as big
> > > > > as RTE_MAX_LCORE.
> > > > > I see that in the case of the '-lcores' option, the number of lcores
> > > > > can be more than the number of PEs. In this case, we still need a
> > > > > MAX limit (but it can be bigger than 64).
> > > > >
> > > > > > First it means that only lcores can use that API (at least
> > > > > > data-path part), second even today many machines have more
> > > > > > than 64 cores.
> > > > > > I think you can easily avoid such limitation, if instead of
> > > > > > requiring lcore_id as input parameter, you'll just make it
> > > > > > return index of next available entry in w[].
> > > > > > Then tqs_update() can take that index as input parameter.
> > > > > > Then tqs_update() can take that index as input parameter.
> > > > > I had thought about a similar approach based on IDs. I was
> > > > > concerned that ID will be one more thing to manage for the
> > > > > application. But, I see the limitations of the current approach
> > > > > now. I will change it to allocation based.
> > > > > This will support even non-EAL pthreads as well.
> > > >
> > > > Yes, with such approach non-lcore threads will be able to use it also.
> > > >
> > > I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need
> > > to be efficient as they can be called from the worker's packet
> > > processing loop (rte_event_dequeue_burst allows blocking. So, the
> > > worker thread needs to call rte_tqs_unregister_lcore before calling
> > > rte_event_dequeue_burst and rte_tqs_register_lcore before starting
> > > packet processing). Allocating the thread ID in these functions will
> > > make them more complex.
> > >
> > > I suggest that we change the argument 'lcore_id' to 'thread_id'. The
> > > application could use 'lcore_id' as 'thread_id' if threads are mapped
> > > to physical cores 1:1.
> > >
> > > If the threads are not mapped 1:1 to physical cores, the threads need
> > > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not
> > > see that DPDK has a thread_id concept. For TQS, the thread IDs are
> > > global (i.e. not per TQS variable). I could provide APIs to do the
> > > thread ID allocation, but I think the thread ID allocation should not
> > > be part of this library. Such thread ID might be useful for other
> > > libraries.
> >
> > I don't think there is any point to introduce new thread_id concept just for
> > that library.
> Currently, we have the rte_gettid API. It is being used by rte_spinlock. However, the thread ID returned there is the thread ID as defined by the OS.
> The rte_spinlock APIs do not care who defines the thread ID as long as those IDs are unique per thread. I think, if we have a thread_id concept
> that covers non-EAL threads as well, it might help other libraries too. For ex: [1] talks about the limitation of the per-lcore cache. I think this
> limitation can be removed easily if we could have a thread_id that is in a small, well-defined space (rather than the OS-defined thread ID, which
> may be an arbitrary number). I see similar issues mentioned for rte_timer.

If we just introduce a new ID (let's name it thread_id), then we merely replace one limitation with another:
it would still be local_cache[], just indexed by some thread_id instead of the current lcore_id.
I don't see how that would be better than what we have now.
To let any arbitrary thread use the mempool's cache we need something smarter
than just a local_cache[] entry per ID, but without loss of performance.

> It might be useful in the dynamic threads Bruce talked about at the Dublin summit (I am not sure on this one, just speculating).

That's probably about making lcore_id allocation/freeing dynamic.
> 
> [1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label
> 
> > After all we already have a concept of lcore_id which pretty much serves the
> > same purpose.
> > I still think that we need to either:
> > a) make register/unregister to work with any valid lcore_id (<=
> > RTE_MAX_LCORE)
> I have made this change already, it will be there in the next version.

Ok.

> 
> > b) make register/unregister to return index in w[]
> >
> > For a) will need mask bigger than 64bits.
> > b)  would allow to use data-path API by non-lcores threads too, plus w[]
> > would occupy less space, and check() might be faster.
> > Though yes, as a drawback, for b) register/unregister probably would need
> > extra 'while(CAS(...));' loop.
> Along with the CAS, we also need to search for available index in the array.

Sure, but I thought that one is relatively cheap compared to the CAS itself
(probably not, as the cache line with the data will be shared between cores).
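
Just to make the idea concrete, a rough sketch of what an allocation based
register() could look like (the 'in_use' field and RTE_TQS_MAX_THREADS below
are illustrative names, not something from the patch; headers omitted):

/* claim a free entry in w[] and return its index, or -ENOSPC */
static int
rte_tqs_register(struct rte_tqs *v)
{
	uint32_t i;

	for (i = 0; i < RTE_TQS_MAX_THREADS; i++) {
		uint32_t expected = 0;

		/* CAS 0 -> 1 succeeds only if the slot was free */
		if (__atomic_compare_exchange_n(&v->w[i].in_use, &expected, 1,
				0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return i; /* pass this index to update()/unregister() */
	}
	return -ENOSPC;
}

unregister() would then just be an __atomic_store_n(&v->w[i].in_use, 0, __ATOMIC_RELEASE).
The search adds some cost, but there is no retry loop on the same slot -
on CAS failure we simply move on to the next entry.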

> 
> > I suppose the question here is: do you foresee a lot of concurrent
> > register/unregister calls at data-path?
> IMO, yes, because of the event dev API being blocking.
> We can solve this by providing separate APIs for allocation/freeing of the IDs. I am just questioning where these APIs should be.
> 
> >
> > >
> > > <snip>
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +		while (lcore_mask) {
> > > > > > > +			l = __builtin_ctz(lcore_mask);
> > > > > > > +			if (v->w[l].cnt != t)
> > > > > > > +				break;
> > > > > >
> > > > > > As I understand, that makes control-path function progress
> > > > > > dependent on simultaneous invocation of data-path functions.
> > > > > I agree that the control-path function progress (for ex: how long
> > > > > to wait for freeing the memory) depends on invocation of the
> > > > > data-path functions. The separation of 'start', 'check' and the
> > > > > option not to block in 'check' provide the flexibility for
> > > > > control-path to do some other work if it chooses to.
> > > > >
> > > > > > In some cases that might cause control-path to hang.
> > > > > > Let say if data-path function wouldn't be called, or user
> > > > > > invokes control-path and data-path functions from the same thread.
> > > > > I agree with the case of data-path function not getting called. I
> > > > > would consider that as programming error. I can document that
> > > > > warning in the rte_tqs_check API.
> > > >
> > > > Sure, it can be documented.
> > > > Though that means, that each data-path thread would have to do
> > > > explicit
> > > > update() call for every tqs it might use.
> > > > I just think that it would complicate things and might limit usage
> > > > of the library quite significantly.
> > > Each data path thread has to report its quiescent state. Hence, each
> > > data-path thread has to call update() (similar to how
> > > rte_timer_manage() has to be called periodically on the worker thread).
> >
> > I understand that.
> > Though that means that each data-path thread has to know explicitly what rcu
> > vars it accesses.
> Yes. That is correct. It is both good and bad. It provides the flexibility to reduce the overhead. For ex: in pipeline mode, it may be that a
> particular data structure is accessed only by some of the threads in the application. In this case, this library allows for per-data-structure
> vars, which reduces the overhead. This applies to service cores as well.
> 
> > Would be hard to adopt such API with rcu vars used inside some library.
> > But ok, as I understand people do use QSBR approach in their apps and find it
> > useful.
> It can be adopted in the library with different levels of assumptions/constraints.
> 1) With the assumption that the data plane threads will update the quiescent state. For ex: for rte_hash library we could ask the user to pass
> the TQS variable as input and rte_hash writer APIs can call rte_tqs_start and rte_tqs_check APIs.
> 2) If the assumption in 1) is not good, updating of the quiescent state can be hidden in the library, but again with the assumption that the
> data plane library API is called on a regular basis. For ex: the rte_tqs_update can be called within rte_hash_lookup API.
> 3) If we do not want to assume that the data plane API will be called on a regular basis, then the rte_tqs_register/unregister APIs need to be
> used before and after entering the critical section along with calling rte_tqs_update API. For ex: rte_hash_lookup should have the sequence
> rte_tqs_register, <critical section>, rte_tqs_unregister, rte_tqs_update. (very similar to GP)

#3 is surely possible, but it seems quite expensive.
Anyway, as I said before, people do use the QSBR approach -
it has a small overhead for readers and is relatively straightforward.
So let's start with that one, and keep the ability to extend the lib
with new methods in the future.
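
To make sure we are talking about the same usage model, below is roughly how
I read the intended QSBR flow with the current API (argument lists are
simplified, and the rte_hash glue plus the 'entry' bookkeeping are hypothetical -
it is just an illustration, not a proposal):

/* writer side: logical delete, then reclaim after a grace period */
rte_hash_del_key(h, key);
t = rte_tqs_start(v);           /* snapshot state of registered readers   */
/* ... writer can do other useful work here instead of blocking ...       */
rte_tqs_check(v, t, 1);         /* wait == 1: block till all reported QS  */
free(entry);                    /* now no reader can still hold a ref     */

/* reader side: worker loop reports quiescent state once per iteration */
while (!quit) {
	nb = rte_eth_rx_burst(port, queue, pkts, BURST);
	/* lookups / packet processing, no references kept across iterations */
	rte_tqs_update(v, lcore_id);
}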

> 
> >
> > > Do you have any particular use case in mind where this fails?
> >
> > Let say it means that library can't be used to add/del RX/TX ethdev callbacks
> > in a safe manner.
> I need to understand this better. I will look at rte_ethdev library.

Ok, you can also have a look at: lib/librte_bpf/bpf_pkt.c
to check how we overcome it now.

> 
> >
> > BTW, two side questions:
> > 1) As I understand what you propose is very similar to QSBR main concept.
> > Wouldn't it be better to name it accordingly to avoid confusion (or at least
> > document it somewhere).
> > I think someone else already raised that question.
> QSBR stands for Quiescent State Based Reclamation. This library already has 'Thread Quiescent State' in the name. Others have
> questioned/suggested why not use RCU instead. I called it thread quiescent state as this library just helps determine if all the readers have
> entered the quiescent state. It does not do anything else.
> 
> However, you are also bringing up an important point: 'will we add other methods of memory reclamation'? With that in mind, maybe we
> should not call it RCU. But maybe call it rte_rcu_qsbr_xxx? That will also future-proof the API in case we want to add additional RCU types.

Yep, that sounds like a good approach to me.

> 
> > 2) Would QSBR be the only technique in that lib?
> > Any plans to add something similar to GP one too (with MBs at reader-side)?
> I believe, by GP, you mean general-purpose RCU.

Yes.

> In my understanding QSBR is the one with the least overhead. For DPDK applications, I think
> reducing that overhead is important. GP adds additional overhead on the reader side.

Yes, but it provides better flexibility.
> I did not see a need to add any additional ones as of now.

Take, for example, your #3 solution above.
I think GP will be cheaper than register/unregister for each library call.

> But if there are use cases that cannot be achieved with the proposed APIs, we can definitely expand it.
> 
> >
> > >
> > > >
> > > > >
> > > > > In the case of same thread calling both control-path and data-path
> > > > > functions, it would depend on the sequence of the calls. The
> > > > > following sequence should not cause any hangs:
> > > > > Worker thread
> > > > > 1) 'deletes' an entry from a lock-free data structure
> > > > > 2) rte_tqs_start
> > > > > 3) rte_tqs_update
> > > > > 4) rte_tqs_check (wait == 1 or wait == 0)
> > > > > 5) 'free' the entry deleted in 1)
> > > >
> > > > That's an interesting idea, and that should help, I think.
> > > > Probably worth having the {2,3,4} sequence as a new high-level function.
> > > >
> > > Yes, this is a good idea. Such a function would be applicable only in
> > > the worker thread. I would prefer to leave it to the application to take care.
> >
> > Yes, it would be applicable only to worker threads, but why can't we have a
> > function for it?
> > Let's say it could be 2 different functions: one doing {2,3,4} - for worker threads,
> > and a second doing just {2,4} - for control threads.
> > Or it could be just one function that takes an extra parameter: lcore_id/w[] index.
> > If it is some predefined invalid value (-1 or so), step #3 will be skipped.
> The rte_tqs_start and rte_tqs_check are separated into 2 APIs so that the writers do not have to spend CPU/memory cycles polling for the
> readers' quiescent state. In the context of DPDK, this overhead will be significant (at least equal to the length of 1 while loop on the worker
> core). This is one of the key features of this library. Combining 2, [3], 4 will defeat this purpose. For ex: in the rte_hash library, whenever a
> writer on the data path calls rte_hash_add (with 2,3,4 combined), it will wait for the rest of the readers to enter the quiescent state, i.e. the
> performance will come down whenever rte_hash_add is called.

I am not suggesting to replace start+[update+]check with one mega-function.
No problem with the currently defined API.
I am talking about an additional function for users where performance is not the main concern -
they just need a function that does things in the proper way for them.
I think having such an extra function will simplify their life;
they can also use it as a reference to understand the proper sequence of calls they need to make on their own.
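
Something along these lines (the name rte_tqs_synchronize is made up, and the
start()/check() argument lists are simplified - it just follows your {2,3,4}
sequence above, with the invalid-id trick to cover control threads too):

/* id == (unsigned int)-1: caller is a control thread, skip the update step */
static inline void
rte_tqs_synchronize(struct rte_tqs *v, unsigned int id)
{
	uint32_t t;

	t = rte_tqs_start(v);           /* 2) snapshot reader state            */
	if (id != (unsigned int)-1)
		rte_tqs_update(v, id);  /* 3) report our own quiescent state   */
	rte_tqs_check(v, t, 1);         /* 4) wait till all readers reported   */
}

A worker thread would call it with its own id, a control thread with -1,
and in both cases it is safe to free the deleted entry right after it returns.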


