[dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Thu Dec 13 08:39:30 CET 2018


> 
> 
> > > >
> > > > > > +
> > > > > > +/* Add a reader thread, running on an lcore, to the list of
> > > > > > +threads
> > > > > > + * reporting their quiescent state on a TQS variable.
> > > > > > + */
> > > > > > +int __rte_experimental
> > > > > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > > > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
> > > > > RTE_TQS_MAX_LCORE),
> > > > > > +				-EINVAL);
> > > > >
> > > > > It is not very good practice to make function return different
> > > > > values and behave in a different way in debug/non-debug mode.
> > > > > I'd say that for slow-path (functions in .c) it is always good
> > > > > to check input parameters.
> > > > > For fast-path (functions in .h) we sometimes skip such checking,
> > > > > but debug mode can probably use RTE_ASSERT() or so.
> > > > Makes sense, I will change this in the next version.
> > > >
> > > > >
> > > > >
> > > > > lcore_id >= RTE_TQS_MAX_LCORE
> > > > >
> > > > > Is this limitation really necessary?
> > > > I added this limitation because currently DPDK application cannot
> > > > take a mask more than 64bit wide. Otherwise, this should be as big
> > > > as
> > > RTE_MAX_LCORE.
> > > > I see that in the case of '-lcores' option, the number of lcores
> > > > can be more than the number of PEs. In this case, we still need a
> > > > MAX limit (but
> > > can be bigger than 64).
> > > >
> > > > > First it means that only lcores can use that API (at least
> > > > > data-path part), second even today many machines have more than 64
> cores.
> > > > > I think you can easily avoid such limitation, if instead of
> > > > > requiring lcore_id as input parameter, you'll just make it
> > > > > return index of
> > > next available entry in w[].
> > > > > Then tqs_update() can take that index as input parameter.
> > > > I had thought about a similar approach based on IDs. I was
> > > > concerned that ID will be one more thing to manage for the
> > > > application. But, I see the
> > > limitations of the current approach now. I will change it to allocation
> based.
> > > This will support even non-EAL pthreads as well.
> > >
> > > Yes, with such approach non-lcore threads will be able to use it also.
> > >
> > I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need
> > to be efficient as they can be called from the worker's packet
> > processing loop (rte_event_dequeue_burst allows blocking. So, the
> > worker thread needs to call rte_tqs_unregister_lcore before calling
> rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet
> processing). Allocating the thread ID in these functions will make them more
> complex.
> >
> > I suggest that we change the argument 'lcore_id' to 'thread_id'. The
> > application could use 'lcore_id' as 'thread_id' if threads are mapped to
> physical cores 1:1.
> >
> > If the threads are not mapped 1:1 to physical cores, the threads need
> > to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not
> > see that DPDK has a thread_id concept. For TQS, the thread IDs are global
> (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation,
> but I think the thread ID allocation should not be part of this library. Such
> thread ID might be useful for other libraries.
> 
> I don't think there is any point to introduce new thread_id concept just for
> that library.
Currently, we have the rte_gettid API, which is used by rte_spinlock. However, the thread ID it returns is the one defined by the OS. The rte_spinlock APIs do not care who defines the thread ID as long as the IDs are unique per thread. I think that if we had a thread_id concept covering non-EAL threads as well, it might help other libraries too. For example, [1] talks about the limitation of the per-lcore cache. I think this limitation could be removed easily if we had a thread_id that lives in a small, well-defined space (rather than the OS-defined thread ID, which may be an arbitrary number). I see similar issues mentioned for rte_timer.
It might also be useful for the dynamic threads Bruce talked about at the Dublin summit (I am not sure about this one, just speculating).

[1] https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#known-issue-label
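
To make the idea concrete, a minimal sketch of such a global thread-id allocator. The names rte_thread_id_alloc/rte_thread_id_free and the RTE_MAX_THREADS bound are hypothetical, not existing DPDK APIs:

#include <stdint.h>
#include <errno.h>

#define RTE_MAX_THREADS 128 /* hypothetical upper bound on thread IDs */

/* Bitmap of in-use thread IDs, shared by all libraries. */
static uint64_t thread_id_bitmap[RTE_MAX_THREADS / 64];

/* Allocate the lowest free thread ID; -ENOSPC if the space is exhausted. */
static int
rte_thread_id_alloc(void)
{
	unsigned int w, bit;
	uint64_t old, new;

	for (w = 0; w < RTE_MAX_THREADS / 64; w++) {
		old = __atomic_load_n(&thread_id_bitmap[w], __ATOMIC_RELAXED);
		while (old != UINT64_MAX) {
			bit = __builtin_ctzll(~old); /* lowest clear bit */
			new = old | (1ULL << bit);
			if (__atomic_compare_exchange_n(&thread_id_bitmap[w],
					&old, new, 0, __ATOMIC_ACQUIRE,
					__ATOMIC_RELAXED))
				return (int)(w * 64 + bit);
			/* CAS failure reloads 'old'; retry with the new value */
		}
	}
	return -ENOSPC;
}

/* Return a thread ID to the pool. */
static void
rte_thread_id_free(int id)
{
	__atomic_and_fetch(&thread_id_bitmap[id / 64],
			~(1ULL << (id % 64)), __ATOMIC_RELEASE);
}

A dense ID space like this is what would allow per-lcore-cache style optimizations to work for non-EAL threads as well.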

> After all we already have a concept of lcore_id which pretty much serves the
> same purpose.
> I still think that we need to either:
> a) make register/unregister to work with any valid lcore_id (<=
> RTE_MAX_LCORE)
I have made this change already; it will be in the next version.

> b) make register/unregister to return index in w[]
> 
> For a) will need mask bigger than 64bits.
> b)  would allow to use data-path API by non-lcores threads too, plus w[]
> would occupy less space, and check() might be faster.
> Though yes, as a drawback, for b) register/unregister probably would need
> extra 'while(CAS(...));' loop.
Along with the CAS, we also need to search for an available index in the array.
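
Roughly what I mean, as a sketch only (the 'in_use' field and the exact layout of struct rte_tqs are assumptions, not the RFC code):

/* Option b) sketch: register() scans w[] for a free slot, claims it with a
 * CAS and returns the index; update() would then take that index. */
int
rte_tqs_register(struct rte_tqs *v)
{
	unsigned int i;
	uint32_t expected;

	for (i = 0; i < RTE_TQS_MAX_THREADS; i++) {
		expected = 0;
		/* Claim the slot atomically; a failed CAS means another
		 * thread took it, so move on to the next slot. */
		if (__atomic_compare_exchange_n(&v->w[i].in_use, &expected, 1,
				0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return (int)i;
	}
	return -ENOSPC; /* no free slot left */
}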

> I suppose the question here do you foresee a lot of concurrent
> register/unregister at data-path?
IMO, yes, because the event dev API can block.
We can solve this by providing separate APIs for allocating/freeing the IDs. I am just questioning where those APIs should live.
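
For context, the worker loop I have in mind looks roughly like this (the rte_tqs_* signatures are as proposed in this RFC; the surrounding variables and helpers are illustrative):

/* Worker loop around a blocking event dequeue: the thread must stop
 * reporting its quiescent state before it can block in
 * rte_event_dequeue_burst(), and resume reporting before it touches the
 * shared data structures again. */
while (!force_quit) {
	/* Do not hold up writers while this thread may block. */
	rte_tqs_unregister_lcore(tqs_v, lcore_id);

	nb_rx = rte_event_dequeue_burst(dev_id, port_id, events,
					RTE_DIM(events), timeout_ticks);

	rte_tqs_register_lcore(tqs_v, lcore_id);

	process_events(events, nb_rx); /* application packet processing */

	/* Report that this thread passed through a quiescent state. */
	rte_tqs_update(tqs_v, lcore_id);
}

With an allocating register(), the thread would additionally have to search for and claim a slot on every iteration, which is the extra cost I want to avoid; hence the suggestion to split the ID allocation out.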

> 
> >
> > <snip>
> >
> > >
> > > >
> > > > >
> > > > > > +
> > > > > > +		while (lcore_mask) {
> > > > > > +			l = __builtin_ctz(lcore_mask);
> > > > > > +			if (v->w[l].cnt != t)
> > > > > > +				break;
> > > > >
> > > > > As I understand, that makes control-path function progress
> > > > > dependent on simultaneous invocation of data-path functions.
> > > > I agree that the control-path function progress (for ex: how long
> > > > to wait for freeing the memory) depends on invocation of the
> > > > data-path functions. The separation of 'start', 'check' and the
> > > > option not to block in
> > > 'check' provide the flexibility for control-path to do some other
> > > work if it chooses to.
> > > >
> > > > > In some cases that might cause control-path to hang.
> > > > > Let say if data-path function wouldn't be called, or user
> > > > > invokes control-path and data-path functions from the same thread.
> > > > I agree with the case of data-path function not getting called. I
> > > > would consider that as programming error. I can document that
> > > > warning in
> > > the rte_tqs_check API.
> > >
> > > Sure, it can be documented.
> > > Though that means, that each data-path thread would have to do
> > > explicit
> > > update() call for every tqs it might use.
> > > I just think that it would complicate things and might limit usage
> > > of the library quite significantly.
> > Each data path thread has to report its quiescent state. Hence, each
> > data-path thread has to call update() (similar to how
> > rte_timer_manage() has to be called periodically on the worker thread).
> 
> I understand that.
> Though that means that each data-path thread has to know explicitly what rcu
> vars it accesses.
Yes, that is correct. It is both good and bad. It provides the flexibility to reduce the overhead. For example, in pipeline mode it may be that a particular data structure is accessed only by some of the threads in the application. In this case, this library allows a per-data-structure TQS variable, which reduces the overhead. The same applies to service cores.
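
As an illustration (names like tqs_hash and the helpers below are hypothetical), a pipeline stage that only reads the hash table reports only on the variable protecting it:

/* Hypothetical pipeline stage: these threads only read the hash table, so
 * they register and report quiescent state only on tqs_hash. Writers of
 * other data structures never have to wait for them. */
static struct rte_tqs *tqs_hash; /* protects readers of the hash table */

static int
hash_stage_loop(void *arg)
{
	unsigned int lcore_id = rte_lcore_id();

	rte_tqs_register_lcore(tqs_hash, lcore_id);
	while (!force_quit) {
		do_hash_lookups();                  /* application work */
		rte_tqs_update(tqs_hash, lcore_id); /* only the var it uses */
	}
	rte_tqs_unregister_lcore(tqs_hash, lcore_id);
	return 0;
}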

> Would be hard to adopt such API with rcu vars used inside some library.
> But ok, as I understand people do use QSBR approach in their apps and find it
> useful.
It can be adopted inside a library with different levels of assumptions/constraints:
1) With the assumption that the data plane threads will update the quiescent state. For example, for the rte_hash library we could ask the user to pass the TQS variable as input, and the rte_hash writer APIs can call the rte_tqs_start and rte_tqs_check APIs.
2) If the assumption in 1) is not acceptable, updating of the quiescent state can be hidden inside the library, but again with the assumption that the data plane library API is called on a regular basis. For example, rte_tqs_update could be called within the rte_hash_lookup API.
3) If we do not want to assume that the data plane API will be called on a regular basis, then the rte_tqs_register/unregister APIs need to be used before and after entering the critical section, along with calling the rte_tqs_update API. For example, rte_hash_lookup would have the sequence rte_tqs_register, <critical section>, rte_tqs_unregister, rte_tqs_update (very similar to GP).
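
A minimal sketch of option 3), assuming the rte_tqs_* signatures proposed in this RFC and a hypothetical wrapper name (this is not what rte_hash does today):

/* Hypothetical wrapper for option 3): the reader brackets the critical
 * section with register/unregister and then reports a quiescent state, so
 * the application never calls the TQS APIs directly. */
static inline int32_t
hash_lookup_tqs(const struct rte_hash *h, const void *key,
		struct rte_tqs *v, unsigned int thread_id)
{
	int32_t ret;

	rte_tqs_register_lcore(v, thread_id);   /* enter read-side section */
	ret = rte_hash_lookup(h, key);          /* critical section */
	rte_tqs_unregister_lcore(v, thread_id); /* leave read-side section */
	rte_tqs_update(v, thread_id);           /* report quiescent state */

	return ret;
}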

> 
> > Do you have any particular use case in mind where this fails?
> 
> Let say it means that library can't be used to add/del RX/TX ethdev callbacks
> in a safe manner.
I need to understand this better. I will look at the rte_ethdev library.

> 
> BTW, two side questions:
> 1) As I understand what you propose is very similar to QSBR main concept.
> Wouldn't it be better to name it accordingly to avoid confusion (or at least
> document it somewhere).
> I think someone else already raised that question.
QSBR stands for Quiescent State Based Reclamation. This library already has 'Thread Quiescent State' in its name. Others have questioned/suggested why not call it RCU instead. I called it thread quiescent state because this library just helps determine whether all the readers have entered the quiescent state; it does not do anything else.

However, you are also bringing up an important point: will we add other methods of memory reclamation? With that in mind, maybe we should not call it just RCU, but rather something like rte_rcu_qsbr_xxx? That would also future-proof the API in case we want to add additional RCU types.

> 2) Would QSBR be the only technique in that lib?
> Any plans to add something similar to GP one too (with MBs at reader-side)?
I believe by GP you mean general-purpose RCU. In my understanding, QSBR is the variant with the least overhead, and for DPDK applications I think reducing that overhead is important. GP adds additional overhead on the reader side. I did not see a need to add any other variants as of now, but if there are use cases that cannot be covered with the proposed APIs, we can definitely expand it.

> 
> >
> > >
> > > >
> > > > In the case of same thread calling both control-path and data-path
> > > > functions, it would depend on the sequence of the calls. The
> > > > following
> > > sequence should not cause any hangs:
> > > > Worker thread
> > > > 1) 'deletes' an entry from a lock-free data structure
> > > > 2) rte_tqs_start
> > > > 3) rte_tqs_update
> > > > 4) rte_tqs_check (wait == 1 or wait == 0)
> > > > 5) 'free' the entry deleted in 1)
> > >
> > > That an interesting idea, and that should help, I think.
> > > Probably worth to have {2,3,4} sequence as a new high level function.
> > >
> > Yes, this is a good idea. Such a function would be applicable only in
> > the worker thread. I would prefer to leave it to the application to take care.
> 
> Yes, it would be applicable only to worker thread, but why we can't have a
> function for it?
> Let say it could be 2 different functions: one doing {2,3,4} - for worker threads,
> and second doing just {2,4} - for control threads.
> Or it could be just one function that takes extra parameter: lcore_id/w[] index.
> If it is some predefined invalid value (-1 or so), step #3 will be skipped.
rte_tqs_start and rte_tqs_check are separated into two APIs so that the writers do not have to spend CPU/memory cycles polling for the readers' quiescent state. In the context of DPDK, this overhead will be significant (at least equal to the length of one iteration of the while loop on the worker core). This is one of the key features of this library, and combining 2, [3], 4 would defeat this purpose. For example, in the rte_hash library, whenever a writer on the data path calls rte_hash_add (with 2, 3, 4 combined), it would wait for the rest of the readers to enter the quiescent state, i.e. performance would drop whenever rte_hash_add is called.
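
To illustrate the writer-side pattern the start/check split enables (the exact rte_tqs_start/rte_tqs_check signatures and the defer-list helpers below are assumptions for the sketch, not the RFC code):

/* Writer path sketch: start the quiescence query, keep doing useful work,
 * and free the deleted entry only once all readers have reported a
 * quiescent state. rte_tqs_check() is called with wait == 0 so the writer
 * never spins waiting for the readers. */
struct entry *e = delete_from_table(table, key); /* logically remove */
uint64_t token;

rte_tqs_start(tqs_v, 1, &token);   /* assumed: returns a token for check() */
defer_list_add(e, token);          /* remember the entry and its token */

/* ... continue with other writer/control-plane work ... */

struct entry *d = defer_list_peek(&token);
if (d != NULL && rte_tqs_check(tqs_v, token, 0) == 1) { /* wait == 0 */
	defer_list_remove(d);
	free(d); /* safe: all readers went quiescent after the delete */
}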

I am trying to understand your concern. If it is that programmers may not use the right sequence, I would prefer to treat that as a programming error. Maybe it is better addressed by providing debugging capabilities.

> 
> Konstantin

