[dpdk-dev] [RFC 2/3] tqs: add thread quiescent state library

Ananyev, Konstantin konstantin.ananyev at intel.com
Wed Nov 28 16:25:21 CET 2018


> >
> > Hi Honnappa,
> Thank you for reviewing the patch, appreciate your comments.
> 
> >
> > > +
> > > +/* Allocate a new TQS variable with the name *name* in memory. */
> > > +struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,
> > > +int socket_id, uint64_t lcore_mask) {
> > > +	char tqs_name[RTE_TQS_NAMESIZE];
> > > +	struct rte_tailq_entry *te, *tmp_te;
> > > +	struct rte_tqs_list *tqs_list;
> > > +	struct rte_tqs *v, *tmp_v;
> > > +	int ret;
> > > +
> > > +	if (name == NULL) {
> > > +		RTE_LOG(ERR, TQS, "Invalid input parameters\n");
> > > +		rte_errno = -EINVAL;
> > > +		return NULL;
> > > +	}
> > > +
> > > +	te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
> > > +	if (te == NULL) {
> > > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
> > > +		rte_errno = -ENOMEM;
> > > +		return NULL;
> > > +	}
> > > +
> > > +	snprintf(tqs_name, sizeof(tqs_name), "%s", name);
> > > +	v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
> > > +				RTE_CACHE_LINE_SIZE, socket_id);
> > > +	if (v == NULL) {
> > > +		RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
> > > +		rte_errno = -ENOMEM;
> > > +		goto alloc_error;
> > > +	}
> > > +
> > > +	ret = snprintf(v->name, sizeof(v->name), "%s", name);
> > > +	if (ret < 0 || ret >= (int)sizeof(v->name)) {
> > > +		rte_errno = -ENAMETOOLONG;
> > > +		goto alloc_error;
> > > +	}
> > > +
> > > +	te->data = (void *) v;
> > > +	v->lcore_mask = lcore_mask;
> > > +
> > > +	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
> > > +
> > > +	tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
> > > +
> > > +	/* Search if a TQS variable with the same name exists already */
> > > +	TAILQ_FOREACH(tmp_te, tqs_list, next) {
> > > +		tmp_v = (struct rte_tqs *) tmp_te->data;
> > > +		if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
> > > +			break;
> > > +	}
> > > +
> > > +	if (tmp_te != NULL) {
> > > +		rte_errno = -EEXIST;
> > > +		goto tqs_exist;
> > > +	}
> > > +
> > > +	TAILQ_INSERT_TAIL(tqs_list, te, next);
> > > +
> > > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> > > +
> > > +	return v;
> > > +
> > > +tqs_exist:
> > > +	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
> > > +
> > > +alloc_error:
> > > +	rte_free(te);
> > > +	rte_free(v);
> > > +	return NULL;
> > > +}
> >
> > That seems like quite a heavy-weight function just to allocate a sync variable.
> > As the size of struct rte_tqs is constant and known to the user, it might be better
> > to just provide rte_tqs_init(struct rte_tqs *tqs, ...) and let the user allocate/free
> > the memory for it himself.
> >
> I believe, when you say heavy-weight, you are referring to adding the tqs variable to the TAILQ and allocating the memory for it.

Yes.

> Agreed. I also
> do not expect a whole lot of tqs variables to be used in an application. Even in rte_tqs_free there is similar overhead.
> 
> The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the
> rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as
> well.
> 
> Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will
> remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.

Yes, that's what I suggest.
My thought was: it is just another data structure used for synchronization (like spinlock, rwlock, etc.),
so it should be possible to allocate it statically, and we probably don't need the ability
to look such a variable up by name via the tailq.

> 
> > > +
> > > +/* Add a reader thread, running on an lcore, to the list of threads
> > > + * reporting their quiescent state on a TQS variable.
> > > + */
> > > +int __rte_experimental
> > > +rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
> > > +	TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
> > > +				-EINVAL);
> >
> > It is not very good practice to make a function return different values and
> > behave differently in debug and non-debug mode.
> > I'd say that for the slow path (functions in .c) it is always good to check input
> > parameters.
> > For the fast path (functions in .h) we sometimes skip such checking, but debug
> > mode can probably use RTE_ASSERT() or so.
> Makes sense, I will change this in the next version.
> 
> >
> >
> > lcore_id >= RTE_TQS_MAX_LCORE
> >
> > Is this limitation really necessary?
> I added this limitation because currently a DPDK application cannot take a mask more than 64 bits wide. Otherwise, this should be as big as
> RTE_MAX_LCORE.
> I see that in the case of the '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit
> (but it can be bigger than 64).
> 
> > First, it means that only lcores can use that API (at least the data-path part);
> > second, even today many machines have more than 64 cores.
> > I think you can easily avoid such a limitation: instead of requiring lcore_id as
> > an input parameter, just make it return the index of the next available entry in w[].
> > Then tqs_update() can take that index as an input parameter.
> I had thought about a similar approach based on IDs. I was concerned that an ID would be one more thing for the application to manage. But, I
> see the limitations of the current approach now. I will change it to be allocation based. This will support non-EAL pthreads as well.

Yes, with such an approach non-lcore threads will be able to use it as well.

 
> >
> > > +
> >
> > > +	/* Worker thread has to count the quiescent states
> > > +	 * only from the current value of token.
> > > +	 */
> > > +	v->w[lcore_id].cnt = v->token;
> >
> > I wonder what would happen if a new reader calls register() after the writer calls
> > start()?
> > Looks like a race condition.
> > Or is such a pattern not supported?
> The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get a
> reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is
> already up to date. (I am missing a load-acquire on the token; I will correct that in the next version.)

Yes, with _acquire_ in place it seems to be good here.  

> 
> >
> > > +
> > > +	/* Release the store to initial TQS count so that workers
> > > +	 * can use it immediately after this function returns.
> > > +	 */
> > > +	__atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
> > > +__ATOMIC_RELEASE);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Trigger the worker threads to report the quiescent state
> > > + * status.
> > > + *
> > > + * This is implemented as a lock-free function. It is multi-thread
> > > + * safe and can be called from the worker threads as well.
> > > + *
> > > + * @param v
> > > + *   TQS variable
> > > + * @param n
> > > + *   Expected number of times the quiescent state is entered
> > > + * @param t
> > > + *   - If successful, this is the token for this call of the API.
> > > + *     This should be passed to rte_tqs_check API.
> > > + * @return
> > > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).
> > > + *   - 0 Otherwise and always (non-debug mode compilation).
> > > + */
> > > +static __rte_always_inline int __rte_experimental
> > > +rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {
> > > +	TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
> > > +
> > > +	/* This store release will ensure that changes to any data
> > > +	 * structure are visible to the workers before the token
> > > +	 * update is visible.
> > > +	 */
> > > +	*t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Update quiescent state for the worker thread on a lcore.
> > > + *
> > > + * This is implemented as a lock-free function. It is multi-thread safe.
> > > + * All the worker threads registered to report their quiescent state
> > > + * on the TQS variable must call this API.
> > > + *
> > > + * @param v
> > > + *   TQS variable
> > > + */
> > > +static __rte_always_inline void __rte_experimental
> > > +rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {
> > > +	uint32_t t;
> > > +
> > > +	TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
> > > +
> > > +	/* Load the token before the worker thread loads any other
> > > +	 * (lock-free) data structure. This ensures that updates
> > > +	 * to the data structures are visible if the update
> > > +	 * to token is visible.
> > > +	 */
> > > +	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
> >
> > Hmm, I am not very familiar with the C11 model, but it looks like a race condition
> > to me:
> > as I understand it, update() is supposed to be called at the end of the reader's
> > critical section, correct?
> Yes, the understanding is correct.
> 
> > But ACQUIRE is only a hoist barrier, which means the compiler and CPU are free to
> > move earlier reads (and writes) after it.
> Yes, your understanding is correct.
> 
> > It probably needs to be a full ACQ_REL here.
> >
> The sequence of operations is as follows:
> 1) Writer 'deletes' an entry from a lock-free data structure
> 2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to
> 'token' is visible (to the data plane threads).
> 3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
> 	a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if
> that reference is due to earlier reads being moved after load-acquire of 'token').
> 	b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to
> reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)

Yes, you're right; it was me being confused.

> 
> 
> > > +	if (v->w[lcore_id].cnt != t)
> > > +		v->w[lcore_id].cnt++;
> > > +}
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Checks if all the worker threads have entered the quiescent state
> > > + * 'n' number of times. 'n' is provided in rte_tqs_start API.
> > > + *
> > > + * This is implemented as a lock-free function. It is multi-thread
> > > + * safe and can be called from the worker threads as well.
> > > + *
> > > + * @param v
> > > + *   TQS variable
> > > + * @param t
> > > + *   Token returned by rte_tqs_start API
> > > + * @param wait
> > > + *   If true, block till all the worker threads have completed entering
> > > + *   the quiescent state 'n' number of times
> > > + * @return
> > > + *   - 0 if all worker threads have NOT passed through specified number
> > > + *     of quiescent states.
> > > + *   - 1 if all worker threads have passed through specified number
> > > + *     of quiescent states.
> > > + *   - -EINVAL if the parameters are invalid (debug mode compilation only).
> > > + */
> > > +static __rte_always_inline int __rte_experimental
> > > +rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {
> > > +	uint64_t l;
> > > +	uint64_t lcore_mask;
> > > +
> > > +	TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
> > > +
> > > +	do {
> > > +		/* Load the current lcore_mask before loading the
> > > +		 * worker thread quiescent state counters.
> > > +		 */
> > > +		lcore_mask = __atomic_load_n(&v->lcore_mask,
> > > +						__ATOMIC_ACQUIRE);
> >
> > What would happen if a reader calls unregister() simultaneously with check()
> > and updates lcore_mask straight after that load?
> > As I understand it, check() might hang in such a case.
> If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the
> 'lcore_mask' is loaded again.

True, my confusion again.

> 
> >
> > > +
> > > +		while (lcore_mask) {
> > > +			l = __builtin_ctz(lcore_mask);
> > > +			if (v->w[l].cnt != t)
> > > +				break;
> >
> > As I understand it, that makes the progress of the control-path function dependent on
> > simultaneous invocation of the data-path functions.
> I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path
> functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other
> work if it chooses to.
> 
> > In some cases that might cause the control path to hang.
> > Say the data-path function is never called, or the user invokes control-path
> > and data-path functions from the same thread.
> I agree with the case of the data-path function not getting called. I would consider that a programming error. I can document that warning in
> the rte_tqs_check API.

Sure, it can be documented.
Though that means that each data-path thread would have to make an explicit update() call
for every tqs it might use.
I just think that would complicate things and might limit usage of the library quite significantly.
 
> 
> In the case of same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following
> sequence should not cause any hangs:
> Worker thread
> 1) 'deletes' an entry from a lock-free data structure
> 2) rte_tqs_start
> 3) rte_tqs_update
> 4) rte_tqs_check (wait == 1 or wait == 0)
> 5) 'free' the entry deleted in 1)

That's an interesting idea, and it should help, I think.
It is probably worth having the {2,3,4} sequence as a new high-level function.

> 
> If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
> I can document this (I do not think API documentation is required for it).
> 
> >
> > > +
> > > +			lcore_mask &= ~(1UL << l);
> > > +		}
> > > +
> > > +		if (lcore_mask == 0)
> > > +			return 1;
> > > +
> > > +		rte_pause();
> > > +	} while (wait);
> > > +
> > > +	return 0;
> > > +}
> > > +


