[dpdk-dev] [RFC v2 1/2] rcu: add RCU library supporting QSBR mechanism
Honnappa Nagarahalli
Honnappa.Nagarahalli at arm.com
Fri Feb 22 08:07:59 CET 2019
> > <snip>
> >
> > > > > > > > +/**
> > > > > > > > + * RTE thread Quiescent State structure.
> > > > > > > > + */
> > > > > > > > +struct rte_rcu_qsbr {
> > > > > > > > + uint64_t reg_thread_id[RTE_QSBR_BIT_MAP_ELEMS]
> > > > > > > __rte_cache_aligned;
> > > > > > > > + /**< Registered reader thread IDs - reader threads reporting
> > > > > > > > + * on this QS variable represented in a bit map.
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > + uint64_t token __rte_cache_aligned;
> > > > > > > > + /**< Counter to allow for multiple simultaneous QS
> > > > > > > > +queries */
> > > > > > > > +
> > > > > > > > + struct rte_rcu_qsbr_cnt w[RTE_RCU_MAX_THREADS]
> > > > > > > __rte_cache_aligned;
> > > > > > > > + /**< QS counter for each reader thread, counts upto
> > > > > > > > + * current value of token.
> > > > > > >
> > > > > > > As I understand you decided to stick with neutral thread_id
> > > > > > > and let user define what exactly thread_id is (lcore, syste,
> > > > > > > thread id, something
> > > > > else)?
> > > > > > Yes, that is correct. I will reply to the other thread to
> > > > > > continue the
> > > discussion.
> > > > > >
> > > > > > > If so, can you probably get rid of RTE_RCU_MAX_THREADS
> limitation?
> > > > > > I am not seeing this as a limitation. The user can change this
> > > > > > if required. May
> > > > > be I should change it as follows:
> > > > > > #ifndef RTE_RCU_MAX_THREADS
> > > > > > #define RTE_RCU_MAX_THREADS 128 #endif
> > > > >
> > > > > Yep, that's better, though it would still require user to
> > > > > rebuild the code if he would like to increase total number of threads
> supported.
> > > > Agree
> > > >
> > > > > Though it seems relatively simply to extend current code to
> > > > > support dynamic max thread num here (2 variable arrays plus
> > > > > shift value plus
> > > mask).
> > > > Agree, supporting dynamic 'max thread num' is simple. But this
> > > > means memory needs to be allocated to the arrays. The API
> > > > 'rte_rcu_qsbr_init' has to take max thread num as the parameter.
> > > > We also
> > > have to introduce another API to free this memory. This will become
> > > very similar to alloc/free APIs I had in the v1.
> > > > I hope I am following you well, please correct me if not.
> > >
> > > I think we can still leave alloc/free tasks to the user.
> > > We probabply just need extra function rte_rcu_qsbr_size(uint32_
> > > max_threads) to help user calculate required size.
> > > rte_rcu_qsbr_init() might take as an additional parameter 'size' to
> > > make checks.
> > The size is returned by an API provided by the library. Why does it
> > need to be validated again? If 'size' is required for rte_rcu_qsbr_init, it
> could calculate it again.
>
> Just as extra-safety check.
> I don't have strong opinion here - if you think it is overkill, let's drop it.
>
>
> >
> > > Thought about something like that:
> > >
> > > size_t sz = rte_rcu_qsbr_size(max_threads); struct rte_rcu_qsbr
> > > *qsbr = alloc_aligned(CACHE_LINE, sz); rte_rcu_qsbr_init(qsbr,
> max_threads, sz); ...
> > >
> > Do you see any advantage for allowing the user to allocate the memory?
> So user can choose where to allocate the memory (eal malloc, normal malloc,
> stack, something else).
> Again user might decide to make rcu part of some complex data structure - in
> that case he probably would like to allocate one big chunk of memory at once
> and then provide part of it for rcu.
> Or some other usage scenario that I can't predict.
>
I made this change and added performance tests similar to liburcu. With the dynamic memory allocation change the performance of rte_rcu_qsbr_update comes down by 42% - 45% and that of rte_rcu_qsbr_check also comes down by 133% on Arm platform. On x86 (E5-2660 v4 @ 2.00GHz), the results are mixed. rte_rcu_qsbr_update comes down by 15%, but that of rte_rcu_qsbr_check improves.
On the Arm platform, the issue seems to be due to address calculation that needs to happen at run time. If I fix the reg_thread_id array size, I am getting back/improving the performance both for Arm and x86. What this means is, we will still have a max thread limitation, but it will be high - 512 (1 cache line). We could make this 1024 (2 cache lines). However, per thread counter data size will depend on the 'max thread' provided by the user. I think this solution serves your requirement (though with an acceptable constraint not affecting the near future), please let me know what you think.
These changes and the 3 variants of the implementation are present in RFC v3 [1], in case you want to run these tests.
1/5, 2/5 - same as RFC v2 + 1 bug fixed
3/5 - Addition of rte_rcu_qsbr_get_memsize. Memory size for register thread bitmap array as well as per thread counter data is calculated based on max_threads parameter
4/5 - Test cases are modified to use the new API
5/5 - Size of register thread bitmap array is fixed to hold 512 thread IDs. However, the per thread counter data is calculated based on max_threads parameter.
If you do not want to run the tests, you can just look at 3/5 and 5/5.
[1] http://patchwork.dpdk.org/cover/50431/
> > This approach requires the user to call 3 APIs (including memory
> > allocation). These 3 can be abstracted in a rte_rcu_qsbr_alloc API, user has
> to call just 1 API.
> >
> > > Konstantin
> > >
<snip>
More information about the dev
mailing list