[RFC] random: use per lcore state

Morten Brørup mb at smartsharesystems.com
Mon Sep 11 18:37:58 CEST 2023


> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Monday, 11 September 2023 17.04
> 
> On Wed, 6 Sep 2023 19:54:32 +0200
> Morten Brørup <mb at smartsharesystems.com> wrote:
> 
> > >
> > > -	idx = rte_lcore_id();
> > > +	seed = __atomic_load_n(&rte_rand_seed, __ATOMIC_RELAXED);
> > > +	if (unlikely(seed != rand_state->seed)) {
> >
> > Please note that rte_rand_seed lives in a completely different cache
> > line than RTE_PER_LCORE(rte_rand_state), so the comparison with
> > rte_rand_seed requires reading one more cache line than the original
> > implementation, which only uses the cache line holding
> > rand_states[idx].
> >
> > This is in the hot path.
> >
> > If we could register a per-thread INIT function, the lazy
> > initialization could be avoided, and only one cache line accessed.
> 
> Since rte_rand_seed rarely changes, it will get cached by each cpu.

Agreed. The point I was trying to convey is that it will then consume a cache line on every CPU, and cache is a scarce resource on some hardware.

If reading rte_rand_seed is gated by a (bool) rand_state->initialized flag, the __rte_rand_get_state() function is less likely to evict other data from the cache to make room for rte_rand_seed.
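
Something like this is what I have in mind. It is only a rough sketch: the 'initialized' field, the TLS variable name and the direct call to rte_rand.c's __rte_srand_lcg() seeding helper are my assumptions, not something taken from the patch.

#include <stdbool.h>
#include <stdint.h>

#include <rte_branch_prediction.h>
#include <rte_common.h>
#include <rte_per_lcore.h>

struct rte_rand_state {
        uint64_t z1;
        uint64_t z2;
        uint64_t z3;
        uint64_t z4;
        uint64_t z5;
        bool initialized;       /* assumed flag, not in the patch */
};

static uint64_t rte_rand_seed;  /* shared seed, written by rte_srand() */

RTE_DEFINE_PER_LCORE(struct rte_rand_state, rte_rand_state);

/* Expands the seed into the five state words; defined in rte_rand.c. */
static void __rte_srand_lcg(uint64_t seed, struct rte_rand_state *rand_state);

static __rte_always_inline struct rte_rand_state *
__rte_rand_get_state(void)
{
        struct rte_rand_state *rand_state = &RTE_PER_LCORE(rte_rand_state);

        /* Hot path: only the TLS cache line holding rand_state is read;
         * the cache line holding rte_rand_seed is not touched. */
        if (likely(rand_state->initialized))
                return rand_state;

        /* Cold path, once per thread: read the shared seed and expand it. */
        __rte_srand_lcg(__atomic_load_n(&rte_rand_seed, __ATOMIC_RELAXED),
                        rand_state);
        rand_state->initialized = true;

        return rand_state;
}

The obvious trade-off: the sketch assumes the application calls rte_srand() before the lcores start calling rte_rand(); a later reseed would not reach already initialized threads.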

> The problem was before there was prefetch cache overlap causing false
> sharing.

Yes, and my patch also wastes a cache line per state, because of the cache guard.
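
To illustrate the cost, here is roughly the guarded array layout (a sketch of rte_rand.c's state, assuming 64 byte cache lines):

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>

struct rte_rand_state {
        uint64_t z1;    /* 5 * 8 = 40 bytes of actual state */
        uint64_t z2;
        uint64_t z3;
        uint64_t z4;
        uint64_t z5;
        RTE_CACHE_GUARD;        /* guards against next-line prefetchers */
} __rte_cache_aligned;

/* Each entry now spans two cache lines: one holding the 40 bytes of state
 * (padded to 64), and one of guard padding. */
static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];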

So far, we have used this simple function for a discussion of principles and design patterns... so let me bring something new into the discussion, which is relevant on both the abstract and the practical level:

We could generally keep using the rte_module_data[RTE_MAX_LCORE] design pattern, but make an exception for data structures significantly smaller than a cache line. If we use TLS for those, we avoid the padding gaps between the per-lcore instances, which is even more cache efficient.
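
In other words, something along these lines (the rte_module_data/rte_module_counter names are just placeholders for any module following the pattern):

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>
#include <rte_per_lcore.h>

/* Today's general pattern: one cache aligned (and guarded) entry per lcore,
 * even when the module only needs a few bytes per lcore. */
struct rte_module_data {
        uint32_t counter;       /* hypothetical small per-lcore data */
        RTE_CACHE_GUARD;
} __rte_cache_aligned;

static struct rte_module_data rte_module_data[RTE_MAX_LCORE];

/* The proposed exception for data much smaller than a cache line: plain TLS.
 * The per-thread copies of all such small variables are packed back to back,
 * so there are no padding gaps, and no guard is needed, because TLS is
 * private to the thread. */
RTE_DEFINE_PER_LCORE(uint32_t, rte_module_counter);

Access then changes from rte_module_data[rte_lcore_id()].counter to RTE_PER_LCORE(rte_module_counter).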

Let's assume that (unsigned int) RTE_PER_LCORE(_lcore_id) is currently the only data in TLS. If we add (struct rte_rand_state) RTE_PER_LCORE(rand_state), it will go into the same cache line as RTE_PER_LCORE(_lcore_id), and effectively only cost a fraction of a cache line.
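
For reference, the numbers, assuming 64 byte cache lines and the five 8 byte state words (z1..z5) of the current implementation:

  RTE_PER_LCORE(_lcore_id):    4 bytes (existing)
  struct rte_rand_state:      40 bytes (new)
  TLS footprint per thread:   44 bytes, i.e. within a single cache line

Compare that with roughly (RTE_MAX_LCORE + 1) * 128 bytes for the guarded array, of which each CPU only ever touches its own entry.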



