[dpdk-dev] [PATCH v4 1/3] rcu: add RCU library supporting QSBR mechanism

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Mon Apr 15 21:46:28 CEST 2019


> 
> > > > >
> > > > > On Wed, Apr 10, 2019 at 06:20:04AM -0500, Honnappa Nagarahalli
> > > > > wrote:
> > > > > > Add RCU library supporting quiescent state based memory
> > > > > > reclamation method.
> > > > > > This library helps identify the quiescent state of the reader
> > > > > > threads so that the writers can free the memory associated
> > > > > > with the lock-less data structures.
> > > > >
> > > > > I don't see any sign of read-side markers (rcu_read_lock() and
> > > > > rcu_read_unlock() in the Linux kernel, userspace RCU, etc.).
> > > > >
> > > > > Yes, strictly speaking, these are not needed for QSBR to
> > > > > operate, but they
> > > > These APIs would be empty for QSBR.
> > > >
> > > > > make it way easier to maintain and debug code using RCU.  For
> > > > > example, given the read-side markers, you can check for errors
> > > > > like having a call to
> > > > > rte_rcu_qsbr_quiescent() in the middle of a reader quite easily.
> > > > > Without those read-side markers, life can be quite hard and you
> > > > > will really hate yourself for failing to have provided them.
> > > >
> > > > Want to make sure I understood this: do you mean the application
> > > > would mark before and after accessing the shared data structure on
> > > > the reader side?
> > > >
> > > > rte_rcu_qsbr_lock()
> > > > <begin access shared data structure> ...
> > > > ...
> > > > <end access shared data structure>
> > > > rte_rcu_qsbr_unlock()
> > >
> > > Yes, that is the idea.
> > >
> > > > If someone is debugging this code, they have to make sure that there is
> > > > an unlock for every lock and there is no call to
> > > > rte_rcu_qsbr_quiescent in between.
> > > > It sounds good to me. Obviously, they will not add any additional
> > > > cycles either.
> > > > Please let me know if my understanding is correct.
> > >
> > > Yes.  And in some sort of debug mode, you could capture the counter
> > > at
> > > rte_rcu_qsbr_lock() time and check it at rte_rcu_qsbr_unlock() time.
> > > If the counter has advanced too far (more than one, if I am not too
> > > confused) there is a bug.  Also in debug mode, you could have
> > > rte_rcu_qsbr_lock() increment a per-thread counter and
> > > rte_rcu_qsbr_unlock() decrement it.
> > > If the counter is non-zero at a quiescent state, there is a bug.
> > > And so on.
> > >
> > Added this in V5
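To make the debug-mode idea concrete, here is a rough sketch of what such markers could look like (the RTE_LIBRTE_RCU_DEBUG flag, the per-thread 'lock_cnt' field and the exact shape below are placeholders, not necessarily what V5 implements):

/* Sketch only: empty in normal builds, nesting counter in debug builds.
 * 'lock_cnt' is a hypothetical per-thread field added for debug builds.
 */
static __rte_always_inline void
rte_rcu_qsbr_lock(struct rte_rcu_qsbr *v, unsigned int thread_id)
{
#if defined(RTE_LIBRTE_RCU_DEBUG)
	/* Remember that this thread entered a read-side critical section */
	__atomic_fetch_add(&v->qsbr_cnt[thread_id].lock_cnt,
				1, __ATOMIC_ACQUIRE);
#else
	RTE_SET_USED(v);
	RTE_SET_USED(thread_id);
#endif
}

static __rte_always_inline void
rte_rcu_qsbr_unlock(struct rte_rcu_qsbr *v, unsigned int thread_id)
{
#if defined(RTE_LIBRTE_RCU_DEBUG)
	/* Leaving the critical section; must pair with rte_rcu_qsbr_lock() */
	__atomic_fetch_sub(&v->qsbr_cnt[thread_id].lock_cnt,
				1, __ATOMIC_RELEASE);
#else
	RTE_SET_USED(v);
	RTE_SET_USED(thread_id);
#endif
}

/* rte_rcu_qsbr_quiescent()/rte_rcu_qsbr_check() can then assert that
 * lock_cnt is zero and report a bug otherwise.
 */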
> >
> > <snip>
> >
> > > > > > +
> > > > > > +/* Get the memory size of QSBR variable */
> > > > > > +size_t __rte_experimental
> > > > > > +rte_rcu_qsbr_get_memsize(uint32_t max_threads)
> > > > > > +{
> > > > > > +	size_t sz;
> > > > > > +
> > > > > > +	if (max_threads == 0) {
> > > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > > +			"%s(): Invalid max_threads %u\n",
> > > > > > +			__func__, max_threads);
> > > > > > +		rte_errno = EINVAL;
> > > > > > +
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	sz = sizeof(struct rte_rcu_qsbr);
> > > > > > +
> > > > > > +	/* Add the size of quiescent state counter array */
> > > > > > +	sz += sizeof(struct rte_rcu_qsbr_cnt) * max_threads;
> > > > > > +
> > > > > > +	/* Add the size of the registered thread ID bitmap array */
> > > > > > +	sz += RTE_QSBR_THRID_ARRAY_SIZE(max_threads);
> > > > > > +
> > > > > > +	return RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> > > > >
> > > > > Given that you align here, should you also align in the earlier
> > > > > steps in the computation of sz?
> > > >
> > > > Agree. I will remove the align here and keep the earlier one as
> > > > the intent
> > > is to align the thread ID array.
> > >
> > > Sounds good!
> > Added this in V5
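For clarity, after that change the computation would look roughly like this (illustrative only; the actual V5 code may differ). Both struct rte_rcu_qsbr and struct rte_rcu_qsbr_cnt are __rte_cache_aligned, and RTE_QSBR_THRID_ARRAY_SIZE() already rounds the bitmap up to RTE_CACHE_LINE_SIZE, so the trailing RTE_ALIGN() on 'sz' is not needed:

	sz = sizeof(struct rte_rcu_qsbr);

	/* Add the size of quiescent state counter array */
	sz += sizeof(struct rte_rcu_qsbr_cnt) * max_threads;

	/* Add the size of the registered thread ID bitmap array;
	 * the macro above already cache-line aligns it.
	 */
	sz += RTE_QSBR_THRID_ARRAY_SIZE(max_threads);

	return sz;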
> >
> > >
> > > > > > +}
> > > > > > +
> > > > > > +/* Initialize a quiescent state variable */
> > > > > > +int __rte_experimental
> > > > > > +rte_rcu_qsbr_init(struct rte_rcu_qsbr *v, uint32_t max_threads)
> > > > > > +{
> > > > > > +	size_t sz;
> > > > > > +
> > > > > > +	if (v == NULL) {
> > > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > > +		rte_errno = EINVAL;
> > > > > > +
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	sz = rte_rcu_qsbr_get_memsize(max_threads);
> > > > > > +	if (sz == 1)
> > > > > > +		return 1;
> > > > > > +
> > > > > > +	/* Set all the threads to offline */
> > > > > > +	memset(v, 0, sz);
> > > > >
> > > > > We calculate sz here, but it looks like the caller must also
> > > > > calculate it in order to correctly allocate the memory
> > > > > referenced by the "v" argument to this function, with bad things
> > > > > happening if the two calculations get different results.  Should
> > > > > "v" instead be allocated within this function to avoid this sort of
> problem?
> > > >
> > > > Earlier version allocated the memory within this library.
> > > > However, it was decided to go with the current implementation as it
> > > > provides flexibility for the application to manage the memory as it
> > > > sees fit. For ex: it could allocate this as part of another structure
> > > > in a single allocation. This also falls in line with the approach
> > > > taken in other libraries.
> > >
> > > So the allocator APIs vary too much to allow a pointer to the
> > > desired allocator function to be passed in?  Or do you also want to
> > > allow static allocation?  If the latter, would a DEFINE_RTE_RCU_QSBR()
> > > be of use?
> > >
> > This is done to allow allocation of memory for the QS variable as part
> > of another, bigger data structure. This helps avoid fragmenting the
> > memory. For ex:
> >
> > struct xyz {
> >     rte_ring *ring;
> >     rte_rcu_qsbr *v;
> >     abc *t;
> > };
> > struct xyz c;
> >
> > Memory for the above structure can be allocated in one chunk by
> > calculating the size required.
> >
> > In some use cases static allocation might be enough as 'max_threads'
> > might be a compile-time constant. I am not sure how to support both
> > dynamic and static 'max_threads'.
> 
> Same thought here - it would be good to have a static initializer
> (DEFINE_RTE_RCU_QSBR), but that means a new compile-time limit
> ('max_threads') - a thing that we try to avoid.
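To make the single-allocation use case concrete, here is a minimal sketch (the surrounding struct, the rte_zmalloc usage and the names are only an example, not part of the library):

#include <rte_malloc.h>
#include <rte_ring.h>
#include <rte_rcu_qsbr.h>

/* Example only: the QS variable is carved out of the same allocation
 * as an application-level context structure.
 */
struct app_ctx {
	struct rte_ring *ring;
	struct rte_rcu_qsbr *v;	/* points into this same allocation */
};

static struct app_ctx *
app_ctx_create(uint32_t max_threads)
{
	size_t hdr_sz = RTE_ALIGN(sizeof(struct app_ctx), RTE_CACHE_LINE_SIZE);
	size_t qs_sz = rte_rcu_qsbr_get_memsize(max_threads);
	struct app_ctx *ctx;

	if (qs_sz == 1)		/* error reported through rte_errno */
		return NULL;

	/* One chunk: context header followed by the QSBR variable */
	ctx = rte_zmalloc(NULL, hdr_sz + qs_sz, RTE_CACHE_LINE_SIZE);
	if (ctx == NULL)
		return NULL;

	ctx->v = (struct rte_rcu_qsbr *)((char *)ctx + hdr_sz);
	if (rte_rcu_qsbr_init(ctx->v, max_threads) != 0) {
		rte_free(ctx);
		return NULL;
	}
	return ctx;
}

A DEFINE_RTE_RCU_QSBR()-style static definition would avoid the allocation entirely, but as noted above it needs 'max_threads' to be a compile-time constant.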
> 
> >
> > > > > > +	v->max_threads = max_threads;
> > > > > > +	v->num_elems = RTE_ALIGN_MUL_CEIL(max_threads,
> > > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE) /
> > > > > > +			RTE_QSBR_THRID_ARRAY_ELM_SIZE;
> > > > > > +	v->token = RTE_QSBR_CNT_INIT;
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/* Register a reader thread to report its quiescent state
> > > > > > + * on a QS variable.
> > > > > > + */
> > > > > > +int __rte_experimental
> > > > > > +rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id)
> > > > > > +{
> > > > > > +	unsigned int i, id, success;
> > > > > > +	uint64_t old_bmap, new_bmap;
> > > > > > +
> > > > > > +	if (v == NULL || thread_id >= v->max_threads) {
> > > > > > +		rte_log(RTE_LOG_ERR, rcu_log_type,
> > > > > > +			"%s(): Invalid input parameter\n", __func__);
> > > > > > +		rte_errno = EINVAL;
> > > > > > +
> > > > > > +		return 1;
> > > > > > +	}
> > > > > > +
> > > > > > +	id = thread_id & RTE_QSBR_THRID_MASK;
> > > > > > +	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > > +
> > > > > > +	/* Make sure that the counter for registered threads does not
> > > > > > +	 * go out of sync. Hence, additional checks are required.
> > > > > > +	 */
> > > > > > +	/* Check if the thread is already registered */
> > > > > > +	old_bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > > +					__ATOMIC_RELAXED);
> > > > > > +	if (old_bmap & 1UL << id)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	do {
> > > > > > +		new_bmap = old_bmap | (1UL << id);
> > > > > > +		success = __atomic_compare_exchange(
> > > > > > +					RTE_QSBR_THRID_ARRAY_ELM(v, i),
> > > > > > +					&old_bmap, &new_bmap, 0,
> > > > > > +					__ATOMIC_RELEASE, __ATOMIC_RELAXED);
> > > > > > +
> > > > > > +		if (success)
> > > > > > +			__atomic_fetch_add(&v->num_threads,
> > > > > > +						1, __ATOMIC_RELAXED);
> > > > > > +		else if (old_bmap & (1UL << id))
> > > > > > +			/* Someone else registered this thread.
> > > > > > +			 * Counter should not be incremented.
> > > > > > +			 */
> > > > > > +			return 0;
> > > > > > +	} while (success == 0);
> > > > >
> > > > > This would be simpler if threads were required to register
> > > > > themselves.
> > > > > Maybe you have use cases requiring registration of other
> > > > > threads, but this capability is adding significant complexity,
> > > > > so it might be worth some thought.
> > > > >
> > > > It was simple earlier (__atomic_fetch_or). The complexity is added
> > > > because 'num_threads' should not go out of sync.
> > >
> > > Hmmm...
> > >
> > > So threads are allowed to register other threads?  Or is there some
> > > other reason that concurrent registration is required?
> > >
> > Yes, control plane threads can register the fast path threads. Though I am
> > not sure how useful it is, I did not want to add that restriction. I expect
> > that reader threads will register themselves. The reader threads require
> > concurrent registration as they will all be running in parallel.
> > If the requirement of keeping track of the number of currently registered
> > threads goes away, then this function will be simple.
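For comparison, if the 'num_threads' bookkeeping were dropped, registration could collapse to a single atomic OR, roughly as follows (sketch only, not patch code):

/* Sketch: registration without maintaining 'num_threads'. A plain atomic OR
 * on the bitmap word is idempotent and safe against concurrent
 * registrations, so the compare-exchange loop disappears.
 */
int __rte_experimental
rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id)
{
	unsigned int i, id;

	if (v == NULL || thread_id >= v->max_threads) {
		rte_errno = EINVAL;
		return 1;
	}

	id = thread_id & RTE_QSBR_THRID_MASK;
	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;

	__atomic_fetch_or(RTE_QSBR_THRID_ARRAY_ELM(v, i),
				1UL << id, __ATOMIC_RELEASE);
	return 0;
}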
> >
> > <snip>
> >
> > > > > > diff --git a/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > > b/lib/librte_rcu/rte_rcu_qsbr.h new file mode 100644 index
> > > > > > 000000000..ff696aeab
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_rcu/rte_rcu_qsbr.h
> > > > > > @@ -0,0 +1,554 @@
> > > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > > + * Copyright (c) 2018 Arm Limited
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_RCU_QSBR_H_
> > > > > > +#define _RTE_RCU_QSBR_H_
> > > > > > +
> > > > > > +/**
> > > > > > + * @file
> > > > > > + * RTE Quiescent State Based Reclamation (QSBR)
> > > > > > + *
> > > > > > + * Quiescent State (QS) is any point in the thread execution
> > > > > > + * where the thread does not hold a reference to a data structure
> > > > > > + * in shared memory. While using lock-less data structures, the writer
> > > > > > + * can safely free memory once all the reader threads have entered
> > > > > > + * quiescent state.
> > > > > > + *
> > > > > > + * This library provides the ability for the readers to report quiescent
> > > > > > + * state and for the writers to identify when all the readers have
> > > > > > + * entered quiescent state.
> > > > > > + */
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#include <stdio.h>
> > > > > > +#include <stdint.h>
> > > > > > +#include <errno.h>
> > > > > > +#include <rte_common.h>
> > > > > > +#include <rte_memory.h>
> > > > > > +#include <rte_lcore.h>
> > > > > > +#include <rte_debug.h>
> > > > > > +#include <rte_atomic.h>
> > > > > > +
> > > > > > +extern int rcu_log_type;
> > > > > > +
> > > > > > +#if RTE_LOG_DP_LEVEL >= RTE_LOG_DEBUG
> > > > > > +#define RCU_DP_LOG(level, fmt, args...) \
> > > > > > +	rte_log(RTE_LOG_ ## level, rcu_log_type, \
> > > > > > +		"%s(): " fmt "\n", __func__, ## args)
> > > > > > +#else
> > > > > > +#define RCU_DP_LOG(level, fmt, args...)
> > > > > > +#endif
> > > > > > +
> > > > > > +/* Registered thread IDs are stored as a bitmap of 64b element array.
> > > > > > + * Given thread id needs to be converted to index into the array and
> > > > > > + * the id within the array element.
> > > > > > + */
> > > > > > +#define RTE_QSBR_THRID_ARRAY_ELM_SIZE (sizeof(uint64_t) * 8)
> > > > > > +#define RTE_QSBR_THRID_ARRAY_SIZE(max_threads) \
> > > > > > +	RTE_ALIGN(RTE_ALIGN_MUL_CEIL(max_threads, \
> > > > > > +		RTE_QSBR_THRID_ARRAY_ELM_SIZE) >> 3, RTE_CACHE_LINE_SIZE)
> > > > > > +#define RTE_QSBR_THRID_ARRAY_ELM(v, i) ((uint64_t *) \
> > > > > > +	((struct rte_rcu_qsbr_cnt *)(v + 1) + v->max_threads) + i)
> > > > > > +#define RTE_QSBR_THRID_INDEX_SHIFT 6
> > > > > > +#define RTE_QSBR_THRID_MASK 0x3f
> > > > > > +#define RTE_QSBR_THRID_INVALID 0xffffffff
> > > > > > +
> > > > > > +/* Worker thread counter */
> > > > > > +struct rte_rcu_qsbr_cnt {
> > > > > > +	uint64_t cnt;
> > > > > > +	/**< Quiescent state counter. Value 0 indicates the thread is offline */
> > > > > > +} __rte_cache_aligned;
> > > > > > +
> > > > > > +#define RTE_QSBR_CNT_THR_OFFLINE 0
> > > > > > +#define RTE_QSBR_CNT_INIT 1
> > > > > > +
> > > > > > +/* RTE Quiescent State variable structure.
> > > > > > + * This structure has two elements that vary in size based on the
> > > > > > + * 'max_threads' parameter.
> > > > > > + * 1) Quiescent state counter array
> > > > > > + * 2) Register thread ID array
> > > > > > + */
> > > > > > +struct rte_rcu_qsbr {
> > > > > > +	uint64_t token __rte_cache_aligned;
> > > > > > +	/**< Counter to allow for multiple concurrent quiescent state queries */
> > > > > > +
> > > > > > +	uint32_t num_elems __rte_cache_aligned;
> > > > > > +	/**< Number of elements in the thread ID array */
> > > > > > +	uint32_t num_threads;
> > > > > > +	/**< Number of threads currently using this QS variable */
> > > > > > +	uint32_t max_threads;
> > > > > > +	/**< Maximum number of threads using this QS variable */
> > > > > > +
> > > > > > +	struct rte_rcu_qsbr_cnt qsbr_cnt[0] __rte_cache_aligned;
> > > > > > +	/**< Quiescent state counter array of 'max_threads' elements */
> > > > > > +
> > > > > > +	/**< Registered thread IDs are stored in a bitmap array,
> > > > > > +	 *   after the quiescent state counter array.
> > > > > > +	 */
> > > > > > +} __rte_cache_aligned;
> > > > > > +
> >
> > <snip>
> >
> > > > > > +
> > > > > > +/* Check the quiescent state counter for registered threads only,
> > > > > > + * assuming that not all threads have registered.
> > > > > > + */
> > > > > > +static __rte_always_inline int
> > > > > > +__rcu_qsbr_check_selective(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > > > > +{
> > > > > > +	uint32_t i, j, id;
> > > > > > +	uint64_t bmap;
> > > > > > +	uint64_t c;
> > > > > > +	uint64_t *reg_thread_id;
> > > > > > +
> > > > > > +	for (i = 0, reg_thread_id = RTE_QSBR_THRID_ARRAY_ELM(v, 0);
> > > > > > +		i < v->num_elems;
> > > > > > +		i++, reg_thread_id++) {
> > > > > > +		/* Load the current registered thread bit map before
> > > > > > +		 * loading the reader thread quiescent state counters.
> > > > > > +		 */
> > > > > > +		bmap = __atomic_load_n(reg_thread_id, __ATOMIC_ACQUIRE);
> > > > > > +		id = i << RTE_QSBR_THRID_INDEX_SHIFT;
> > > > > > +
> > > > > > +		while (bmap) {
> > > > > > +			j = __builtin_ctzl(bmap);
> > > > > > +			RCU_DP_LOG(DEBUG,
> > > > > > +				"%s: check: token = %lu, wait = %d, Bit Map = 0x%lx, Thread ID = %d",
> > > > > > +				__func__, t, wait, bmap, id + j);
> > > > > > +			c = __atomic_load_n(
> > > > > > +					&v->qsbr_cnt[id + j].cnt,
> > > > > > +					__ATOMIC_ACQUIRE);
> > > > > > +			RCU_DP_LOG(DEBUG,
> > > > > > +				"%s: status: token = %lu, wait = %d, Thread QS cnt = %lu, Thread ID = %d",
> > > > > > +				__func__, t, wait, c, id+j);
> > > > > > +			/* Counter is not checked for wrap-around condition
> > > > > > +			 * as it is a 64b counter.
> > > > > > +			 */
> > > > > > +			if (unlikely(c != RTE_QSBR_CNT_THR_OFFLINE && c < t)) {
> > > > >
> > > > > This assumes that a 64-bit counter won't overflow, which is
> > > > > close enough to true given current CPU clock frequencies.  ;-)
> > > > >
> > > > > > +				/* This thread is not in quiescent state */
> > > > > > +				if (!wait)
> > > > > > +					return 0;
> > > > > > +
> > > > > > +				rte_pause();
> > > > > > +				/* This thread might have unregistered.
> > > > > > +				 * Re-read the bitmap.
> > > > > > +				 */
> > > > > > +				bmap = __atomic_load_n(reg_thread_id,
> > > > > > +						__ATOMIC_ACQUIRE);
> > > > > > +
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +
> > > > > > +			bmap &= ~(1UL << j);
> > > > > > +		}
> > > > > > +	}
> > > > > > +
> > > > > > +	return 1;
> > > > > > +}
> > > > > > +
> > > > > > +/* Check the quiescent state counter for all threads, assuming that
> > > > > > + * all the threads have registered.
> > > > > > + */
> > > > > > +static __rte_always_inline int
> > > > > > +__rcu_qsbr_check_all(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
> > > > >
> > > > > Does checking the bitmap really take long enough to make this
> > > > > worthwhile as a separate function?  I would think that the
> > > > > bitmap-checking time would be lost in the noise of cache misses
> > > > > from the ->cnt loads.
> > > >
> > > > It avoids accessing one cache line. I think this is where the savings
> > > > are (maybe in theory). This is the most probable use case.
> > > > On the other hand, __rcu_qsbr_check_selective() will result in savings
> > > > (depending on how many threads are currently registered) by avoiding
> > > > accessing unwanted counters.
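To make the comparison concrete, the "all registered" path is essentially a flat walk over the counter array and never touches the thread-ID bitmap cache line; a rough sketch (not the exact patch code) looks like this:

static __rte_always_inline int
__rcu_qsbr_check_all(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
{
	uint32_t i;
	uint64_t c;

	for (i = 0; i < v->max_threads; i++) {
		while (1) {
			c = __atomic_load_n(&v->qsbr_cnt[i].cnt,
						__ATOMIC_ACQUIRE);
			/* Offline threads and threads that have moved past
			 * the token are considered quiescent.
			 */
			if (likely(c == RTE_QSBR_CNT_THR_OFFLINE || c >= t))
				break;
			if (!wait)
				return 0;
			rte_pause();
		}
	}
	return 1;
}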
> > >
> > > Do you really expect to be calling this function on any kind of fastpath?
> >
> > Yes. For some of the libraries (rte_hash), the writer is on the fast path.
> >
> > >
> > > > > Sure, if you invoke __rcu_qsbr_check_selective() in a tight loop
> > > > > in the absence of readers, you might see __rcu_qsbr_check_all()
> > > > > being a bit faster.  But is that really what DPDK does?
> > > > I see improvements in the synthetic test case (similar to the one you
> > > > have described, around 27%). However, in the more practical test
> > > > cases I do not see any difference.
> > >
> > > If the performance improvement only occurs in a synthetic test case,
> > > does it really make sense to optimize for it?
> > I had to fix a few issues in the performance test cases and added more to
> > do the comparison. These changes are in v5.
> > There are 4 performance tests involving this API.
> > 1) 1 Writer, 'N' readers
> >      Writer: qsbr_start, qsbr_check(wait = true)
> >      Readers: qsbr_quiescent
> > 2) 'N' writers
> >      Writers: qsbr_start, qsbr_check(wait == false)
> > 3) 1 Writer, 'N' readers (this test uses the lock-free rte_hash data structure)
> >      Writer: hash_del, qsbr_start, qsbr_check(wait = true), validate that
> >      the reader was able to complete its work successfully
> >      Readers: thread_online, hash_lookup, access the pointer - do some
> >      work on it, qsbr_quiescent, thread_offline
> > 4) Same as test 3) but qsbr_check(wait == false)
> >
> > There are 2 sets of these tests.
> > a) QS variable is created with number of threads same as number of
> > readers - this will exercise __rcu_qsbr_check_all
> > b) QS variable is created with 128 threads, number of registered
> > threads is same as in a) - this will exercise
> > __rcu_qsbr_check_selective
> >
> > Following are the results on x86 (E5-2660 v4 @ 2.00GHz) comparing from
> > a) to b) (on x86 in my setup, the results are not very stable between
> > runs)
> > 1) 25%
> > 2) -3%
> > 3) -0.4%
> > 4) 1.38%
> >
> > Following are the results on an Arm system comparing from a) to b)
> > (results are not pretty stable between runs)
                           ^^^
Correction: on the Arm system the results *are* stable between runs (copy-paste error above).

> > 1) -3.45%
> > 2) 0%
> > 3) -0.03%
> > 4) -0.04%
> >
> > Konstantin, is it possible to run the tests on your setup and look at the
> > results?
> 
> I did run V5 on my box (SKX 2.1 GHz) with 17 lcores (1 physical core per
> thread).
> Didn't notice any significant fluctuations between runs, output below.
> 
> >rcu_qsbr_perf_autotest
> Number of cores provided = 17
> Perf test with all reader threads registered
> --------------------------------------------
> 
> Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
> Total RCU updates = 65707232899
> Cycles per 1000 updates: 18482
> Total RCU checks = 20000000
> Cycles per 1000 checks: 3794991
> 
> Perf Test: 17 Readers
> Total RCU updates = 1700000000
> Cycles per 1000 updates: 2128
> 
> Perf test: 17 Writers ('wait' in qsbr_check == false)
> Total RCU checks = 340000000
> Cycles per 1000 checks: 10030
> 
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 1984696
> Cycles per 1 check(start, check): 2619002
> 
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 2028030
> Cycles per 1 check(start, check): 2876667
> 
> Perf test with some of reader threads registered
> ------------------------------------------------
> 
> Perf Test: 16 Readers/1 Writer('wait' in qsbr_check == true)
> Total RCU updates = 68850073055
> Cycles per 1000 updates: 25490
> Total RCU checks = 20000000
> Cycles per 1000 checks: 5484403
> 
> Perf Test: 17 Readers
> Total RCU updates = 1700000000
> Cycles per 1000 updates: 2127
> 
> Perf test: 17 Writers ('wait' in qsbr_check == false)
> Total RCU checks = 340000000
> Cycles per 1000 checks: 10034
> 
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Blocking QSBR Check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 3604489
> Cycles per 1 check(start, check): 7077372
> 
> Perf test: 1 writer, 17 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 3936831
> Cycles per 1 check(start, check): 7262738
> 
> 
> Test OK
Thanks for running the test. From the numbers, the comparison is as follows:
1) -44%
2) 0.03%
3) -170%
4) -152%

The trend is the same between x86 and Arm. However, x86 shows a drastic improvement with the __rcu_qsbr_check_all function.
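For reference, the pattern exercised by test 3) (and by 4) with wait == false) is roughly the following; names are as used in this patch set, exact signatures may differ in V5, and error handling is omitted:

/* Sketch of the reader/writer pattern used by the lock-free rte_hash tests */
static void
reader_loop(struct rte_rcu_qsbr *v, unsigned int thread_id)
{
	rte_rcu_qsbr_thread_online(v, thread_id);
	/* ... hash_lookup() and use of the returned object goes here ... */
	rte_rcu_qsbr_quiescent(v, thread_id);
	rte_rcu_qsbr_thread_offline(v, thread_id);
}

static void
writer_delete(struct rte_rcu_qsbr *v, void *deleted_entry)
{
	uint64_t token;

	/* Entry is already removed from the hash table (hash_del) */
	token = rte_rcu_qsbr_start(v);
	/* wait == true: block until every registered reader has passed
	 * through a quiescent state after 'token'; then it is safe to free.
	 */
	rte_rcu_qsbr_check(v, token, true);
	rte_free(deleted_entry);
}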

> 
> Konstantin


