[dpdk-dev] [PATCH v8 0/4] lib/rcu: add RCU library supporting QSBR mechanism

Ananyev, Konstantin konstantin.ananyev at intel.com
Fri Apr 26 14:04:58 CEST 2019



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:honnappa.nagarahalli at arm.com]
> Sent: Friday, April 26, 2019 5:40 AM
> To: Ananyev, Konstantin <konstantin.ananyev at intel.com>; stephen at networkplumber.org; paulmck at linux.ibm.com; Kovacevic, Marko
> <marko.kovacevic at intel.com>; dev at dpdk.org
> Cc: honnappa.nagarahalli at arm.com; gavin.hu at arm.com; dharmik.thakkar at arm.com; malvika.gupta at arm.com
> Subject: [PATCH v8 0/4] lib/rcu: add RCU library supporting QSBR mechanism
> 
> Lock-less data structures provide scalability and determinism.
> They enable use cases where locking may not be allowed
> (for ex: real-time applications).
> 
> In the following paragraphs, the term 'memory' refers to memory allocated
> by typical APIs like malloc, or anything that is representative of
> memory, for example an index into a free element array.
> 
> Since these data structures are lock-less, the writers and readers
> access them concurrently. Hence, while removing an element from a
> data structure, the writer cannot return the memory to the allocator
> without knowing that the readers are no longer referencing that
> element/memory. Hence, the operation of removing an element has to be
> separated into 2 steps:
> 
> Delete: in this step, the writer removes the reference to the element from
> the data structure but does not return the associated memory to the
> allocator. This will ensure that new readers will not get a reference to
> the removed element. Removing the reference is an atomic operation.
> 
> Free(Reclaim): in this step, the writer returns the memory to the
> memory allocator, only after knowing that all the readers have stopped
> referencing the deleted element.
> 
> This library helps the writer determine when it is safe to free the
> memory.
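> 
> A rough sketch of these two steps (the slot and element types are
> hypothetical; only the Delete/Free split is the point here):
> 
>     /* Delete: atomically clear the slot so that new readers can no
>      * longer obtain a reference to the element.
>      */
>     struct elem *e = __atomic_exchange_n(&slot, NULL, __ATOMIC_RELEASE);
> 
>     /* Free (Reclaim): only after the writer knows that no pre-existing
>      * reader still references the element (this is what this library
>      * helps determine).
>      */
>     free(e);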
> 
> This library makes use of thread Quiescent State (QS). QS can be
> defined as 'any point in the thread execution where the thread does
> not hold a reference to shared memory'. It is up to the application to
> determine its quiescent state. Let us consider the following diagram:
> 
>     Time -------------------------------------------------->
> 
>                                 |     |
>   RT1   $++++****D1****+++***D2*|**+++|+++**D3*****++++$
>                                 |     |
>   RT2          $++++****D1****++|+**D2|***++++++**D3*****++++$
>                                 |     |
>   RT3      $++++****D1****+++***|D2***|++++++**D2*****++++$
>                                 |     |
>                                 |<--->|
>                                Del | Free
>                                    |
>                               Cannot free memory
>                               during this period
>                               (Grace Period)
> 
> RTx - Reader thread
> $ - Start and end of while(1) loop
> ***Dx*** - Reader thread is accessing the shared data structure Dx.
>            i.e. critical section.
> +++ - Reader thread is not accessing any shared data structure.
>       i.e. non critical section or quiescent state.
> Del - Point in time when the reference to the entry is removed using
>       atomic operation.
> Free - Point in time when the writer can free the entry.
> Grace Period - Time duration between Del and Free, during which memory cannot
>                be freed.
> 
> As shown, thread RT1 accesses data structures D1, D2 and D3. When it is
> accessing D2, if the writer has to remove an element from D2, the
> writer cannot free the memory associated with that element immediately.
> The writer can return the memory to the allocator only after the reader
> stops referencing D2. In other words, reader thread RT1 has to enter
> a quiescent state.
> 
> Similarly, since thread RT3 is also accessing D2, the writer has to wait
> till RT3 enters a quiescent state as well.
> 
> However, the writer does not need to wait for RT2 to enter quiescent state.
> Thread RT2 was not accessing D2 when the delete operation happened.
> So, RT2 will not get a reference to the deleted entry.
> 
> Note that the critical sections for D2 and D3 are quiescent states
> for D1. i.e. for a given data structure Dx, any point in the thread
> execution that does not reference Dx is a quiescent state for Dx.
> 
> Since memory is not freed immediately, there might be a need for
> provisioning of additional memory, depending on the application requirements.
> 
> It is important to make sure that this library keeps the overhead of
> identifying the end of the grace period and the subsequent freeing of
> memory to a minimum. The following paragraphs explain how the grace
> period and critical section affect this overhead.
> 
> The writer has to poll the readers to identify the end of the grace period.
> Polling introduces memory accesses and wastes CPU cycles. The memory
> is not available for reuse during the grace period. Longer grace periods
> exacerbate these conditions.
> 
> The duration of the grace period is proportional to the length of the
> critical sections and the number of reader threads. Keeping the critical
> sections smaller will keep the grace period smaller. However, keeping the
> critical sections smaller requires additional CPU cycles (due to more
> frequent reporting) in the readers.
> 
> Hence, we need both a small grace period and large critical sections.
> This library addresses this by allowing the writer to do
> other work without having to block till the readers report their quiescent
> state.
> 
> For DPDK applications, the start and end of while(1) loop (where no
> references to shared data structures are kept) act as perfect quiescent
> states. This will combine all the shared data structure accesses into a
> single, large critical section which helps keep the overhead on the
> reader side to a minimum.
> 
> DPDK supports the pipeline model of packet processing and service cores.
> In these use cases, a given data structure may not be used by all the
> workers in the application. The writer does not have to wait for all
> the workers to report their quiescent state. To provide the required
> flexibility, this library has a concept of QS variable. The application
> can create one QS variable per data structure to help it track the
> end of grace period for each data structure. This helps keep the grace
> period to a minimum.
> 
> The application has to allocate memory and initialize a QS variable.
> 
> The application can call rte_rcu_qsbr_get_memsize to calculate the size
> of memory to allocate. This API takes the maximum number of reader threads
> that will use this variable as a parameter. Currently, a maximum of 1024
> threads is supported.
> 
> Further, the application can initialize a QS variable using the API
> rte_rcu_qsbr_init.
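> 
> A minimal sketch of this sequence (error handling is trimmed; the use of
> rte_zmalloc and the cache-line alignment are illustrative choices, not
> requirements of this library):
> 
>     #define MAX_READER_THREADS 128
> 
>     struct rte_rcu_qsbr *v;
>     size_t sz;
> 
>     /* Memory for the QS variable is allocated by the application */
>     sz = rte_rcu_qsbr_get_memsize(MAX_READER_THREADS);
>     v = rte_zmalloc("rcu_qsbr", sz, RTE_CACHE_LINE_SIZE);
>     if (v == NULL)
>         return -1;
> 
>     if (rte_rcu_qsbr_init(v, MAX_READER_THREADS) != 0)
>         return -1;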
> 
> Each reader thread is assumed to have a unique thread ID. Currently, the
> management of the thread ID (for ex: allocation/free) is left to the
> application. The thread ID should be in the range of 0 to
> maximum number of threads provided while creating the QS variable.
> The application could also use lcore_id as the thread ID where applicable.
> 
> rte_rcu_qsbr_thread_register API will register a reader thread
> to report its quiescent state. This can be called from a reader thread.
> A control plane thread can also call this on behalf of a reader thread.
> The reader thread must call rte_rcu_qsbr_thread_online API to start reporting
> its quiescent state.
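> 
> For example, on the reader side (a sketch; 'v' is the initialized QS
> variable and the lcore id is used as the thread ID here):
> 
>     unsigned int thread_id = rte_lcore_id();
> 
>     rte_rcu_qsbr_thread_register(v, thread_id);
>     /* Start reporting the quiescent state for this thread */
>     rte_rcu_qsbr_thread_online(v, thread_id);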
> 
> Some of the use cases might require the reader threads to make
> blocking API calls (for ex: while using eventdev APIs). The writer thread
> should not wait for such reader threads to enter quiescent state.
> The reader thread must call rte_rcu_qsbr_thread_offline API, before calling
> blocking APIs. It can call rte_rcu_qsbr_thread_online API once the blocking
> API call returns.
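> 
> A sketch of this pattern (do_blocking_work is a hypothetical placeholder
> for any blocking call, e.g. an eventdev dequeue that may block):
> 
>     /* Stop reporting the quiescent state before blocking */
>     rte_rcu_qsbr_thread_offline(v, thread_id);
> 
>     do_blocking_work();    /* hypothetical blocking call */
> 
>     /* Resume reporting the quiescent state */
>     rte_rcu_qsbr_thread_online(v, thread_id);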
> 
> The writer thread can trigger the reader threads to report their quiescent
> state by calling the API rte_rcu_qsbr_start. It is possible for multiple
> writer threads to query the quiescent state status simultaneously. Hence,
> rte_rcu_qsbr_start returns a token to each caller.
> 
> The writer thread has to call the rte_rcu_qsbr_check API with the token to
> get the current quiescent state status. An option to block till all the
> reader threads enter the quiescent state is provided. If this API indicates
> that all the reader threads have entered the quiescent state, the
> application can free the deleted entry.
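> 
> A minimal writer-side sketch (remove_element, free_element and
> do_other_work are hypothetical placeholders; only the start/check calls
> come from this library):
> 
>     uint64_t token;
> 
>     /* Delete: unlink the element from the lock-free data structure */
>     remove_element(d);                          /* hypothetical */
> 
>     /* Trigger the readers to report their quiescent state */
>     token = rte_rcu_qsbr_start(v);
> 
>     /* Poll without blocking; do useful work in between */
>     while (rte_rcu_qsbr_check(v, token, false) == 0)
>         do_other_work();                        /* hypothetical */
> 
>     /* Free (Reclaim): all readers have passed the quiescent state */
>     free_element(d);                            /* hypothetical */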
> 
> The APIs rte_rcu_qsbr_start and rte_rcu_qsbr_check are lock-free. Hence, they
> can be called concurrently from multiple writers even while running
> as worker threads.
> 
> The separation of triggering the reporting from querying the status gives
> the writer threads the flexibility to do useful work instead of blocking
> until the reader threads enter the quiescent state or go offline. This
> reduces the memory accesses due to continuous polling for the status.
> 
> rte_rcu_qsbr_synchronize API combines the functionality of rte_rcu_qsbr_start
> and blocking rte_rcu_qsbr_check into a single API. This API triggers the reader
> threads to report their quiescent state and polls till all the readers enter
> the quiescent state or go offline. This API does not allow the writer to
> do useful work while waiting and also introduces additional memory accesses
> due to continuous polling.
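> 
> For example (a sketch; remove_element and free_element are hypothetical,
> and RTE_QSBR_THRID_INVALID is passed assuming the calling writer is not
> itself a registered reader of this QS variable):
> 
>     /* Delete */
>     remove_element(d);                          /* hypothetical */
> 
>     /* Start + blocking check combined: returns only after all the
>      * registered readers have entered the quiescent state or gone
>      * offline.
>      */
>     rte_rcu_qsbr_synchronize(v, RTE_QSBR_THRID_INVALID);
> 
>     /* Free (Reclaim) */
>     free_element(d);                            /* hypothetical */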
> 
> The reader thread must call rte_rcu_qsbr_thread_offline and
> rte_rcu_qsbr_thread_unregister APIs to remove itself from reporting its
> quiescent state. The rte_rcu_qsbr_check API will not wait for this reader
> thread to report the quiescent state status anymore.
> 
> The reader threads should call the rte_rcu_qsbr_quiescent API to indicate
> that they have entered a quiescent state. This API checks if a writer has
> triggered a quiescent state query and updates the state accordingly.
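> 
> A sketch of a reader's packet processing loop (process_packets is a
> hypothetical placeholder for the application's critical section):
> 
>     while (1) {
>         /* Critical section: references to shared structures are held */
>         process_packets();                      /* hypothetical */
> 
>         /* Quiescent state: no references held; report it */
>         rte_rcu_qsbr_quiescent(v, thread_id);
>     }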
> 
> Patch v8:
> 1) Library changes
>    a) Symbols prefixed with '__RTE' or 'rte_' as required (Thomas)
>    b) Used PRI?64 macros to support 32b compilation (Thomas)
>    c) Fixed shared library compilation (Thomas)
> 2) Test cases
>    a) Fixed segmentation fault when more than 20 cores are used for testing (Jerin)
>    b) Used PRI?64 macros to support 32b compilation (Thomas)
>    c) Testing done on x86, ThunderX2, Octeon TX, BlueField for 32b(x86 only)/64b,
>       debug/non-debug, shared/static linking, meson/makefile with various
>       number of cores
> 
> Patch v7:
> 1) Library changes
>    a) Added macro RCU_IS_LOCK_CNT_ZERO
>    b) Added lock counter validation to rte_rcu_qsbr_thread_online/
>       rte_rcu_qsbr_thread_offline/rte_rcu_qsbr_thread_register/
>       rte_rcu_qsbr_thread_unregister APIs (Paul)
> 
> Patch v6:
> 1) Library changes
>    a) Fixed and tested meson build on Arm and x86 (Konstantin)
>    b) Moved rte_rcu_qsbr_synchronize API to rte_rcu_qsbr.c
> 
> Patch v5:
> 1) Library changes
>    a) Removed extra alignment in rte_rcu_qsbr_get_memsize API (Paul)
>    b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock APIs (Paul)
>    c) Clarified the need for 64b counters (Paul)
> 2) Test cases
>    a) Added additional performance test cases to benchmark
>       __rcu_qsbr_check_all
>    b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock calls in various test cases
> 3) Documentation
>    a) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock usage description
> 
> Patch v4:
> 1) Library changes
>    a) Fixed the compilation issue on x86 (Konstantin)
>    b) Rebased with latest master
> 
> Patch v3:
> 1) Library changes
>    a) Moved the registered thread ID array to the end of the
>       structure (Konstantin)
>    b) Removed the compile time constant RTE_RCU_MAX_THREADS
>    c) Added code to keep track of registered number of threads
> 
> Patch v2:
> 1) Library changes
>    a) Corrected the RTE_ASSERT checks (Konstantin)
>    b) Replaced RTE_ASSERT with 'if' checks for non-datapath APIs (Konstantin)
>    c) Made rte_rcu_qsbr_thread_register/unregister non-datapath critical APIs
>    d) Renamed rte_rcu_qsbr_update to rte_rcu_qsbr_quiescent (Ola)
>    e) Used rte_smp_mb() in rte_rcu_qsbr_thread_online API for x86 (Konstantin)
>    f) Removed the macro to access the thread QS counters (Konstantin)
> 2) Test cases
>    a) Added additional test cases for removing RTE_ASSERT
> 3) Documentation
>    a) Changed the figure to make it bigger (Marko)
>    b) Spelling and format corrections (Marko)
> 
> Patch v1:
> 1) Library changes
>    a) Changed the maximum number of reader threads to 1024
>    b) Renamed rte_rcu_qsbr_register/unregister_thread to
>       rte_rcu_qsbr_thread_register/unregister
>    c) Added rte_rcu_qsbr_thread_online/offline API. These are optimized
>       version of rte_rcu_qsbr_thread_register/unregister API. These
>       also provide the flexibility for performance when the requested
>       maximum number of threads is higher than the current number of
>       threads.
>    d) Corrected memory orderings in rte_rcu_qsbr_update
>    e) Changed the signature of rte_rcu_qsbr_start API to return the token
>    f) Changed the signature of rte_rcu_qsbr_start API to not take the
>       expected number of QS states to wait.
>    g) Added debug logs
>    h) Added API and programmer guide documentation.
> 
> RFC v3:
> 1) Library changes
>    a) Rebased to latest master
>    b) Added new API rte_rcu_qsbr_get_memsize
>    c) Add support for memory allocation for QSBR variable (Konstantin)
>    d) Fixed a bug in rte_rcu_qsbr_check (Konstantin)
> 2) Testcase changes
>    a) Separated stress tests into a performance test case file
>    b) Added performance statistics
> 
> RFC v2:
> 1) Cover letter changes
>    a) Explain the parameters that affect the overhead of using RCU
>       and their effect
>    b) Explain how this library addresses these effects to keep
>       the overhead to minimum
> 2) Library changes
>    a) Rename the library to avoid confusion (Matias, Bruce, Konstantin)
>    b) Simplify the code/remove APIs to keep this library inline with
>       other synchronisation mechanisms like locks (Konstantin)
>    c) Change the design to support more than 64 threads (Konstantin)
>    d) Fixed version map to remove static inline functions
> 3) Testcase changes
>    a) Add boundary and additional functional test cases
>    b) Add stress test cases (Paul E. McKenney)
> 
> Dharmik Thakkar (1):
>   test/rcu_qsbr: add API and functional tests
> 
> Honnappa Nagarahalli (3):
>   rcu: add RCU library supporting QSBR mechanism
>   doc/rcu: add lib_rcu documentation
>   doc: added RCU to the release notes
> 
>  MAINTAINERS                                   |    5 +
>  app/test/Makefile                             |    2 +
>  app/test/autotest_data.py                     |   12 +
>  app/test/meson.build                          |    7 +-
>  app/test/test_rcu_qsbr.c                      | 1014 +++++++++++++++++
>  app/test/test_rcu_qsbr_perf.c                 |  704 ++++++++++++
>  config/common_base                            |    6 +
>  doc/api/doxy-api-index.md                     |    3 +-
>  doc/api/doxy-api.conf.in                      |    1 +
>  .../prog_guide/img/rcu_general_info.svg       |  509 +++++++++
>  doc/guides/prog_guide/index.rst               |    1 +
>  doc/guides/prog_guide/rcu_lib.rst             |  185 +++
>  doc/guides/rel_notes/release_19_05.rst        |    8 +
>  lib/Makefile                                  |    2 +
>  lib/librte_rcu/Makefile                       |   23 +
>  lib/librte_rcu/meson.build                    |    7 +
>  lib/librte_rcu/rte_rcu_qsbr.c                 |  277 +++++
>  lib/librte_rcu/rte_rcu_qsbr.h                 |  641 +++++++++++
>  lib/librte_rcu/rte_rcu_version.map            |   13 +
>  lib/meson.build                               |    2 +-
>  mk/rte.app.mk                                 |    1 +
>  21 files changed, 3420 insertions(+), 3 deletions(-)
>  create mode 100644 app/test/test_rcu_qsbr.c
>  create mode 100644 app/test/test_rcu_qsbr_perf.c
>  create mode 100644 doc/guides/prog_guide/img/rcu_general_info.svg
>  create mode 100644 doc/guides/prog_guide/rcu_lib.rst
>  create mode 100644 lib/librte_rcu/Makefile
>  create mode 100644 lib/librte_rcu/meson.build
>  create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c
>  create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h
>  create mode 100644 lib/librte_rcu/rte_rcu_version.map
> 
> --

Ran UT on my box (SKX) for both x86_64 and i686 over 96 cores.
All passed.
Tested-by: Konstantin Ananyev <konstantin.ananyev at intel.com>

> 2.17.1


