[RFC] ring: improve ring performance with C11 atomics
Wathsala Vithanage
wathsala.vithanage at arm.com
Fri Apr 21 21:16:42 CEST 2023
Tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed to __ATOMIC_RELAXED from __ATOMIC_ACQUIRE.
Because to calculate the addresses of the dequeue
elements __rte_ring_dequeue_elems uses the old_head updated by the
__atomic_compare_exchange_n intrinsic used in
__rte_ring_move_prod_head. This results in an address dependency
between the two operations. Therefore __rte_ring_dequeue_elems cannot
happen before __rte_ring_move_prod_head.
Similarly __rte_ring_enqueue_elems and __rte_ring_move_cons_head
won't be reordered either.
Performance on Arm N1
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 (Arm N1) |
+-------------------------------------------------------------------+
| Generic | C11 atomics | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686 | Total count: 812125 |
| | Gain: -15% | Gain: 6% |
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 (Arm N1) |
+-------------------------------------------------------------------+
| Generic | C11 atomics | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385 | Total count: 830935 |
| | Gain: -21% | Gain: 2% |
+-------------------------------------------------------------------+
Performance on x86-64 Cascade Lake
Gain relative to generic implementation
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 8 |
+-------------------------------------------------------------------+
| Generic | C11 atomics | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995 | Total count: 182791 |
| | Gain: 0.2% | Gain: 0.6%
+-------------------------------------------------------------------+
+-------------------------------------------------------------------+
| Bulk enq/dequeue count on size 32 |
+-------------------------------------------------------------------+
| Generic | C11 atomics | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536 | Total count: 163190 |
| | Gain: -3.5% | Gain: -2.6% |
+-------------------------------------------------------------------+
Signed-off-by: Wathsala Vithanage <wathsala.vithanage at arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com>
Reviewed-by: Feifei Wang <feifei.wang2 at arm.com>
---
.mailmap | 1 +
lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans at corigine.com>
Wang Sheng-Hui <shhuiw at gmail.com>
Wangyu (Eric) <seven.wangyu at huawei.com>
Waterman Cao <waterman.cao at intel.com>
+Wathsala Vithanage <wathsala.vithanage at arm.com>
Weichun Chen <weichunx.chen at intel.com>
Wei Dai <wei.dai at intel.com>
Weifeng Li <liweifeng96 at 126.com>
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
if (!single)
rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
+ /*
+ * Updating of ht->tail cannot happen before elements are added to or
+ * removed from the ring, as it could result in data races between
+ * producer and consumer threads. Therefore ht->tail should be updated
+ * with release semantics to prevent ring data copy phase from sinking
+ * below it.
+ */
__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
@@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
/* Ensure the head is read before tail */
__atomic_thread_fence(__ATOMIC_ACQUIRE);
- /* load-acquire synchronize with store-release of ht->tail
- * in update_tail.
- */
cons_tail = __atomic_load_n(&r->cons.tail,
- __ATOMIC_ACQUIRE);
+ __ATOMIC_RELAXED);
/* The subtraction is done between two unsigned 32bits value
* (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
/* Ensure the head is read before tail */
__atomic_thread_fence(__ATOMIC_ACQUIRE);
- /* this load-acquire synchronize with store-release of ht->tail
- * in update_tail.
- */
prod_tail = __atomic_load_n(&r->prod.tail,
- __ATOMIC_ACQUIRE);
-
+ __ATOMIC_RELAXED);
/* The subtraction is done between two unsigned 32bits value
* (the result is always modulo 32 bits even if we have
* cons_head > prod_tail). So 'entries' is always between 0
--
2.25.1
More information about the dev
mailing list