[dpdk-dev] [RFC 0/6] New sync modes for ring
Honnappa Nagarahalli
Honnappa.Nagarahalli at arm.com
Mon Mar 30 23:29:36 CEST 2020
<snip>
> >
> > > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> > >
> > > Upfront note - this RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update the ring_elem
> > > code properly,
> > As per the current rules, these changes (in the current form) will be
> > accepted only for the 20.11 release. How do we address this for immediate
> > requirements like RCU defer APIs?
>
> I think I found a way to introduce these new modes without API/ABI breakage.
> Working on v1 right now. Plan to submit it by the end of this week/start of
> next one.
ok
>
> > I suggest that we move forward with my RFC (taking into consideration your
> > feedback) to make progress on RCU APIs.
> >
> > > etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek an initial feedback about proposed ideas.
> > > Would also ask people to repeat performance tests (see below) on
> > > their platforms to confirm the impact.
> > >
> > > More and more customers use(/try to use) DPDK-based apps within
> > > overcommitted systems (multiple active threads over the same physical
> > > cores): VM, container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption (LHP) with
> > > rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of fair-locking
> > > technique is used (ticket-lock, etc.),
> > > as now not only the lock-owner's but also the lock-waiters' scheduling
> > > order matters a lot.
> > > This is a well-known problem for kernel within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > The problem with rte_ring is that while head acquisition is a sort of
> > > unfair locking, waiting on the tail is very similar to a ticket-lock
> > > scheme - the tail has to be updated in a particular order.
> > > That makes the current rte_ring implementation perform really poorly
> > > in some overcommitted scenarios.
> > > While it is probably not possible to completely resolve this problem
> > > in userspace only (without some kernel communication/intervention),
> > > removing fairness in tail update can mitigate it significantly.
> > > So this RFC proposes two new optional ring synchronization modes:
> > > 1) Head/Tail Sync (HTS) mode
> > > In that mode the enqueue/dequeue operation is fully serialized:
> > > only one thread at a time is allowed to perform a given op.
> > > As another enhancement, provide the ability to split the
> > > enqueue/dequeue operation into two phases:
> > > - enqueue/dequeue start
> > > - enqueue/dequeue finish
> > > That allows the user to inspect objects in the ring without removing
> > > them from it (aka MT-safe peek), as sketched below.
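[For illustration, a minimal sketch of how such a two-phase dequeue could be
used as an MT-safe peek. The start/finish function names and the helpers
object_is_ready()/process_obj() are assumptions made for this example, based
on the description above, not a committed API.]

#include <errno.h>
#include <stdbool.h>

#include <rte_ring.h>

extern bool object_is_ready(void *obj);   /* hypothetical app helper */
extern void process_obj(void *obj);       /* hypothetical app helper */

static int
peek_one(struct rte_ring *r)
{
	void *obj;
	unsigned int n, avail;

	/* phase 1: reserve one object for inspection (MT-safe under HTS) */
	n = rte_ring_dequeue_bulk_start(r, &obj, 1, &avail);
	if (n == 0)
		return -ENOENT;

	if (object_is_ready(obj)) {
		/* phase 2: really remove the object from the ring */
		rte_ring_dequeue_finish(r, n);
		process_obj(obj);
	} else {
		/* phase 2: abort, leave the object in the ring */
		rte_ring_dequeue_finish(r, 0);
	}
	return 0;
}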
> > IMO, this will not address the problem described above.
>
> It does, please see the results produced by ring_stress_*autotest below.
> Let's say for test-case 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8' it
I had not looked at these tests. Please see the numbers below.
> shows:
> avg number of cycles per object for enqueue/dequeue:
> MP/MC: 280314.32
> HTS: 294.72
> RTS: 318.79
>
> A customer who tried it reported a similar level of improvement.
Is this tested with the VM/Container setup described in the slides you referred to?
> Actually, if you have time, it would be very interesting to see what the
> numbers will be on ARM boxes.
> To reproduce, just:
> $cat ring_tests_u4
> ring_stress_autotest
> ring_stress_hts_autotest
> ring_stress_rts_autotest
>
> /app/test/dpdk-test --lcores='6,(10-13)@7,(20-23)@8' -n 4 < ring_tests_u4
> 2>&1 | tee res1
>
> Then look at the ' AGGREGATE' stats.
> Right now it is a bit too verbose, so probably the easiest way to extract the
> same numbers quickly is:
> grep 'cycles/obj' res1 | cat -n | awk '{if ($(1)%9==0) print $(NF);}'
> 280314.32
> 1057833.55
> 294.72
> 480.10
> 318.79
> 461.52
>
> First 2 numbers will be for MP/MC, next 2 for HTS, last 2 for RTS.
12305.05
12027.09
3.59
7.37
4.41
7.98
>
> > For example, when a producer updates the head and gets scheduled out, other
> > producers have to spin.
>
> Sure, as I wrote in the original cover letter:
> " While it is probably not possible to completely resolve this problem in
> userspace only (without some kernel communication/intervention), removing
> fairness in tail update can mitigate it significantly."
> Results from the ring_stress_*_autotest confirm that.
>
> > The problem is probably worse, as in the non-HTS case the moving of the
> > head and the copying of the ring elements can happen in parallel between
> > the producers (similarly for consumers).
>
> Yes, as we serialize the ring, we remove the possibility of simultaneous copy.
> That's why for 'normal' cases (one thread per core) the original MP/MC is
> usually faster.
> Though in overcommitted scenarios current MP/MC performance degrades
> dramatically.
> The main problem with the current MP/MC implementation is that the tail
> update has to be done in strict order (a sort of fair locking scheme).
> Which means that we get a much worse LHP manifestation than when
> we use unfair schemes.
> With a serialized ring (HTS) we remove that ordering completely (same idea
> as the switch from fair to unfair locking for PV spin-locks).
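[For reference, a simplified model of the ordered tail update in the current
MP/MC scheme - a sketch only, not the exact rte_ring code - showing why a
preempted predecessor stalls every thread behind it.]

#include <stdatomic.h>
#include <stdint.h>

/*
 * Simplified model of the current MP/MC tail update: every thread has to
 * wait until the tail reaches the old head value it captured, i.e. until
 * all earlier enqueuers have completed their copies.  If one of those
 * earlier threads is preempted while holding its "ticket", every thread
 * behind it keeps spinning.
 */
static void
mpmc_update_tail(_Atomic uint32_t *tail, uint32_t old_head, uint32_t new_head)
{
	/* wait for the threads ahead of us, strictly in order */
	while (atomic_load_explicit(tail, memory_order_acquire) != old_head)
		;	/* rte_pause() in the real implementation */

	atomic_store_explicit(tail, new_head, memory_order_release);
}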
>
> > IMO, HTS should not be a configurable flag.
>
> Why?
>
> > In RCU requirement, a MP enqueue and HTS dequeue are required.
>
> This is supported; the user can specify different modes for consumer and
> producer:
> (0 | RING_F_MC_HTS_DEQ).
> Then it is up to the user either to call the generic
> rte_ring_enqueue/rte_ring_dequeue,
> or to specify the mode manually by function name:
> rte_ring_mp_enqueue_bulk/rte_ring_hts_dequeue_bulk.
Ok, that should be good.
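[As a hedged example of the above - the flag name follows this RFC's proposal
and may change in the final version - a ring with an MP producer and an HTS
consumer could be created like this:]

#include <rte_ring.h>

/*
 * RING_F_MC_HTS_DEQ (as proposed in this RFC) selects the HTS consumer
 * mode, while the producer keeps the default MP behaviour.
 */
static struct rte_ring *
create_mp_prod_hts_cons_ring(void)
{
	return rte_ring_create("rcu_defer_ring", 1024, SOCKET_ID_ANY,
			       0 | RING_F_MC_HTS_DEQ);
}

[Enqueue/dequeue can then go through the generic rte_ring_enqueue()/
rte_ring_dequeue(), or through the mode-specific variants mentioned above.]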
>
> >
> > > 2) Relaxed Tail Sync (RTS)
> > > The main difference from the original MP/MC algorithm is that the tail
> > > value is increased not by every thread that finished enqueue/dequeue,
> > > but only by the last one.
> > > That allows threads to avoid spinning on the ring tail value, leaving
> > > the actual tail value change to the last thread in the update queue.
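[A conceptual sketch of the relaxed tail update - an assumption-based model,
not the patch's code: each finishing thread only bumps a completion counter,
and whichever thread turns out to be the last outstanding one moves the tail
position forward in a single CAS, so nobody spins waiting for predecessors.]

#include <stdatomic.h>
#include <stdint.h>

/*
 * Conceptual model only: the tail carries both a position and a count of
 * completed operations.  A finishing thread always bumps the count; only
 * the thread whose completion makes the count catch up with the head's
 * count also publishes the new position.  'head' is shown as a snapshot
 * here for simplicity; real code would re-read it inside the loop.
 */
union poscnt {
	uint64_t raw;
	struct {
		uint32_t cnt;	/* number of completed enq/deq ops */
		uint32_t pos;	/* ring position */
	} val;
};

static void
rts_update_tail(_Atomic uint64_t *tail, const union poscnt *head)
{
	union poscnt ot, nt;

	ot.raw = atomic_load_explicit(tail, memory_order_acquire);
	do {
		nt = ot;
		nt.val.cnt++;
		/* last outstanding thread: move the tail position as well */
		if (nt.val.cnt == head->val.cnt)
			nt.val.pos = head->val.pos;
	} while (!atomic_compare_exchange_weak_explicit(tail, &ot.raw, nt.raw,
			memory_order_release, memory_order_acquire));
}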
> > This can be a configurable flag on the ring.
> > I am not sure how this solves the problem you have stated above
> > completely. Updating the count from all intermediate threads is still
> > required to update the value of the head. But yes, it reduces the severity of
> the problem by not enforcing the order in which the tail is updated.
>
> As I said above, the main source of slowdown here is that we have to update
> the tail in a particular order.
> So the main objective (same as for HTS) is to remove that ordering.
>
> > I also think it introduces a problem on the other side of the ring
> > because the tail is not updated soon enough (the other side has to wait
> > longer for the elements to become available).
>
> Yes, producer/consumer starvation.
> That's why we need a max allowed Head-Tail-Distance (htd_max) - to limit
> how far the head can go away from the tail.
>
> > It also introduces another configuration parameter (HTD_MAX_DEF) which
> > they have to deal with.
>
> If the user doesn't provide any value, it will be set by default to
> ring.capacity / 8.
> From my measurements it works quite well.
> Though there is a possibility for the user to set another value, if needed.
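[To illustrate the htd_max idea - a sketch under stated assumptions, not the
patch code: before moving the head, an RTS producer/consumer checks that the
head would not run more than htd_max slots ahead of the tail, which bounds
how long the other side can be starved.]

#include <stdbool.h>
#include <stdint.h>

/*
 * Conceptual check only: with unsigned wrap-around arithmetic,
 * (head + n - tail) is the number of slots the head would run ahead of
 * the tail.  Keeping it within htd_max (default: capacity / 8) bounds
 * producer/consumer starvation.
 */
static bool
head_may_advance(uint32_t head, uint32_t tail, uint32_t n, uint32_t htd_max)
{
	return (uint32_t)(head + n - tail) <= htd_max;
}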
>
> > Users have to still implement the current hypervisor related solutions.
>
> Didn't get what you are trying to say with that phrase.
The references you provided talked about resolving LHP by doing co-scheduling of vCPUs (which I think could be applied to DPDK applications). I am saying that we still need such mechanisms along with these solutions.
>
> > IMO, we should run the benchmark for this on an overcommitted setup to
> > understand the benefits.
>
> That's why I created the ring_stress_*autotest test-cases and collected the
> numbers provided below.
> I suppose they clearly show the problem in overcommitted scenarios, and
> how RTS/HTS improve that situation.
> Would appreciate it if you could repeat these tests on your machines.
>
> >
> > >
> > > Test results on IA (see below) show significant improvements for
> > > average enqueue/dequeue op times on overcommitted systems.
> > > For 'classic' DPDK deployments (one thread per core) the original MP/MC
> > > algorithm still shows the best numbers, though for the 64-bit target RTS
> > > numbers are not that far away.
> > > Numbers were produced by ring_stress_*autotest (first patch in this
> > > series).
> > >
> > > X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                  MP/MC      HTS      RTS
> > > 1thread@1core(--lcores=6-7)                      8.00       8.15     8.99
> > > 2thread@2core(--lcores=6-8)                      19.14      19.61    20.35
> > > 4thread@4core(--lcores=6-10)                     29.43      29.79    31.82
> > > 8thread@8core(--lcores=6-14)                     110.59     192.81   119.50
> > > 16thread@16core(--lcores=6-22)                   461.03     813.12   495.59
> > > 32thread@32core(--lcores='6-22,55-70')           982.90     1972.38  1160.51
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7'             20140.50   23.58    25.14
> > > 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'   153680.60  76.88    80.05
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'   280314.32  294.72   318.79
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8'  643176.59  1144.02  1175.14
> > > 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8'  4264238.80 4627.48  4892.68
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8))'        321085.98  298.59   307.47
> > > 16thread@4core(--lcores='6,(20-35)@(7-10))'      1900705.61 575.35   678.29
> > > 32thread@4core(--lcores='6,(20-51)@(7-10))'      5510445.85 2164.36  2714.12
> > >
> > > i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                  MP/MC      HTS      RTS
> > > 1thread@1core(--lcores=6-7)                      7.85       12.13    11.31
> > > 2thread@2core(--lcores=6-8)                      17.89      24.52    21.86
> > > 8thread@8core(--lcores=6-14)                     32.58      354.20   54.58
> > > 32thread@32core(--lcores='6-22,55-70')           813.77     6072.41  2169.91
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7'             16095.00   36.06    34.74
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'   1140354.54 346.61   361.57
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8'  1920417.86 1314.90  1416.65
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8))'        594358.61  332.70   357.74
> > > 32thread@4core(--lcores='6,(20-51)@(7-10))'      5319896.86 2836.44  3028.87
> > >
> > > Konstantin Ananyev (6):
> > > test/ring: add contention stress test
> > > ring: rework ring layout to allow new sync schemes
> > > ring: introduce RTS ring mode
> > > test/ring: add contention stress test for RTS ring
> > > ring: introduce HTS ring mode
> > > test/ring: add contention stress test for HTS ring
> > >
> > > app/test/Makefile | 3 +
> > > app/test/meson.build | 3 +
> > > app/test/test_pdump.c | 6 +-
> > > app/test/test_ring_hts_stress.c | 28 ++
> > > app/test/test_ring_rts_stress.c | 28 ++
> > > app/test/test_ring_stress.c | 27 ++
> > > app/test/test_ring_stress.h | 477 +++++++++++++++++++
> > > lib/librte_pdump/rte_pdump.c | 2 +-
> > > lib/librte_port/rte_port_ring.c | 12 +-
> > > lib/librte_ring/Makefile | 4 +-
> > > lib/librte_ring/meson.build | 4 +-
> > > lib/librte_ring/rte_ring.c | 84 +++-
> > > lib/librte_ring/rte_ring.h | 619 +++++++++++++++++++++++--
> > > lib/librte_ring/rte_ring_elem.h | 8 +-
> > > lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> > > lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
> > >  16 files changed, 1721 insertions(+), 52 deletions(-)
> > >  create mode 100644 app/test/test_ring_hts_stress.c
> > >  create mode 100644 app/test/test_ring_rts_stress.c
> > >  create mode 100644 app/test/test_ring_stress.c
> > >  create mode 100644 app/test/test_ring_stress.h
> > >  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
> > >  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> > >
> > > --
> > > 2.17.1