[dpdk-dev] [PATCH v7 00/10] New sync modes for ring

David Marchand david.marchand@redhat.com
Tue Apr 21 13:31:39 CEST 2020


On Mon, Apr 20, 2020 at 2:28 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
> These days more and more customers use (or try to use) DPDK-based apps within
> overcommitted systems (multiple active threads over the same physical cores):
> VM, container deployments, etc.
> One quite common problem they hit:
> Lock-Holder-Preemption/Lock-Waiter-Preemption with rte_ring.
> LHP is quite a common problem for spin-based sync primitives
> (spin-locks, etc.) on overcommitted systems.
> The situation gets much worse when some sort of
> fair-locking technique is used (ticket-lock, etc.),
> as then not only the lock-owner's but also the lock-waiters'
> scheduling order matters a lot (LWP).
> These two problems are well known for kernels running within VMs:
> http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> The problem with rte_ring is that while head acquisition is a sort of
> unfair locking, waiting on the tail is very similar to the ticket-lock scheme -
> the tail has to be updated in a particular order.
> That makes the current rte_ring implementation perform
> really poorly in some overcommitted scenarios.
> It is probably not possible to completely resolve LHP problem in
> userspace only (without some kernel communication/intervention).
> But removing fairness at tail update helps to avoid LWP and
> can mitigate the situation significantly.
> This patch proposes two new optional ring synchronization modes:
> 1) Head/Tail Sync (HTS) mode
> In this mode the enqueue/dequeue operation is fully serialized:
>     only one thread at a time is allowed to perform a given op.
>     As another enhancement, it provides the ability to split an
>     enqueue/dequeue operation into two phases:
>       - enqueue/dequeue start
>       - enqueue/dequeue finish
>     That allows the user to inspect objects in the ring without removing
>     them from it (aka MT-safe peek); a usage sketch is shown below.
> 2) Relaxed Tail Sync (RTS) mode
> The main difference from the original MP/MC algorithm is that
> the tail value is increased not by every thread that finished enqueue/dequeue,
> but only by the last one.
> That allows threads to avoid spinning on the ring tail value,
> leaving the actual tail value change to the last thread in the update queue.
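To illustrate the RTS idea, here is a minimal conceptual model in C11 atomics (a
sketch only, not the code from this series; names such as rts_move_head() and
rts_update_tail() are made up, and the free-space/back-pressure checks are
omitted). Head and tail each carry an operation counter next to the ring
position, and only the thread whose finish makes the tail counter catch up with
the head counter publishes the new tail position:

#include <stdint.h>
#include <stdatomic.h>

/* pos/cnt packed into one 64-bit word so head and tail can each be
 * updated with a single CAS */
union poscnt {
	uint64_t raw;
	struct {
		uint32_t cnt;	/* number of started/finished ops */
		uint32_t pos;	/* ring position */
	} val;
};

struct rts_headtail {
	_Atomic uint64_t head;	/* pos: next free slot, cnt: ops started  */
	_Atomic uint64_t tail;	/* pos: last committed, cnt: ops finished */
};

/* start phase: move head forward by n slots, count one more op in flight */
static uint32_t
rts_move_head(struct rts_headtail *ht, uint32_t n)
{
	union poscnt oh, nh;

	oh.raw = atomic_load(&ht->head);
	do {
		nh.val.pos = oh.val.pos + n;
		nh.val.cnt = oh.val.cnt + 1;
	} while (!atomic_compare_exchange_weak(&ht->head, &oh.raw, nh.raw));

	return oh.val.pos;	/* caller fills/reads slots [pos, pos + n) */
}

/* finish phase: nobody spins on the tail position; only the last op in
 * flight (tail.cnt catching up with head.cnt) moves the tail position */
static void
rts_update_tail(struct rts_headtail *ht)
{
	union poscnt h, ot, nt;

	ot.raw = atomic_load(&ht->tail);
	do {
		h.raw = atomic_load(&ht->head);
		nt = ot;
		if (++nt.val.cnt == h.val.cnt)
			nt.val.pos = h.val.pos;
	} while (!atomic_compare_exchange_weak(&ht->tail, &ot.raw, nt.raw));
}

The point of the model is only to show why a preempted thread can no longer
block the tail from moving: a finishing thread never waits for its predecessors.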
>
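Regarding the MT-safe peek mentioned under mode 1) above, a minimal usage
sketch, assuming the two-phase helpers proposed by this series
(rte_ring_dequeue_bulk_start()/rte_ring_dequeue_finish() from rte_ring_peek.h;
exact names may differ in the applied version, and object_is_wanted() is a
hypothetical application predicate):

#include <rte_ring.h>
#include <rte_ring_peek.h>

/* hypothetical application predicate, for illustration only */
static int
object_is_wanted(const void *obj)
{
	return obj != NULL;
}

static void
peek_one(struct rte_ring *r)
{
	void *obj;
	unsigned int n, avail;

	/* phase 1: reserve one object but leave it in the ring; the ring
	 * must use HTS (or single-consumer) sync mode on the dequeue side */
	n = rte_ring_dequeue_bulk_start(r, &obj, 1, &avail);
	if (n == 0)
		return;

	if (object_is_wanted(obj))
		rte_ring_dequeue_finish(r, n);	/* phase 2: actually remove it */
	else
		rte_ring_dequeue_finish(r, 0);	/* phase 2: keep it in the ring */
}

Between start and finish no other thread can complete a dequeue, which is what
makes the inspection safe in a multi-consumer setup.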
> Note that these new sync modes are optional.
> For current rte_ring users nothing should change
> (both in terms of API/ABI and performance).
> The existing sync modes MP/MC and SP/SC are kept untouched, set up in the same
> way (via flags and _init_), and MP/MC remains the default one.
> The only thing that changes is that the format of prod/cons may now differ
> depending on the mode selected at _init_,
> so the user has to stick with one sync model through the whole ring lifetime.
> In other words, the user can't create a ring for, let's say, SP mode and then
> in the middle of the data path change his mind and start using MP_RTS mode.
> For the existing modes (SP/MP, SC/MC) the format remains the same and
> the user can still use them interchangeably, though of course it is an
> error-prone practice.
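As an aside, mode selection at _init_ with the flags added by this series would
look roughly like the sketch below (RING_F_MP_RTS_ENQ/RING_F_MC_RTS_DEQ for RTS,
RING_F_MP_HTS_ENQ/RING_F_MC_HTS_DEQ for HTS); the regular enqueue/dequeue calls
on the data path stay the same, only the algorithm behind them changes:

#include <rte_ring.h>
#include <rte_lcore.h>

static struct rte_ring *
create_rts_ring(void)
{
	/* the sync mode chosen here is fixed for the whole ring lifetime:
	 * multi-producer RTS enqueue, multi-consumer RTS dequeue */
	return rte_ring_create("rts_ring", 1024, rte_socket_id(),
			RING_F_MP_RTS_ENQ | RING_F_MC_RTS_DEQ);
}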
>
> Test results on IA (see below) show significant improvements
> for average enqueue/dequeue op times on overcommitted systems.
> For 'classic' DPDK deployments (one thread per core) the original MP/MC
> algorithm still shows the best numbers, though for 64-bit targets
> the RTS numbers are not that far away.
> Numbers were produced by the new UT test case ring_stress_autotest, i.e.:
> echo ring_stress_autotest | ./dpdk-test -n 4 --lcores='...'
>
> X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> DEQ+ENQ average cycles/obj
>                                                   MP/MC      HTS     RTS
> 1thread@1core(--lcores=6-7)                       8.00       8.15    8.99
> 2thread@2core(--lcores=6-8)                       19.14      19.61   20.35
> 4thread@4core(--lcores=6-10)                      29.43      29.79   31.82
> 8thread@8core(--lcores=6-14)                      110.59     192.81  119.50
> 16thread@16core(--lcores=6-22)                    461.03     813.12  495.59
> 32thread@32core(--lcores='6-22,55-70')            982.90     1972.38 1160.51
>
> 2thread@1core(--lcores='6,(10-11)@7')             20140.50   23.58   25.14
> 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8')   153680.60  76.88   80.05
> 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')   280314.32  294.72  318.79
> 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')  643176.59  1144.02 1175.14
> 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8')  4264238.80 4627.48 4892.68
>
> 8thread@2core(--lcores='6,(10-17)@(7,8)')         321085.98  298.59  307.47
> 16thread@4core(--lcores='6,(20-35)@(7-10)')       1900705.61 575.35  678.29
> 32thread@4core(--lcores='6,(20-51)@(7-10)')       5510445.85 2164.36 2714.12
>
> i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> DEQ+ENQ average cycles/obj
>                                                   MP/MC      HTS     RTS
> 1thread@1core(--lcores=6-7)                       7.85       12.13   11.31
> 2thread@2core(--lcores=6-8)                       17.89      24.52   21.86
> 8thread@8core(--lcores=6-14)                      32.58      354.20  54.58
> 32thread@32core(--lcores='6-22,55-70')            813.77     6072.41 2169.91
>
> 2thread@1core(--lcores='6,(10-11)@7')             16095.00   36.06   34.74
> 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')   1140354.54 346.61  361.57
> 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')  1920417.86 1314.90 1416.65
>
> 8thread@2core(--lcores='6,(10-17)@(7,8)')         594358.61  332.70  357.74
> 32thread@4core(--lcores='6,(20-51)@(7-10)')       5319896.86 2836.44 3028.87

I fixed a couple of typos and split the doc updates.

Series applied with the patch from Pavan.
Thanks for the work Konstantin, Honnappa.


-- 
David Marchand


