[dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Mon Oct 21 02:27:53 CEST 2019


> > >
> > > > Subject: Re: [PATCH v4 1/2] lib/ring: apis to support configurable
> > > > element size
> > > >
> > > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results
> > > > >>> are as follows. The numbers in brackets are with the code on master.
> > > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > > >>>
> > > > >>> RTE>>ring_perf_elem_autotest
> > > > >>> ### Testing single element and burst enq/deq ###
> > > > >>> SP/SC single enq/dequeue: 5
> > > > >>> MP/MC single enq/dequeue: 40 (35)
> > > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > > >>>
> > > > >>> ### Testing empty dequeue ###
> > > > >>> SC empty dequeue: 2.11
> > > > >>> MC empty dequeue: 1.41 (2.11)
> > > > >>>
> > > > >>> ### Testing using a single lcore ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > > >>>
> > > > >>> ### Testing using two physical cores ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > > >>>
> > > > >>> ### Testing using two NUMA nodes ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > > >>>
> > > > >>> On one of the Arm platforms:
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the
> > > > >>> rest are ok)
> > > >
> > > > Tried this on a Power9 platform (3.6GHz), with two numa nodes and
> > > > 16 cores/node (SMT=4).  Applied all 3 patches in v5, test results
> > > > are as
> > > > follows:
> > > >
> > > > RTE>>ring_perf_elem_autotest
> > > > ### Testing single element and burst enq/deq ###
> > > > SP/SC single enq/dequeue: 42
> > > > MP/MC single enq/dequeue: 59
> > > > SP/SC burst enq/dequeue (size: 8): 5
> > > > MP/MC burst enq/dequeue (size: 8): 7
> > > > SP/SC burst enq/dequeue (size: 32): 2
> > > > MP/MC burst enq/dequeue (size: 32): 2
> > > >
> > > > ### Testing empty dequeue ###
> > > > SC empty dequeue: 7.81
> > > > MC empty dequeue: 7.81
> > > >
> > > > ### Testing using a single lcore ###
> > > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > > >
> > > > ### Testing using two hyperthreads ###
> > > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > > >
> > > > ### Testing using two physical cores ###
> > > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > > >
> > > > ### Testing using two NUMA nodes ###
> > > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > > >
> > > Thanks for running this. There is another test, 'ring_perf_autotest',
> > > which provides the numbers with the original implementation. The goal
> > > is to make sure the numbers with the original implementation are the
> > > same as these. Can you please run that as well?
> >
> > Honnappa,
> >
> > Your earlier perf report shows cycle counts of less than 1. That
> > is because it is using a 50 or 100MHz clock in EL0.
> > Please check with the PMU counter. See "ARM64 profiling" in
> >
> > http://doc.dpdk.org/guides/prog_guide/profile_app.html
I am aware of this. Unfortunately, it does not work on all platforms. The kernel team discourages using the cycle counter for this purpose.
I have replaced the modulo operation with division (in v6), which adds a couple of decimal places to the results.
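
To illustrate the point about sub-1 results, here is a minimal sketch and not the v6 code itself (the actual change is not shown in this thread). It assumes the test divides a total tick count by the number of elements processed. All function names and parameters below are hypothetical. With a 50 or 100MHz generic counter, an integer average truncates to 0, a floating-point average keeps the fraction, and a tick count can be scaled to approximate core cycles only if both frequencies are known.

```c
#include <stdint.h>

/* Hypothetical helper: integer average. Any per-element cost below one
 * timer tick is truncated to 0. */
static uint64_t avg_ticks_int(uint64_t total_ticks, uint64_t iterations,
                              unsigned int bulk_size)
{
    return total_ticks / (iterations * bulk_size);
}

/* Hypothetical helper: floating-point average. The fractional part is
 * kept, so a sub-tick cost such as 0.37 stays visible. */
static double avg_ticks_fp(uint64_t total_ticks, uint64_t iterations,
                           unsigned int bulk_size)
{
    return (double)total_ticks / ((double)iterations * (double)bulk_size);
}

/* Scale timer ticks to approximate core cycles. Both frequencies are
 * supplied by the caller; nothing here is read from hardware. */
static double ticks_to_core_cycles(double ticks, double timer_hz,
                                   double core_hz)
{
    return ticks * (core_hz / timer_hz);
}
```

For example, with an assumed 100MHz counter and an assumed 2GHz core, one counter tick corresponds to about 20 core cycles, which is why results reported in ticks can fall below 1.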

> >
> >
> > Here are the octeontx2 values. There is a regression in the two-core
> > cases, as you reported earlier on x86.
> >
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 288
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 61
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 21
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.35
> > MP/MC bulk enq/dequeue (size: 8): 67.36
> > SP/SC bulk enq/dequeue (size: 32): 13.10
> > MP/MC bulk enq/dequeue (size: 32): 21.64
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 75.94
> > MP/MC bulk enq/dequeue (size: 8): 107.66
> > SP/SC bulk enq/dequeue (size: 32): 24.51
> > MP/MC bulk enq/dequeue (size: 32): 33.23
> > Test OK
> > RTE>>
> >
> > ---- after applying v5 of the patch ------
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 289
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 40
> > MP/MC burst enq/dequeue (size: 8): 64
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 22
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 39.73
> > MP/MC bulk enq/dequeue (size: 8): 69.13
> > SP/SC bulk enq/dequeue (size: 32): 13.44
> > MP/MC bulk enq/dequeue (size: 32): 22.00
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 76.02
> > MP/MC bulk enq/dequeue (size: 8): 112.50
> > SP/SC bulk enq/dequeue (size: 32): 24.71
> > MP/MC bulk enq/dequeue (size: 32): 33.34
> > Test OK
> > RTE>>
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 290
> > MP/MC single enq/dequeue: 503
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 63
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 19
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.92
> > MP/MC bulk enq/dequeue (size: 8): 62.54
> > SP/SC bulk enq/dequeue (size: 32): 11.46
> > MP/MC bulk enq/dequeue (size: 32): 19.89
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 87.55
> > MP/MC bulk enq/dequeue (size: 8): 99.10
> > SP/SC bulk enq/dequeue (size: 32): 26.63
> > MP/MC bulk enq/dequeue (size: 32): 29.91
> > Test OK
> > RTE>>
> 
> It looks like removing 3/3 and keeping only 1/3 and 2/3 shows better
> results in some cases.
> 
> 
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 288
> MP/MC single enq/dequeue: 439
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 61
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 22
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.35
> MP/MC bulk enq/dequeue (size: 8): 67.48
> SP/SC bulk enq/dequeue (size: 32): 13.40
> MP/MC bulk enq/dequeue (size: 32): 22.03
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 75.94
> MP/MC bulk enq/dequeue (size: 8): 105.84
> SP/SC bulk enq/dequeue (size: 32): 25.11
> MP/MC bulk enq/dequeue (size: 32): 33.48
> Test OK
> RTE>>
> 
> 
> RTE>>ring_perf_elem_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 288
> MP/MC single enq/dequeue: 452
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 61
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 22
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.00
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.35
> MP/MC bulk enq/dequeue (size: 8): 67.46
> SP/SC bulk enq/dequeue (size: 32): 13.42
> MP/MC bulk enq/dequeue (size: 32): 22.01
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 76.04
> MP/MC bulk enq/dequeue (size: 8): 104.88
> SP/SC bulk enq/dequeue (size: 32): 24.75
> MP/MC bulk enq/dequeue (size: 32): 34.66
> Test OK
> RTE>>
> 
> 
> >
> >
> >
> > > > Dave
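
For comparing any two of the runs quoted above, a small helper for the relative change may be handy. This is only a sketch; the function name is hypothetical and it is not part of the test suite.

```c
/* Hypothetical helper: relative change of a candidate result against a
 * baseline, in percent. Positive values mean the candidate is slower
 * (more cycles per operation). */
static double rel_change_pct(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}
```

Applied to the Arm MP/MC bulk (size: 32) numbers, rel_change_pct(0.33, 0.37) gives roughly 12%, matching the "~12% hit" mentioned earlier. For the octeontx2 two-core SP/SC bulk (size: 8) case, rel_change_pct(75.94, 87.55) gives roughly 15%.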

