[dpdk-dev] [PATCH v1 01/14] ring: remove split cacheline build setting

Olivier Matz olivier.matz at 6wind.com
Wed Mar 1 11:17:53 CET 2017


Hi Bruce,

On Wed, 1 Mar 2017 09:47:03 +0000, Bruce Richardson
<bruce.richardson at intel.com> wrote:
> On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> > On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:  
> > > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:  
> > > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson
> > > > wrote:  
> > > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:  
> > > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson
> > > > > > wrote:  
> > > > > > > Users compiling DPDK should not need to know or care
> > > > > > > about the arrangement of cachelines in the rte_ring
> > > > > > > structure. Therefore just remove the build option and set
> > > > > > > the structures to be always split. For improved
> > > > > > > performance use 128B rather than 64B alignment since it
> > > > > > > stops the producer and consumer data being on adjacent
> > > > > > > cachelines.


You say you see improved performance on Intel from having an extra
blank cacheline between the producer and consumer data. Do you have an
idea why it behaves like this? Could it be related to the hardware
adjacent cache line prefetcher?
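To make the layout question concrete, here is a minimal sketch (not the
actual rte_ring structure; the names `headtail` and `ring_sketch` are
illustrative only) of padding each head/tail pair out to 128B, i.e. two
64B lines, so the producer and consumer metadata never sit on adjacent
cachelines that the x86 adjacent-line prefetcher fetches as a pair:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only -- not the real rte_ring layout. Padding
 * each head/tail pair out to 128B (two 64B lines) keeps the producer
 * and consumer metadata off adjacent cachelines, so the adjacent-line
 * prefetcher cannot drag one core's line into the other core's cache. */
#define CACHE_LINE 64

struct headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
	/* pad the pair of counters out to 2 full cachelines (128B) */
	char pad[2 * CACHE_LINE - 2 * sizeof(uint32_t)];
};

struct ring_sketch {
	struct headtail prod;	/* written by the producer core */
	struct headtail cons;	/* written by the consumer core */
};
```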



> [...]
> > # base code  
> > RTE>>ring_perf_autotest  
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 84
> > MP/MC single enq/dequeue: 301
> > SP/SC burst enq/dequeue (size: 8): 20
> > MP/MC burst enq/dequeue (size: 8): 46
> > SP/SC burst enq/dequeue (size: 32): 12
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.11
> > MC empty dequeue: 12.15
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 19.08
> > MP/MC bulk enq/dequeue (size: 8): 46.28
> > SP/SC bulk enq/dequeue (size: 32): 11.89
> > MP/MC bulk enq/dequeue (size: 32): 18.84
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 37.42
> > MP/MC bulk enq/dequeue (size: 8): 73.32
> > SP/SC bulk enq/dequeue (size: 32): 18.69
> > MP/MC bulk enq/dequeue (size: 32): 24.59
> > Test OK
> > 
> > # with ring rework patch  
> > RTE>>ring_perf_autotest  
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 84
> > MP/MC single enq/dequeue: 301
> > SP/SC burst enq/dequeue (size: 8): 19
> > MP/MC burst enq/dequeue (size: 8): 45
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.10
> > MC empty dequeue: 12.15
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 18.59
> > MP/MC bulk enq/dequeue (size: 8): 45.49
> > SP/SC bulk enq/dequeue (size: 32): 11.67
> > MP/MC bulk enq/dequeue (size: 32): 18.65
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 37.41
> > MP/MC bulk enq/dequeue (size: 8): 72.98
> > SP/SC bulk enq/dequeue (size: 32): 18.69
> > MP/MC bulk enq/dequeue (size: 32): 24.59
> > Test OK  
> > RTE>>  
> > 
> > # with ring rework patch + cache-line size change to one on 128BCL
> > target  
> > RTE>>ring_perf_autotest  
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 90
> > MP/MC single enq/dequeue: 317
> > SP/SC burst enq/dequeue (size: 8): 20
> > MP/MC burst enq/dequeue (size: 8): 48
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 8.10
> > MC empty dequeue: 11.15
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 20.24
> > MP/MC bulk enq/dequeue (size: 8): 48.43
> > SP/SC bulk enq/dequeue (size: 32): 11.01
> > MP/MC bulk enq/dequeue (size: 32): 18.43
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 25.92
> > MP/MC bulk enq/dequeue (size: 8): 69.76
> > SP/SC bulk enq/dequeue (size: 32): 14.27
> > MP/MC bulk enq/dequeue (size: 32): 22.94
> > Test OK  
> > RTE>>  
> 
> So given that there is not much difference here, is the MIN_SIZE i.e.
> forced 64B, your preference, rather than actual cacheline-size?
> 

I don't quite like this CACHE_LINE_MIN_SIZE macro. To me, it does not
convey any clear intent. The reasons for aligning on the cache line
size are straightforward, but when would we need to align on the
minimum cache line size supported by dpdk? For instance, in the mbuf
structure, aligning on 64B would make more sense to me.

So, I would prefer using (RTE_CACHE_LINE_SIZE * 2) here. If we don't
want it on some architectures, or if this optimization is only for Intel
(or for all archs that benefit from it), I think we could have something
like:

/* Use two cachelines on x86 to avoid false sharing between the
 * producer and consumer data caused by the adjacent-line prefetcher. */
#ifdef RTE_ARCH_X86
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
#else
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#endif
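A self-contained sketch of how such a macro could be applied (here
RTE_CACHE_LINE_SIZE and __rte_aligned are re-defined as stand-ins for
the DPDK definitions, and RTE_ARCH_X86 is the arch macro DPDK defines
on x86 builds):

```c
#include <stdint.h>

/* Stand-ins for the DPDK definitions, to keep the sketch compilable
 * outside the DPDK tree. */
#define RTE_CACHE_LINE_SIZE 64
#define __rte_aligned(a) __attribute__((__aligned__(a)))

#ifdef RTE_ARCH_X86
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
#else
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#endif

/* Each head/tail block then gets at least one cacheline of alignment,
 * and two cachelines (128B) on x86 builds. */
struct prod_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
} __rte_ring_aligned;
```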


Olivier
