[dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency

Jerin Jacob jerin.jacob at caviumnetworks.com
Tue Nov 6 10:10:08 CET 2018


-----Original Message-----
> Date: Tue, 6 Nov 2018 06:07:43 +0000
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
> CC: "bruce.richardson at intel.com" <bruce.richardson at intel.com>,
>  "pablo.de.lara.guarch at intel.com" <pablo.de.lara.guarch at intel.com>,
>  "dev at dpdk.org" <dev at dpdk.org>, "yipeng1.wang at intel.com"
>  <yipeng1.wang at intel.com>, Dharmik Thakkar <Dharmik.Thakkar at arm.com>,
>  "Gavin Hu (Arm Technology China)" <Gavin.Hu at arm.com>, nd <nd at arm.com>,
>  "thomas at monjalon.net" <thomas at monjalon.net>, "ferruh.yigit at intel.com"
>  <ferruh.yigit at intel.com>, "hemant.agrawal at nxp.com"
>  <hemant.agrawal at nxp.com>, "chaozhu at linux.vnet.ibm.com"
>  <chaozhu at linux.vnet.ibm.com>, nd <nd at arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
> 
> 
> > > >
> > > > Add lock-free read-write concurrency. This is achieved by the
> > > > following changes.
> > > >
> > > > 1) Add memory ordering to avoid race conditions. The only race
> > > > condition that can occur is -  using the key store element before
> > > > the key write is completed. Hence, while inserting the element the
> > > > release memory order is used. Any other race condition is caught by
> > > > the key comparison. Memory orderings are added only where needed.
> > > > For ex: reads in the writer's context do not need memory ordering as
> > > > there is a single writer.
> > > >
> > > > key_idx in the bucket entry and pdata in the key store element are
> > > > used for synchronisation. key_idx is used to release an inserted
> > > > entry in the bucket to the reader. Use of pdata for synchronisation
> > > > is required due to updation of an existing entry where-in only the
> > > > pdata is updated without updating key_idx.
> > > >
> > > > 2) Reader-writer concurrency issue, caused by moving the keys to
> > > > their alternative locations during key insert, is solved by
> > > > introducing a global counter(tbl_chng_cnt) indicating a change in
> > > > table.
> > > >
> > > > 3) Add the flag to enable reader-writer concurrency during run time.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com>
> > >
> > > Hi Honnappa,
> > >
> Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.
> 
> I am trying to reproduce this, I will get back with more details soon.

OK. Since RC2 approaching, bit worried about next steps if we cannot
avoid the regression with this new feature for this release.

24% drop of performance regression not seen in past for a specific
usecase.So not sure what we did past for similar situations.

Thomas,
Any thought of this?


> 
> > > This patch is causing _~24%_ performance regression on mpps/core with
> > > 64B packet with l3fwd in EM mode with octeontx.
> > >
> > > Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)
> > >
> Have you run with more cores (8, 16)?

Tried with 4, 8, 16 cores, performance drop is linear.


> 
> > > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> > >
> > > Observations:
> > > 1) When hash lookup is _success_ then regression is only 3%. Which is
> > > kind of make sense because additional new atomic instructions
> > >
> > > What I meant by lookup is _success_ is:
> > > Configuring traffic gen like below to match lookup as defined
> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> > >
> > > dest.ip      port0    201.0.0.0
> > > src.ip       port0    200.20.0.1
> > > dest.port    port0    102
> > > src.port     port0    12
> > >
> > > dest.ip      port1    101.0.0.0
> > > src.ip       port1    100.10.0.1
> > > dest.port    port1    101
> > > src.port     port1    11
> > >
> > > tx.type      IPv4+TCP
> > >
> > >
> > >
> > > 2) When hash lookup _fails_ the per core mpps regression comes around 24%
> > with 64B packet size.
> > >
> > > What I meant by lookup is _failure_ is:
> > > Configuring traffic gen not to hit the 5 tuples defined in
> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> > >
> > >
> > > 3) perf top _without_ this patch
> > >   37.30%  l3fwd         [.] em_main_loop
> > >   22.40%  l3fwd         [.] rte_hash_lookup
> > >   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
> > >    9.70%  l3fwd         [.] nicvf_xmit_pkts
> > >    6.18%  l3fwd         [.] ipv4_hash_crc
> > >    4.77%  l3fwd         [.] nicvf_fill_rbdr
> > >    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> > >    1.16%  libc-2.28.so  [.] memcpy
> > >    0.47%  l3fwd         [.] common_ring_mp_enqueue
> > >    0.44%  l3fwd         [.] common_ring_mc_dequeue
> > >    0.03%  l3fwd         [.] strerror_r at plt
> > >
> > > 4) perf top with this patch
> > >
> > >   47.41%  l3fwd         [.] rte_hash_lookup
> > >   23.55%  l3fwd         [.] em_main_loop
> > >    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
> > >    6.95%  l3fwd         [.] nicvf_xmit_pkts
> > >    4.63%  l3fwd         [.] ipv4_hash_crc
> > >    3.30%  l3fwd         [.] nicvf_fill_rbdr
> > >    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> > >    0.76%  libc-2.28.so  [.] memcpy
> > >    0.30%  l3fwd         [.] common_ring_mp_enqueue
> > >    0.25%  l3fwd         [.] common_ring_mc_dequeue
> > >    0.04%  l3fwd         [.] strerror_r at plt
> > >
> > >
> > > 5) Based on assembly, most of the cycles spends in rte_hash_lookup
> > > around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR) and "if
> > > (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> > >
> > >
> > > 6) Since this patch is big and does 3 things are mentioned above, it
> > > is difficult to pin point what is causing the exact issue.
> > >
> > > But, my primary analysis shows the item (1)(adding the atomic barriers).
> > > But I need to spend more cycles to find out the exact causes.
> >
> >
> > + Adding POWERPC maintainer as mostly POWERPC also impacted on this
> > patch.
> > Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction
> > on x86, so x86 may not be much impacted.
> >
> > I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)
> > issue.
> >
> > The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n
> > operation which causing only around 1% regression.
> >
> > But since this patch has "two" __atomic_load_n in each
> > search_one_bucket() and in the worst case it is looping around 16 time.s i.e
> > "32 LDAR per packet" explains why 24% drop in lookup miss cases and ~3%
> > drop in lookup success case.
> Agree 'search_one_bucket' has 2 atomic loads. However, only 1 of them should be called during a lookup miss. Ideally, the second one should not be called as it is expected that the 'if' statement on the top is supposed to block it. Are you seeing the 2nd atomic load in your perf output?
> Is your failure traffic having multiple flows or single flow?

l3fwd has only 4 entries. So yes, it is single flow.

> 
> >
> > So this patch's regression will be based on how many cycles an LDAR takes on
> > given ARMv8 platform and on how many issue(s) it can issue LDAR
> > instructions at given point of time.
> Agree. There are multiple micro-architectures in Arm eco-system. We should establish few simple rules to make sure algorithms perform well on all the available platforms. I established few rules in VPP and they are working fine so far.

Can you share that rules for everyone's benefit?


> 
> >
> > IMO, This scheme won't work. I think, we are introducing such performance
> > critical feature, we need to put under function pointer scheme so that if an
> > application does not need such feature it can use plain loads.
> >
> IMO, we should do some more debugging before going into exploring other options.

Yes. But, how do we align with v18.11 release?

> 
> > Already we have a lot of flags in the hash library to define the runtime
> > behavior, I think, it makes sense to select function pointer based on such flags
> > and have a performance effective solution based on application requirements.
> >
> IMO, many flags can be removed by doing some cleanup exercise.
> 
> > Just to prove the above root cause analysis, the following patch can fix the
> > performance issue. I know, it is NOT correct in the context of this patch. Just
> > pasting in case, someone want to see the cost of LD vs
> > __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.
> >
> >
> > On a different note, I think, it makes sense to use RCU based structure in these
> > case to avoid performance issue. liburcu has a good hash library for such
> > cases. (very less write and more read cases)
> >
> I think we miss a point here. These changes are based on customer feedback and requests. DPDK is not only providing the implementation, it is providing the APIs as well. IMO, we should make sure the code we have in DPDK supports multiple use cases rather than addressing a narrow use case and pointing somewhere else for others.

Sure. But Introducing a use case should NOT harm other exiting use case
used current example applications.


> 
> >
> > >
> > > The use case like lwfwd in hash mode, where writer does not update
> > > stuff in fastpath(aka insert op) will be impact with this patch.
> > >
> I do not think 'l3fwd in hash mode' application is practical. If the aim of this application is to showcase performance, it should represent real-life use case. It should have hash inserts from control plane along with lookups. We also have to note that, the algorithm without lock-free patch was not usable for certain use cases on all platforms. It was not usable for even more use cases on Arm platforms. So, I am not sure if we should be comparing the numbers from previous algorithm.

In worst case, we need to make compile time or runtime with function
pointer or whatever it takes to fix.


> However, there seem to be some scope for improvement with the data structure layout. I am looking into this further.

OK

> 
> > > 7) Have you checked the l3fwd lookup failure use case in your environment?
> > > if so, please share your observation and if not, could you please check it?
> > >
> > > 8) IMO, Such performance regression is not acceptable for l3fwd use
> > > case where hash insert op will be done in slowpath.
> What is missing in this application is continuous hash adds from slow path.

Yes. That's one use case for hash library. Some application may not need such
use case so it should be driven from application to define the requirements,
especially a feature impacting another usecase's performance.


> 
> > >
> > > 9) Does anyone else facing this problem?
> Any data on x86?
> 
> > >
> > >


More information about the dev mailing list