[dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Fri Nov 9 01:47:55 CET 2018


> >> >
> >> > 9) Is anyone else facing this problem?
> >Any data on x86?
> >
> [Wang, Yipeng]
> I tried Jerin's tests on x86. By default, l3fwd on x86 uses lookup_bulk
> and SIMD instructions, so there is no obvious throughput drop in either the
> hit or the miss case (the hit case does show about a 2.5% drop, though).
Do you mean that if the test case has 'hit only' lookups, there is a 2.5% drop?

> 
> I manually changed l3fwd to do single-packet lookup instead of bulk. For the
> hit case there is no throughput drop.
> For the miss case, there is a 10% throughput drop.
> 
> I dug into it; as expected, the atomic load indeed translates to a regular
> mov on x86.
> But because of the instruction ordering constraints, the compiler (gcc 5.4)
> can no longer unroll the for loop into switch-case-like assembly as before.
> So I believe the reason for the performance drop on x86 is that the compiler
> cannot optimize the code as well as it did previously.
Thank you. This makes sense.

> I guess this is a totally different reason from why performance drops on
> your non-TSO machine. On a non-TSO machine, the excessive number of
> atomic loads probably causes a lot of overhead.
> 
> A quick fix I found useful on x86 is to read all the indexes together. I am
> no expert on the use of atomic intrinsics, but I assume adding a fence
> should still maintain the correct ordering?
> -       uint32_t key_idx;
> +       uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
>         void *pdata;
>         struct rte_hash_key *k, *keys = h->key_store;
> 
> +       memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
> +       __atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> -               key_idx = __atomic_load_n(&bkt->key_idx[i],
> -                                         __ATOMIC_ACQUIRE);
> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
> +               if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT) {
Thank you for your suggestion. I tried it on Arm platforms; unfortunately, it did not help. However, the idea of reducing the number of memory orderings does address the problem. I worked on a hacked patch over the last couple of days and tested it with the L3FWD data set; it provides good benefits. I have sent it to you and Jerin. Any feedback will be helpful.

> 
> Yipeng
