[dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements

Bruce Richardson bruce.richardson at intel.com
Mon Oct 3 11:59:07 CEST 2016


On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> This patchset improves lookup performance in the hash library
> by replacing the existing 4-stage, 2-entry lookup bulk pipeline
> with an improved pipeline based on a loop-and-jump model.
> It also uses x86 vector intrinsics to speed up signature comparison.
> 
> The first patch reorganizes the hash structure.
> The structure spans more than one 64-byte cache line, but only a subset
> of its fields is used in the lookup operation (the most common operation).
> Those lookup fields have therefore been grouped together so that they all
> fit in one cache line, slightly improving performance in some scenarios.
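The cache-line grouping described above can be sketched as below; the struct and field names are illustrative stand-ins, not the actual struct rte_hash definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy layout showing the idea: fields read on every lookup are grouped
 * so they share a single 64-byte cache line; cold configuration fields
 * follow. All names are illustrative, not the real rte_hash fields. */
struct toy_hash {
    /* hot: read on the lookup path */
    uint32_t entries;
    uint32_t bucket_mask;
    uint32_t key_len;
    void *buckets;
    void *key_store;
    /* cold: rarely touched after creation */
    char name[32];
    uint32_t socket_id;
};
```

A check such as `offsetof(struct toy_hash, key_store) + sizeof(void *) <= 64` confirms that the hot fields fit in one cache line.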
> 
> The second patch modifies the layout of the bucket structure.
> Currently, each bucket stores all the signatures (current and alternative)
> interleaved. A vectorized signature comparison requires all the current
> signatures to be contiguous, so the bucket layout has been changed to
> separate the current signatures from the alternative ones.
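One way to picture the layout change; the struct and field names below are assumptions for illustration, not the actual DPDK bucket definition.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES_PER_BUCKET 8

/* Before: current and alternative signatures interleaved per entry,
 * so the current signatures are scattered across the bucket. */
struct bucket_old {
    struct {
        uint32_t current;
        uint32_t alt;
    } signatures[ENTRIES_PER_BUCKET];
    /* key indexes, flags, ... */
};

/* After: all current signatures contiguous, so a single vector load
 * fetches every signature that lookup must compare. */
struct bucket_new {
    uint32_t sig_current[ENTRIES_PER_BUCKET];
    uint32_t sig_alt[ENTRIES_PER_BUCKET];
    /* key indexes, flags, ... */
};
```

The reordering changes only the placement of the fields, not the amount of data stored per bucket.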
> 
> The third patch introduces x86 vector intrinsics.
> When performing a lookup bulk operation, all the current signatures in a
> bucket are compared against the signature of the key being looked up.
> Now that they are contiguous, the comparison can be vectorized,
> which takes fewer instructions to carry out.
> On machines with AVX2, the number of entries per bucket is increased
> from 4 to 8, since AVX2 can compare two 256-bit values holding
> eight 32-bit integers each, i.e. the 8 signatures of a bucket.
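A hedged sketch of the vectorized compare: the patch's AVX2 path compares all 8 current signatures at once (e.g. with _mm256_cmpeq_epi32), while the SSE2 version below shows the same idea on 4 lanes, since SSE2 is baseline on every x86-64 CPU. Function and variable names are illustrative, not the patch's code.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* Compare 4 contiguous 32-bit signatures against a probe signature
 * in one vector operation and return a 4-bit hit mask:
 * bit i set  =>  sigs[i] == sig. */
static inline uint32_t
sig_match_mask(const uint32_t sigs[4], uint32_t sig)
{
    __m128i bucket = _mm_loadu_si128((const __m128i *)sigs);
    __m128i probe  = _mm_set1_epi32((int)sig);
    __m128i eq     = _mm_cmpeq_epi32(bucket, probe);
    /* collapse each 32-bit lane to one bit */
    return (uint32_t)_mm_movemask_ps(_mm_castsi128_ps(eq));
}
```

Each set bit in the returned mask marks an entry whose full key must then be compared; a zero mask means no candidate in the bucket.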
> 
> The fourth (and last) patch replaces the current pipeline of the lookup
> bulk function with one based on a loop-and-jump model. The two key
> improvements are:
> 
> - Better prefetching: the first 4 keys to be looked up are prefetched up
>   front, and the remaining keys are prefetched while the signatures are
>   being calculated. This gives the CPU more time to fetch the data before
>   it is actually needed, resulting in fewer cache misses and therefore
>   higher throughput.
> 
> - Lower penalty when falling back: the lookup bulk algorithm assumes that
>   collisions within a bucket are rare, but two or more signatures may be
>   equal, in which case more than one key comparison is necessary. As in
>   the current implementation, only the key of the first hit is prefetched.
>   The difference is that if that comparison results in a miss, the
>   information about the other candidate keys has been stored, whereas the
>   current implementation has to perform an entire simple lookup again.
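The prefetching scheme in the first bullet can be sketched roughly as follows; the hash function and all names are toy stand-ins, not the rte_hash code.

```c
#include <assert.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4

/* Toy signature function, standing in for the real hash. */
static uint32_t toy_hash(const uint32_t *key)
{
    return *key * 2654435761u; /* Knuth multiplicative hash constant */
}

/* Loop-and-jump style bulk hashing: prefetch the first few keys up
 * front, then keep prefetching ahead while hashing the current key,
 * so memory latency overlaps with computation. */
static void
bulk_hash(const uint32_t **keys, uint32_t *sigs, int n)
{
    int i;

    /* warm up: prefetch the first batch of keys */
    for (i = 0; i < PREFETCH_AHEAD && i < n; i++)
        __builtin_prefetch(keys[i], 0, 3);

    for (i = 0; i < n; i++) {
        /* prefetch a later key while we hash the current one */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(keys[i + PREFETCH_AHEAD], 0, 3);
        sigs[i] = toy_hash(keys[i]);
    }
}
```

Prefetching is only a hint, so the results are identical to hashing each key directly; the overlap just hides memory latency.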
> 
> Changes in v4:
> - Reordered the hash structure so that the alternative signature is at
>   the start of the next cache line, and explained in the commit message
>   why it was moved
> - Reordered the hash structure so that the name field is at the top,
>   leaving all the fields used in lookup in the next cache line
>   (instead of the first cache line)
> 
> Changes in v3:
> - Corrected the cover letter (wrong number of patches)
> 
> Changes in v2:
> - Increased entries per bucket from 4 to 8 for all cases,
>   so it is no longer architecture-dependent.
> - Replaced the compile-time selection of the signature comparison
>   function with run-time selection, so the best available optimization
>   is used from a single binary.
> - Reordered the hash structure, so all the fields used by lookup
>   are in the same (first) cache line.
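The run-time selection mentioned in the v2 changes can be sketched as below. DPDK uses its own CPU-flag helpers rather than __builtin_cpu_supports, and the vector variant is replaced here by a scalar stand-in, so treat every name as an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*sig_cmp_fn)(const uint32_t *sigs, uint32_t sig);

/* Portable fallback: build the 8-bit hit mask one lane at a time. */
static uint32_t
cmp_scalar(const uint32_t *sigs, uint32_t sig)
{
    uint32_t mask = 0;

    for (int i = 0; i < 8; i++)
        mask |= (uint32_t)(sigs[i] == sig) << i;
    return mask;
}

/* Pick the best implementation once at init time, so a single binary
 * runs optimally on any machine. The AVX2 branch would return the
 * vectorized variant; a scalar stand-in is used here. */
static sig_cmp_fn
select_sig_cmp(void)
{
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx2"))
        return cmp_scalar; /* stand-in for an AVX2 variant */
#endif
    return cmp_scalar;
}
```

The function pointer is resolved once and stored in the hash table, so the per-lookup cost of the run-time choice is a single indirect call.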
> 
> Byron Marohn (3):
>   hash: reorganize bucket structure
>   hash: add vectorized comparison
>   hash: modify lookup bulk pipeline
> 

Hi,

Firstly, checkpatches is reporting some style errors in these patches.

Secondly, when I run the "hash_multiwriter_autotest" after applying this
patchset, I get what I assume to be an error. Before this set is applied,
that test reports the cycles per insert with and without lock elision. Now,
though, I'm getting an error about a key being dropped or failing to insert
in the lock elision case, e.g.

  Core #2 inserting 1572864: 0 - 1,572,864
  key 1497087 is lost
  1 key lost

I've run the test a number of times, and there is a single key lost each time.
Please check on this: is it expected, or is it a problem?

Thanks,
/Bruce
