[dpdk-dev] [snabb-devel] RE: [PATCH 0/4] DPDK memcpy optimization
    Luke Gorrie
    luke at snabb.co
    Tue Jan 27 14:57:44 CET 2015

Hi again John,
Thank you for the patient answers :-)
Thank you for pointing this out: I was mistakenly testing your Sandy Bridge
code on Haswell (lacking -DRTE_MACHINE_CPUFLAG_AVX2).
With that corrected, your code is both the fastest and the smallest in my
humble microbenchmarks.
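
For anyone else who trips over the same thing, here is a minimal sketch of
how a compile-time switch of this kind behaves. Only the
RTE_MACHINE_CPUFLAG_AVX2 define comes from the build flags mentioned above;
the function name is hypothetical and is not part of DPDK.

```c
#include <stdio.h>

/* Report which copy path this translation unit was compiled for.
 * Hypothetical helper for illustration only; not a DPDK function.
 * Build with -DRTE_MACHINE_CPUFLAG_AVX2 to select the AVX2 branch. */
static const char *copy_path_selected(void)
{
#ifdef RTE_MACHINE_CPUFLAG_AVX2
    return "AVX2 path (Haswell)";
#else
    return "non-AVX2 path (e.g. Sandy Bridge)";
#endif
}

/* Print the selected path so a benchmark log records what was measured. */
static void print_copy_path(void)
{
    printf("compiled copy path: %s\n", copy_path_selected());
}
```

Calling print_copy_path() at the start of a benchmark run makes it obvious
when the define is missing from the compiler flags.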
Looks like you have done great work! You probably knew that already :-) but
thank you for walking me through it.
The code compiles to 745 bytes of object code (smaller than glibc 2.20
memcpy) and gives these cachebench results:
                Memory Copy Library Cache Test
C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            97587.60        1.00
384             0.01            97628.83        1.00
512             0.01            97613.95        1.00
768             0.01            147811.44       0.66
1024            0.01            158938.68       0.93
1536            0.01            168487.49       0.94
2048            0.01            174278.83       0.97
3072            0.01            156922.58       1.11
4096            0.01            145811.59       1.08
6144            0.01            157388.27       0.93
8192            0.01            149616.95       1.05
12288           0.01            149064.26       1.00
16384           0.01            107895.06       1.38
The key difference from my perspective is that glibc 2.20 memcpy
performance goes way down for >= 2048 bytes, where it switches from vector
moves to string moves, while your code stays consistent.
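
A rough sketch of this kind of copy-throughput measurement is below. It is
not cachebench and not the harness used for the numbers above: it times
plain libc memcpy with clock(), assumes 1 MB = 10^6 bytes, and relies on a
GCC/Clang inline-asm barrier to keep the copy loop from being optimized
away. The function name copy_mb_per_sec is made up for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `size` bytes `iters` times and return throughput in MB/s
 * (assumption: 1 MB = 1e6 bytes). Returns -1.0 on bad arguments,
 * allocation failure, or a failed verification of the copy. */
static double copy_mb_per_sec(size_t size, size_t iters)
{
    if (size == 0 || iters == 0)
        return -1.0;

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);
    memset(dst, 0x00, size);

    clock_t start = clock();
    for (size_t i = 0; i < iters; i++) {
        memcpy(dst, src, size);
        /* Compiler barrier (GCC/Clang syntax): forces each copy to
         * actually happen instead of being merged or dropped. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    clock_t ticks = clock() - start;

    int ok = (memcmp(src, dst, size) == 0);
    free(src);
    free(dst);
    if (!ok)
        return -1.0;

    /* clock() is coarse; treat a zero reading as one tick so the
     * result stays finite (it is then only a lower bound). */
    if (ticks <= 0)
        ticks = 1;
    double secs = (double)ticks / (double)CLOCKS_PER_SEC;
    return ((double)size * (double)iters) / secs / 1e6;
}
```

Calling it for each size in the table, for example
copy_mb_per_sec(2048, 1000000), and swapping memcpy for the routine under
test gives a comparable sweep. The absolute numbers depend on the machine
and compiler flags, so only the shape of the curve is worth comparing.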
I will take it for a spin in a real application.
Cheers,
-Luke
    
    