[PATCH] eal: non-temporal memcpy
    Konstantin Ananyev 
    konstantin.ananyev at huawei.com
       
    Tue Oct 11 11:25:23 CEST 2022
    
    
  
Hi Morten,
 
> Mattias, Konstantin, Honnappa, Stephen,
> 
> In my patch for non-temporal memcpy, I have been aiming for using as much non-temporal store as possible. E.g. copying 16 byte to a
> 16 byte aligned address will be done using non-temporal store instructions.
> 
> Now, I am seriously considering this alternative:
> 
> Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
> 
> I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a
> pcap header) followed by packet data.
Sounds like a reasonable idea to me.
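For illustration only, a minimal sketch of how such a split could look. This is not the code from the patch: the helper name nt_copy_full_lines() and the 64-byte CACHE_LINE constant are assumptions made for the example, and the SSE2 path is guarded so that other platforms simply fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Hypothetical helper: NT stores for complete destination cache lines only,
 * normal stores for the partial lines at the start and at the end. */
static void
nt_copy_full_lines(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t head;

    assert(dst != NULL && src != NULL);

    /* Head: normal stores until dst reaches a cache line boundary. */
    head = (CACHE_LINE - ((uintptr_t)d & (CACHE_LINE - 1))) & (CACHE_LINE - 1);
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Body: complete cache lines, dst is now cache line aligned. */
    while (len >= CACHE_LINE) {
#if defined(__SSE2__)
        for (size_t i = 0; i < CACHE_LINE; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);
        }
#else
        memcpy(d, s, CACHE_LINE); /* no NT store available in this sketch */
#endif
        d += CACHE_LINE;
        s += CACHE_LINE;
        len -= CACHE_LINE;
    }
#if defined(__SSE2__)
    _mm_sfence(); /* order the NT stores before the stores that follow */
#endif

    /* Tail: normal stores for the last partial cache line. */
    memcpy(d, s, len);
}
```

With this shape, a small header written with normal stores right before the packet data stays on the normal path as long as it shares a cache line with the start of that data.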
> 
> The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the
> end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33
> cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.).
> 
> What do you think?
My guess is that for modern high-end x86 CPUs the difference would be negligible.
Though again, right now it is just a guess, and I don't have a clue what the impact (if any) would be on other platforms.
If we really want to remove any doubt, then probably the best thing is to have some sort of micro-benchmark in our UT that would simulate
some memory(/cache)-bound workload plus normal or NT copies.
As a very rough thought:
Allocate a memory buffer (size=X) big enough that it for sure wouldn't fit into the CPU caches.
Then in a loop, for each iteration:
    - do N random normal reads/writes from/to that buffer to simulate some memory-bound workload
      (so each iteration causes a more or less constant % of cache misses).
    - invoke the memcpy_ex(size=Y) in question K (=32, the DPDK magic number?) times for different memory locations.
Measure the number of cycles it takes for some big number of iterations.
That would probably show us the difference (if any)
between memcpy() and memcpy_ex(), or between different implementations of memcpy_ex(),
in terms of cache-line saving, etc.
Again, it would probably also show from what size Y it is worth starting to use NT instead of normal copies for such workloads.
By varying the X, N, Y, K parameters we can test different scenarios on different platforms.
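To make the idea a bit more concrete, here is a rough sketch of such a loop in plain C. It is only an outline under assumptions: bench_copy(), copy_fn and rnd_next() are hypothetical names, the copy routine is passed in with a memcpy()-like signature (the real memcpy_ex() may take different arguments), and clock() stands in for a cycle counter (in a DPDK test one would presumably read the TSC, e.g. with rte_rdtsc()).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed memcpy()-like signature for the copy routine under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

/* Small xorshift64 PRNG, so the access pattern is cheap and repeatable. */
static uint64_t
rnd_next(uint64_t *state)
{
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    return *state;
}

/* x: buffer size, n: random accesses per iteration, y: copy size,
 * k: copies per iteration. Returns 0 on success, -1 on allocation failure;
 * elapsed CPU time in seconds is stored in *secs. */
static int
bench_copy(copy_fn fn, size_t x, unsigned int n, size_t y, unsigned int k,
           unsigned int iters, double *secs)
{
    uint64_t seed = 88172645463325252ULL;
    volatile uint8_t sink = 0; /* keeps the random reads from being optimized away */
    uint8_t *buf, *src;
    clock_t t0, t1;

    assert(fn != NULL && secs != NULL && y > 0 && x > y);

    buf = malloc(x);
    src = malloc(y);
    if (buf == NULL || src == NULL) {
        free(buf);
        free(src);
        return -1;
    }
    memset(buf, 1, x); /* touch all pages before timing */
    memset(src, 2, y);

    t0 = clock();
    for (unsigned int it = 0; it < iters; it++) {
        /* N random normal reads/writes: the memory-bound background load. */
        for (unsigned int j = 0; j < n; j++) {
            size_t off = (size_t)(rnd_next(&seed) % x);
            sink += buf[off];
            buf[off] = sink;
        }
        /* K copies of Y bytes to different locations in the big buffer. */
        for (unsigned int j = 0; j < k; j++) {
            size_t off = (size_t)(rnd_next(&seed) % (x - y));
            fn(buf + off, src, y);
        }
    }
    t1 = clock();

    *secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    free(buf);
    free(src);
    return 0;
}
```

A run would then compare, for example, bench_copy(memcpy, (size_t)1 << 30, 64, 1024, 32, 100000, &secs) against the same call with an NT copy routine wrapped to the same signature, while sweeping x, n, y and k.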
> 
> PS: Non-temporal loads are easy to work with, so don't worry about that.
> 
> 
> Med venlig hilsen / Kind regards,
> -Morten Brørup
    
    