[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

Neil Horman nhorman at tuxdriver.com
Tue Jan 20 20:16:24 CET 2015


On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
> 
> > Main code changes:
> > 
> > 1. Differentiate architectural features based on CPU flags
> > 
> >     a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth
> > 
> >     b. Implement separated copy flow specifically optimized for target architecture
> > 
> > 2. Rewrite the memcpy function "rte_memcpy"
> > 
> >     a. Add store aligning
> > 
> >     b. Add load aligning based on architectural features
> > 
> >     c. Put block copy loop into inline move functions for better control of instruction order
> > 
> >     d. Eliminate unnecessary MOVs
> > 
> > 3. Rewrite the inline move functions
> > 
> >     a. Add move functions for unaligned load cases
> > 
> >     b. Change instruction order in copy loops for better pipeline utilization
> > 
> >     c. Use intrinsics instead of assembly code
> > 
> > 4. Remove slow glibc call for constant copies
> > 
> > Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
> 
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?
> 
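
The store-aligning idea in point 2a above can be sketched roughly as follows. This is a minimal illustration only, not the actual rte_memcpy code from the patch; the function name `copy_store_aligned` and its structure are hypothetical. The point is that after copying an unaligned head, every 16-byte store lands on a 16-byte-aligned destination address, while loads are allowed to stay unaligned (point 3a covers the unaligned-load move functions):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of store aligning: align the destination first,
 * then run the bulk of the copy with aligned 16-byte stores. */
static void copy_store_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
	/* Head: copy bytes until dst reaches a 16-byte boundary. */
	size_t head = (16 - ((uintptr_t)dst & 15)) & 15;
	if (head > n)
		head = n;
	memcpy(dst, src, head);
	dst += head;
	src += head;
	n -= head;

	/* Body: dst is now 16-byte aligned, so _mm_store_si128 is legal;
	 * src may still be unaligned, so use the unaligned load. */
	while (n >= 16) {
		__m128i x = _mm_loadu_si128((const __m128i *)src);
		_mm_store_si128((__m128i *)dst, x);
		src += 16;
		dst += 16;
		n -= 16;
	}

	/* Tail: whatever is left after the 16-byte loop. */
	memcpy(dst, src, n);
}
```
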
> 
Fair point.  Though, does glibc implement optimized memcpy variants per arch?  Or
does it just rely on gcc's __builtins to get optimized variants?

Neil
