[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
Neil Horman
nhorman at tuxdriver.com
Tue Jan 20 20:16:24 CET 2015
On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang at intel.com wrote:
>
> > Main code changes:
> >
> > 1. Differentiate architectural features based on CPU flags
> >
> > a. Implement separate move functions for SSE/AVX/AVX2 to fully utilize cache bandwidth
> >
> > b. Implement a separate copy flow optimized for each target architecture
> >
> > 2. Rewrite the memcpy function "rte_memcpy"
> >
> > a. Add store alignment
> >
> > b. Add load alignment based on architectural features
> >
> > c. Put block copy loop into inline move functions for better control of instruction order
> >
> > d. Eliminate unnecessary MOVs
> >
> > 3. Rewrite the inline move functions
> >
> > a. Add move functions for unaligned load cases
> >
> > b. Change instruction order in copy loops for better pipeline utilization
> >
> > c. Use intrinsics instead of assembly code
> >
> > 4. Remove slow glibc call for constant copies
> >
> > Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
>
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?
>
>
Fair point. Though, does glibc implement optimized memcpy variants per arch? Or does
it just rely on gcc's __builtin's to get optimized variants?
Neil
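For readers following the thread, here is a minimal sketch of the intrinsics-based copy described in item 3 above, assuming an SSE2-only baseline. The function name sse_copy_sketch is illustrative and is not the API from the patch; the actual rte_memcpy additionally aligns the destination before the main loop and dispatches to AVX/AVX2 paths based on CPU flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sketch only, not the DPDK implementation: copy in 16-byte chunks
 * using unaligned SSE2 loads/stores, then finish the tail one byte
 * at a time. */
static void *
sse_copy_sketch(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* Main loop: one 16-byte vector load + store per iteration. */
    while (n >= 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, x);
        s += 16;
        d += 16;
        n -= 16;
    }
    /* Copy the remaining (n % 16) tail bytes. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

Using intrinsics rather than inline assembly (item 3c) leaves instruction scheduling and register allocation to the compiler, which is what makes the reordering for pipeline utilization practical across SSE/AVX/AVX2 variants.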