[dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform

Yang, Zhiyong zhiyong.yang at intel.com
Fri Dec 16 03:15:39 CET 2016


Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 15, 2016 6:54 PM
> To: Yang, Zhiyong <zhiyong.yang at intel.com>; Thomas Monjalon
> <thomas.monjalon at 6wind.com>
> Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> <bruce.richardson at intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch at intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> Hi Zhiyong,
> 
> > -----Original Message-----
> > From: Yang, Zhiyong
> > Sent: Thursday, December 15, 2016 6:51 AM
> > To: Yang, Zhiyong <zhiyong.yang at intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev at intel.com>; Thomas Monjalon
> > <thomas.monjalon at 6wind.com>
> > Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> > <bruce.richardson at intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch at intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Thomas, Konstantin:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Yang, Zhiyong
> > > Sent: Sunday, December 11, 2016 8:33 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev at intel.com>; Thomas
> > > Monjalon <thomas.monjalon at 6wind.com>
> > > Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> > > <bruce.richardson at intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch at intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > > on IA platform
> > >
> > > Hi, Konstantin, Bruce:
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Thursday, December 8, 2016 6:31 PM
> > > > To: Yang, Zhiyong <zhiyong.yang at intel.com>; Thomas Monjalon
> > > > <thomas.monjalon at 6wind.com>
> > > > Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> > > > <bruce.richardson at intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.guarch at intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yang, Zhiyong
> > > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > > To: Ananyev, Konstantin <konstantin.ananyev at intel.com>; Thomas
> > > > > Monjalon <thomas.monjalon at 6wind.com>
> > > > > Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> > > > > <bruce.richardson at intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch at intel.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > > rte_memset on IA platform
> > > > >
> > > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > >
> > > > > static inline void *
> > > > > rte_memset_huge(void *s, int c, size_t n)
> > > > > {
> > > > > 	return __rte_memset_vector(s, c, n);
> > > > > }
> > > > >
> > > > > static inline void *
> > > > > rte_memset(void *s, int c, size_t n)
> > > > > {
> > > > > 	if (n < XXX)
> > > > > 		return rte_memset_scalar(s, c, n);
> > > > > 	else
> > > > > 		return rte_memset_huge(s, c, n);
> > > > > }
> > > >
> > > > > XXX could be either a define or a variable, so it
> > > > > can be set up at startup, depending on the architecture.
> > > >
> > > > Would that work?
> > > > Konstantin
> > > >
> > I have implemented the code for choosing the functions at run time.
> > rte_memcpy is used more frequently, so I tested it at run time.
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > extern rte_memcpy_vector_t rte_memcpy_vector;
> >
> > static inline void *
> > rte_memcpy(void *dst, const void *src, size_t n)
> > {
> > 	return rte_memcpy_vector(dst, src, n);
> > }
> >
> > In order to reduce the overhead at run time, I assign the function address
> > to the variable rte_memcpy_vector in a constructor, before main() starts.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_avx2;
> > 	}
> > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > 	{
> > 		rte_memcpy_vector = rte_memcpy_sse;
> > 	}
> > 	else
> > 	{
> > 		rte_memcpy_vector = memcpy;
> > 	}
> >
> > }
> 
> I thought we discussed a slightly different approach,
> in which rte_memcpy_vector() (or rte_memset_vector()) would be called only
> after some cutoff point, i.e.:
> 
> void
> rte_memcpy(void *dst, const void *src, size_t len) {
> 	if (len < N) memcpy(dst, src, len);
> 	else rte_memcpy_vector(dst, src, len);
> }
> 
> If you always call rte_memcpy_vector() for every len, then the
> compiler most likely always has to generate a real call (no inlining
> happens).
> For small lengths, the price of the extra function call would probably
> outweigh any potential gain from the SSE/AVX2 implementation.
> 
> Konstantin
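
To make the cutoff approach above concrete, here is a minimal sketch for the memset case. It is not the patch code: the names demo_memset, demo_memset_large, demo_memset_vector and the value DEMO_CUTOFF are hypothetical, and the "vector" routine is only a stand-in that wraps memset. A real version would presumably select an SSE/AVX2 routine by checking CPU flags in the constructor.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; the thread leaves the real value ("N"/"XXX") open. */
#define DEMO_CUTOFF 64

typedef void *(*demo_memset_fn)(void *s, int c, size_t n);

/* Stand-in for an SSE/AVX2 implementation; here it just wraps memset. */
static void *demo_memset_large(void *s, int c, size_t n)
{
	return memset(s, c, n);
}

/* Filled in before main() runs, as in the constructor quoted above. */
static demo_memset_fn demo_memset_vector;

static void __attribute__((constructor)) demo_memset_init(void)
{
	/* Assumption: a real version would test CPU flags here. */
	demo_memset_vector = demo_memset_large;
}

/* Small sizes stay inline; only large sizes pay for the indirect call. */
static inline void *demo_memset(void *s, int c, size_t n)
{
	if (n < DEMO_CUTOFF) {
		unsigned char *p = s;
		size_t i;

		for (i = 0; i < n; i++)
			p[i] = (unsigned char)c;
		return s;
	}
	return demo_memset_vector(s, c, n);
}
```

With this shape the compiler can still inline and unroll the small-size branch when n is a compile-time constant, and the function-pointer call is only reached past the cutoff.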

Yes. In fact, from my tests, for small lengths rte_memset is far better than glibc memset,
while for large lengths rte_memset is only a bit better than memset,
because memset uses AVX2/SSE too. Of course, it will use AVX512 on future machines.

> For small lengths, the price of the extra function call would probably
> outweigh any potential gain.
This is the key point. I think it should include the scalar optimization, not only the vector optimization.

The value of rte_memset is that it is always inlined, so for small lengths it will be better,
whereas in some cases we cannot be sure that memset is inlined by the compiler.
It seems that choosing the function at run time would lose these gains.
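
As an illustration of what a scalar small-length path could look like, here is a minimal sketch. It is an assumption for discussion, not the code in the patch: demo_memset_small is a hypothetical helper, and it only handles n <= 16. The fixed-size memcpy calls are normally compiled into plain 4-byte or 8-byte stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill n bytes (n <= 16) using scalar stores only. */
static inline void *demo_memset_small(void *s, int c, size_t n)
{
	unsigned char *p = s;
	/* Replicate the byte into all 8 lanes of a 64-bit word. */
	uint64_t v = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;

	if (n >= 8) {
		/* Two possibly overlapping 8-byte stores cover 8..16 bytes. */
		memcpy(p, &v, 8);
		memcpy(p + n - 8, &v, 8);
	} else if (n >= 4) {
		/* Two possibly overlapping 4-byte stores cover 4..7 bytes. */
		memcpy(p, &v, 4);
		memcpy(p + n - 4, &v, 4);
	} else {
		/* 0..3 bytes: plain byte stores. */
		while (n--)
			*p++ = (unsigned char)c;
	}
	return s;
}
```

Because every byte of v is identical, the overlapping stores are harmless and the result does not depend on endianness. When the function is inlined and n is a constant, the branches fold away, which is the kind of gain a run-time function pointer cannot give.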
The following was measured on Haswell with the patch code.
** rte_memset() - memset perf tests
        (C = compile-time constant) **
======= =============== ===============
   Size memset in cache   memset in mem
(bytes)         (ticks)         (ticks)
------- --------------- ---------------
============= 32B aligned =============
      3        3 -    8       19 -  128
      4        4 -    8       13 -  128
      8        2 -    7       19 -  128
      9        2 -    7       19 -  127
     12        2 -    7       19 -  127
     17        3 -    8       19 -  132
     64        3 -    8       28 -  168
    128        7 -   13       54 -  200
    255        8 -   20      100 -  223
    511       14 -   20      187 -  314
   1024       24 -   29      328 -  379
   8192      198 -  225     1829 - 2193

Thanks
Zhiyong


