[dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform

Yang, Zhiyong zhiyong.yang at intel.com
Thu Dec 8 10:53:12 CET 2016


Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 8, 2016 5:26 PM
> To: Yang, Zhiyong <zhiyong.yang at intel.com>; Thomas Monjalon
> <thomas.monjalon at 6wind.com>
> Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> <bruce.richardson at intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch at intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
> 
> 
> Hi Zhiyong,
> 
> >
> > HI, Thomas:
> > 	Sorry for the late reply. I have been considering your
> > suggestion.
> >
> > > -----Original Message-----
> > > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > Sent: Friday, December 2, 2016 6:25 PM
> > > To: Yang, Zhiyong <zhiyong.yang at intel.com>
> > > Cc: dev at dpdk.org; yuanhan.liu at linux.intel.com; Richardson, Bruce
> > > <bruce.richardson at intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev at intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch at intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > > 2016-12-05 16:26, Zhiyong Yang:
> > > > +#ifndef _RTE_MEMSET_X86_64_H_
> > >
> > > Is this implementation specific to 64-bit?
> > >
> >
> > Yes.
> >
> > > > +
> > > > +#define rte_memset memset
> > > > +
> > > > +#else
> > > > +
> > > > +static void *
> > > > +rte_memset(void *dst, int a, size_t n);
> > > > +
> > > > +#endif
> > >
> > > If I understand well, rte_memset (as rte_memcpy) is using the most
> > > recent instructions available (and enabled) when compiling.
> > > It is not adapting the instructions to the run-time CPU.
> > > There is no need to downgrade at run-time the instruction set as it
> > > is obviously not a supported case, but it would be nice to be able
> > > to upgrade a "default compilation" at run-time as it is done in rte_acl.
> > > I explain this case more clearly for reference:
> > >
> > > We can have AVX512 supported in the compiler but disable it when
> > > compiling
> > > (CONFIG_RTE_MACHINE=snb) in order to build a binary running almost
> > > everywhere.
> > > When running this binary on a CPU having AVX512 support, it will not
> > > benefit from the AVX512 improvement.
> > > Though, we can compile an AVX512 version of some functions and use
> > > them only if the running CPU is capable.
> > > This kind of miracle can be achieved in three ways:
> > >
> > > 1/ For generic C code compiled with a recent GCC, a function can be
> > > built for several CPUs thanks to the attribute target_clones.
> > >
> > > 2/ For manually optimized functions using CPU-specific intrinsics or
> > > asm, it is possible to build them with non-default flags thanks to the
> > > attribute target.
> > >
> > > 3/ For manually optimized files using CPU-specific intrinsics or
> > > asm, we use specific flags in the makefile.
> > >
> > > The function clone in case 1/ is dynamically chosen at run-time
> > > through ifunc resolver.
> > > The specific functions in cases 2/ and 3/ must be chosen at run-time by
> > > initializing a function pointer thanks to rte_cpu_get_flag_enabled().
> > >
> > > Note that rte_hash and software crypto PMDs have a run-time check
> > > with
> > > rte_cpu_get_flag_enabled() but do not override CFLAGS in the Makefile.
> > > Next step for these libraries?
> > >
> > > Back to rte_memset, I think you should try the solution 2/.
> >
> > I have read the ACL code. If I understand correctly, this is a good idea
> > for complex algorithm implementations, but choosing functions at run time
> > brings some overhead. For a frequently called function that consumes
> > few cycles, the overhead may exceed the gains the optimization brings.
> > For example, in most DPDK applications memset only sets N = 10 or 12
> > bytes, which consumes few cycles.
> 
> But then what is the point of having an rte_memset() using vector
> instructions at all?
> From what you are saying, the most common case is even less than the SSE
> register size.
> Konstantin

In most cases memset is used in the form memset(address, 0, sizeof(struct xxx));
The 10 or 12 bytes I mentioned was only an example; that size happens to be small.
rte_memset is being introduced to handle the generic case:
sizeof(struct xxx) is not limited to very small sizes, such as less than the SSE register size.
My point is only that the size in most use cases is not very large, so the cycles consumed
are few. Given that, the overhead of choosing the function at run-time makes that approach unsuitable.
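
To make the trade-off concrete, here is a minimal sketch. It is not code from
the patch: struct example_hdr, memset_bytes, memset_wide, select_memset and the
cpu_has_wide flag are hypothetical names used only for illustration. In DPDK the
flag would come from a check such as rte_cpu_get_flag_enabled(). The sketch
contrasts a direct call with a constant size against the same call routed
through a function pointer selected at startup.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "struct xxx"; 12 bytes on typical ABIs. */
struct example_hdr {
    unsigned int id;
    unsigned short flags;
    unsigned char pad[6];
};

typedef void *(*memset_fn_t)(void *dst, int c, size_t n);

/* Baseline variant: plain byte loop. */
static void *memset_bytes(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    size_t i;

    for (i = 0; i < n; i++)
        p[i] = (unsigned char)c;
    return dst;
}

/* Stand-in for a variant built with wider vector instructions.
 * Here it simply forwards to libc memset. */
static void *memset_wide(void *dst, int c, size_t n)
{
    return memset(dst, c, n);
}

/* Pointer initialized once; every later call goes through it. */
static memset_fn_t memset_ptr = memset_bytes;

/* cpu_has_wide is a hypothetical flag supplied by the caller. */
static void select_memset(int cpu_has_wide)
{
    memset_ptr = cpu_has_wide ? memset_wide : memset_bytes;
}

/* Direct call: the size is a compile-time constant, so the compiler
 * is free to expand this into a few plain stores. */
static void clear_direct(struct example_hdr *h)
{
    memset(h, 0, sizeof(*h));
}

/* Indirect call: the target is only known at run-time, so the call
 * typically cannot be inlined, whatever the size is. */
static void clear_dispatched(struct example_hdr *h)
{
    memset_ptr(h, 0, sizeof(*h));
}
```

Both helpers produce the same result. The difference is that for a small
structure the fixed cost of the indirect call can be a large share of the
total work. How large is something to measure, not something this sketch shows.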

thanks
Zhiyong

