[dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features

Neil Horman nhorman at tuxdriver.com
Thu Jul 31 21:01:17 CEST 2014


On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> On Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-31 09:13, Neil Horman:
> > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > Hey all-
> > > > > > > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > > > > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > > > > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library) doesn't need to 
> > > > > > > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > > > > > > instructions in the specific code we want to build with support for that, and 
> > > > > > > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > > > > > > be used if the executing cpu supports them at run time.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Neil Horman <nhorman at tuxdriver.com>
> > > > > > > > > CC: Thomas Monjalon <thomas.monjalon at 6wind.com>
> > > > > > > > >
> > > > > > > > I'd prefer if a solution could be found based off your original patch
> > > > > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > > > > > and so should be available on all 64-bit systems, I believe. The
> > > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > > > > > hopefully the __builtin should work for that.
> > > > > > > > 
> > > > > > > Yes, but as I look at it, that's somewhat counter to my goal, which is to offer
> > > > > > > accelerated code paths on systems that can make use of it at run time.  If we
> > > > > > > use the __builtin compiler functions, we will either:
> > > > > > > 
> > > > > > > 1) Build those code paths with advanced instructions that won't work on older
> > > > > > > systems (i.e. crash)
> > > > > > > 
> > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > > > > > speedup execution on systems that are capable of using the more advanced
> > > > > > > instructions.
> > > > > > > 
> > > > > > > Using this run time check, we can, at least in these situations, make use of the
> > > > > > > accelerated paths when the instructions are available, and ignore them when
> > > > > > > they're not, at run time.
> > > > > > > 
> > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > > > > > employs, but implementing that would require some pretty significant work and
> > > > > > > testing.  This seems like a much simpler approach.
> > > > 
> > > > [...]
> > > > 
> > > > > Now, a macro that selected an instruction optimized or generic path is fine, as
> > > > > long as it can happen at run time.  The Linux kernel has such a feature, called
> > > > > alternatives.  But it's a complex subsystem that does run time replacement of
> > > > > instructions based on cpu feature flags.  It would be great to have in the DPDK,
> > > > > but it's a significant code base and difficult to maintain, which goes against
> > > > > your desire to reduce code.
> > > > 
> > > > [...]
> > > > 
> > > > > > Even though the code is written using intrinsics which correspond to SSE
> > > > > > operations, the compiler is free to use AVX instructions where necessary
> > > > > Not if you use the default machine target.
> > > > > 
> > > > > > to improve performance. Therefore, if we go down this road, we need to
> > > > > > look to compile up the code for all microarchitectures, rather than just
> > > > > > assuming that we will get equivalent performance to "native" by turning
> > > > > > on the instruction set indicated by the primitives in the code. This is
> > > > > No, you compile for the least common denominator system, and enable more
> > > > > performant paths opportunistically as run time checks allow.
> > > > > 
> > > > > > where having one codepath recompiled multiple times will work far better
> > > > > > than having multiple code paths.
> > > > > Only if your only concern is performance.  As noted above, my goal is more
> > > > > than just performance, it's compatibility across systems.  Multiple builds for
> > > > > multiple cpu flag availability is simply a non-starter for a generic
> > > > > distribution.
> > > > 
> > > > Neil, we are mixing 2 different problems here.
> > > > 1) we have to fix default build (without SSE-4.2)
> > > That's nothing to fix, that's a configuration issue.  Just build for a lesser
> > > machine.  I've already done that in the fedora build, using the default machine
> > > target.  What exactly is missing from that?
> > > 
> > Re-reading this, I'm wondering if I missed what you were trying to say, if so I
> > apologize.  Were you trying to assert that the right thing to do here is to
> > adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that
> > they are buildable on the default platform?  If so, I agree, that's a nice idea,
> > and am supportive of it, though I don't think that fully solves the problem.  In
> > the case of the ixgbe pmd, what we have is 2 code paths, a generic code path,
> > and an optimized code path using sse4.2 intrinsics.  In this case, I don't think
> > there's anything to fix, in that I'm fine with the optimized path needing sse4.2
> > to execute.  There I just want to be able to do a run time check and use the
> > optimized path if the cpu supports it, and just use the default path otherwise.
> > In effect we already have exactly what you are looking for there.
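
The two-path arrangement described above (a generic path plus an SSE4.2
path chosen by a run time check) could be sketched roughly as follows.
This is only an illustration, not the code from the patch: it assumes a
compiler that provides GCC's __builtin_cpu_supports() and the per-function
target attribute, and count_gt_generic, count_gt_sse42 and select_count_gt
are hypothetical names standing in for the real rx functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics, used only in the targeted function */

/* Generic path: plain C, builds and runs on any x86_64 CPU. */
static size_t count_gt_generic(const int64_t *v, size_t n, int64_t limit)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++)
        hits += (v[i] > limit);
    return hits;
}

/* Optimized path: only this function is compiled with SSE4.2 enabled,
 * the rest of the file keeps the default machine target. */
static __attribute__((target("sse4.2")))
size_t count_gt_sse42(const int64_t *v, size_t n, int64_t limit)
{
    const __m128i lim = _mm_set1_epi64x(limit);
    size_t i, hits = 0;

    for (i = 0; i + 2 <= n; i += 2) {
        __m128i x = _mm_loadu_si128((const __m128i *)(v + i));
        __m128i gt = _mm_cmpgt_epi64(x, lim);          /* SSE4.2 PCMPGTQ */
        int mask = _mm_movemask_pd(_mm_castsi128_pd(gt));

        hits += (size_t)((mask & 1) + (mask >> 1));
    }
    for (; i < n; i++)                                  /* odd tail element */
        hits += (v[i] > limit);
    return hits;
}

typedef size_t (*count_gt_fn)(const int64_t *, size_t, int64_t);

/* Run time check: hand back the optimized path only if the CPU has SSE4.2. */
static count_gt_fn select_count_gt(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.2") ? count_gt_sse42
                                            : count_gt_generic;
}
```

A caller would resolve the pointer once at init time (fn = select_count_gt())
and call through it afterwards, so the check stays out of the fast path.
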
> > 
> > As far as the ACL library goes, yes, that's more complex.  The use of sse4.2
> > intrinsics there is done throughout the code, so there's no easy way to select a
> > path.  We're just left with either using the code or returning an error at run
> > time, as my patch does.  Certainly we can build some macros that either use the
> > intrinsics for sse4.2 or code up some C-level variants of those instructions
> > based on generic code, and build for the least common denominator, or compile the
> > code twice (once without sse4.2 support, and once with), and do a runtime
> > selection between the two.  Either way, that's going to be a useful, though
> > significant effort.
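
For the "compile the code twice and select at run time" option, one way to
picture it is a single source body instantiated once per target. The sketch
below assumes GCC-style target attributes and __builtin_cpu_supports();
DEFINE_BYTE_MATCH, byte_match_generic, byte_match_sse42 and
byte_match_select are made-up names for illustration and have nothing to
do with the actual ACL code.

```c
#include <stddef.h>
#include <stdint.h>

/* One source body, instantiated once per target.  The compiler is free to
 * generate different instructions for each copy. */
#define DEFINE_BYTE_MATCH(name, attrs)                                    \
    static attrs size_t name(const uint8_t *buf, size_t n, uint8_t key)  \
    {                                                                     \
        size_t i, hits = 0;                                               \
        for (i = 0; i < n; i++)                                           \
            hits += (buf[i] == key);                                      \
        return hits;                                                      \
    }

/* Copy 1: default machine target (empty attribute list). */
DEFINE_BYTE_MATCH(byte_match_generic, )
/* Copy 2: same code, built with SSE4.2 allowed for this function only. */
DEFINE_BYTE_MATCH(byte_match_sse42, __attribute__((target("sse4.2"))))

typedef size_t (*byte_match_fn)(const uint8_t *, size_t, uint8_t);

/* Run time selection between the two copies. */
static byte_match_fn byte_match_select(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return byte_match_sse42;
    return byte_match_generic;
}
```

The source stays single, at the cost of carrying two object-code copies of
every function built this way.
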
> 
> I think a good first step here that I can't see anyone objecting to is
> to enable the ixgbe driver to use the vector code path for a generic
> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> allows a compile for machine type default, and testpmd can still forward
> packets at a good rate (roughly perf down about 10% vs native compile on
> SNB).
> The ACL is a tougher nut to crack, but anyone see any issues with that
> two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> set thread, do you want to submit an official patch here, or would you prefer I
> do so?]
> 
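
To make the proposed substitution concrete: __builtin_popcountll is a GCC
builtin that needs no SSE4.2 header or -m flag, so it compiles for the
default machine type, and the compiler only emits the POPCNT instruction
when the build flags allow it. A small sketch follows, with
popcount64_builtin and popcount64_portable as hypothetical wrapper names
(the second is just a reference implementation to compare against).

```c
#include <stdint.h>

/* Builds for a generic x86_64 target.  Without POPCNT enabled at compile
 * time the compiler falls back to a software sequence or helper call. */
static inline unsigned popcount64_builtin(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}

/* Plain C reference: clear the lowest set bit until none remain. */
static inline unsigned popcount64_portable(uint64_t x)
{
    unsigned n = 0;

    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}
```
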

I'm happy to do so, though a 10% performance degradation vs. using the sse4.2
instructions in that path seems significant, doesn't it?  Given that performance
delta, it seems like it would still be preferable to have a path that uses the
sse4.2 instructions when they're available.  Or am I misreading what you mean
when you say down 10%?

Neil
 
> > 
> > > > 2) we could try to have performance with default build
> > > > 
> > > Yes, we can, that's what this patch does.  It doesn't address every code path,
> > > no, but it addresses two paths that are low hanging fruit for doing so, and we
> > > can incrementally build on that.
> > > 
> > > > Please, let's focus on the first item and we could discuss about performance
> > > > later. Having some different code path chosen at runtime is a big rework and
> > > > implies changing the compilation model (RFC welcome).
> > > > 
> > Even if I misinterpreted your statement above, I'm still not sure why you're
> > asserting this. Fixing the build to work with the default target machine is
> > good, and should be undertaken, and I'll happily do so, but why reject the
> > solution in front of you to wait for it?  Even if I write macros to fix up the
> > ACL library, I'd still like to be able to do a run time check and select the
> > optimized version or the generic version based on cpu support.  Just doing a
> > compile time check to determine if sse4.2 is available really isn't going to cut
> > it for me, as I don't want the fedora dpdk to have pessimal performance if it
> > doesn't have to.
> > 
> > Regards
> > Neil
> > 
> 
> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library? This would push the whole issue
> of detection of code paths outside of code, work across all our
> libraries and ensure each user got the best performance they could get
> from a binary?
> Has something like this been done before? The building of all the
> libraries could be scripted easy enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.
> 
> /Bruce
> 

