[dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants

Neil Horman nhorman at tuxdriver.com
Mon Jul 28 15:32:23 CEST 2014


On Sat, Jul 26, 2014 at 11:15:01AM +0000, Ananyev, Konstantin wrote:
> 
> Hi Neil,
>  
> > The ixgbe pmd currently can't be built without enabling sse instructions at
> > compile time.
> 
> Actually it can, all you have to do is set RTE_IXGBE_INC_VECTOR=n in your config.
> 
> >  While sse extensions provide better performance, theres no reason
> > that we can't still create builds to run on systems that don't support sse.  If
> > we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll
> > functions, I've confirmed that the gcc compiler emits the appropriate sse
> > instructions when the provided -march parameter indicates a machine that
> > includes sse support, and emits generic code when see isn't available.
> 
> I don't think it is ok to blindly replace _mm_shuffle_epi8 with __builtin_shuffle.
> They are not identical.
> I tried your patch on IVB box (gcc 4.8.3, CONFIG_RTE_MACHINE="native").
> The result is - ixgbe_recv_pkts_vec() functionality is broken.
> See below for more details.
> So my vote is NACK.
> Konstantin 
>  
> 1. Code changes:
> uint16_t
> ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>                 uint16_t nb_pkts)
> {
>         ...
>         __m128i shuf_msk;
>         ...
>         /* mask to shuffle from desc. to mbuf */
>         shuf_msk = _mm_set_epi8(
>                 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
>                 0xFF, 0xFF,  /* skip high 16 bits vlan_macip, zero out */
>                 15, 14,      /* octet 14~15, low 16 bits vlan_macip */
>                 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
>                 13, 12,      /* octet 12~13, low 16 bits pkt_len */
>                 0xFF, 0xFF,  /* skip nb_segs and in_port, zero out */
>                 13, 12       /* octet 12~13, 16 bits data_len */
>                 );
>          ...
>          for (...) {
>                 __m128i descs[RTE_IXGBE_DESCS_PER_LOOP];
>                 __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
>                 ...
>                 descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3)); 
>                 ...
> -               pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
> +              pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk);
> ...
> 
> 2. Code generated before the patch (valid one):
> ...
> vmovdqa 0x4978d(%rip),%xmm0           /* load shuf_msk */
> ...
> vmovdqu 0x30(%rdx),%xmm4                /* load desc[3] */
> ....
> vpshufb %xmm0,%xmm4,%xmm8
> ....
> 
> 3. Code generated after the patch applied (broken one):
> ...
> vmovdqu 0x30(%rdx),%xmm3
> ...
> vpunpcklqdq %xmm3,%xmm3,%xmm3       /* !!! ERROR - should be vpshufb !!!! */
> 
> 4. What happens here?
> My understanding:
> 
> GCC treats __m128i as vector of two 64bit integers:
> /lib/gcc/x86_64-redhat-linux/4.8.3/include/emmintrin.h:typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
> 
> From https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html:
> "... Vector shuffling is available using functions __builtin_shuffle (vec, mask) and __builtin_shuffle (vec0, vec1, mask). Both functions construct a permutation of elements from one or two vectors and return a vector of the same type as the input vector(s). The mask is an integral vector with the same width (W) and element count (N) as the output vector.
> 
> The elements of the input vectors are numbered in memory ordering of vec0 beginning at 0 and vec1 beginning at N. The elements of mask are considered modulo N in the single-operand case and modulo 2*N in the two-operand case."
> 
> For m128i N = 2, so:
> 
> __m128i x, __m128i msk;
> x = __builtin_shuffle(x, msk);
> 
> means:
> 
> index0 = msk[0..63] % 2;
> index1 = msk[64..127] % 2;
> x[0..63] =  x[index0 * 64..index0*64+63];
> x[64..127] =  x[index1 * 64..index1*64+63];
> 
> In ixgbe_recv_pkts_vec() shuf_msk[0..63] % 2 == 0 and  shuf_msk[64..127] % 2.
> So compiler makes optimisation:
> pkt_mb4[0..63]  = descs[3] [0..63];
> pkt_mb4[64..127]  = descs[3] [0..63];
> i.e:
> vpunpcklqdq %xmm3,%xmm3,%xmm3
> 
> BTW, changing to
> __builtin_shuffle((__v16qi) descs[3], (__v16qi)shuf_msk);
> wouldn't help either.
>  In that case __builtin_shuffle will consider elements of mas modulo 16.
> While _mm_shuffle_epi8 (PSHUFB) is expected to zero destination byte if upper bit in the corresponding mask byte is 1.
> 
Ok, Sorry for the delayed response, I spend the weekend reading up on the
differences between these instructions, and you're right, I thought those
operations were equivalent, but the zeroing operation differentiates them.
Sorry about that.

That said, it would be really helpful for distribution packaging to be able to
enable vectorized reception at run time.  A compile time build variable really
just isn't very helpful.  I'll see if I can rework the patch to allow optional
vectorized patch reception at run time.

Neil

>  
> 


More information about the dev mailing list