[dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue

Liu, Yong yong.liu at intel.com
Mon Sep 21 08:26:55 CEST 2020
Previous message: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
Next message: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin at redhat.com>
> Sent: Friday, September 18, 2020 9:45 PM
> To: Liu, Yong <yong.liu at intel.com>; Xia, Chenbo <chenbo.xia at intel.com>;
> Wang, Zhihong <zhihong.wang at intel.com>
> Cc: dev at dpdk.org
> Subject: Re: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
> 
> 
> 
> On 8/19/20 5:24 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu at intel.com>
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index 4f2f3e47da..c0cd7d498f 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
> >  endif
> >  endif
> >
> > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > +        CC_AVX512_SUPPORT=\
> > +        $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > +        sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > +        grep -q AVX512 && echo 1)
> > +endif
> > +
> >  ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
> >  LDLIBS += -lnuma
> >  endif
> > @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> lrte_ethdev -lrte_net
> >  SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
> >  					vhost_user.c virtio_net.c vdpa.c
> >
> > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > +CFLAGS += -DCC_AVX512_SUPPORT
> > +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> > +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> > +endif
> > +
> >  # install includes
> >  SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> rte_vdpa.h \
> >  						rte_vdpa_dev.h
> rte_vhost_async.h
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> >  if has_libnuma == 1
> >  	dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> >  endif
> > +
> > +if arch_subdir == 'x86'
> > +        if not machine_args.contains('-mno-avx512f')
> > +                if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > +                        cflags += ['-DCC_AVX512_SUPPORT']
> > +                        vhost_avx512_lib = static_library('vhost_avx512_lib',
> > +                                              'vhost_vec_avx.c',
> > +                                              dependencies: [static_rte_eal,
> static_rte_mempool,
> > +                                                  static_rte_mbuf, static_rte_ethdev,
> static_rte_net],
> > +                                              include_directories: includes,
> > +                                              c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> mavx512vl'])
> > +                        objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > +                endif
> > +        endif
> > +endif
> > +
> >  if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> >  	cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> >  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> *dev, struct rte_mempool *mp,
> >  	return NULL;
> >  }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids);
> >  #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..e8361d18fa
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -0,0 +1,152 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > +	offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > +	1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc,
> flags) / \
> > +	sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > +	sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2)  |
> \
> > +	1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > +	offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> VRING_DESC_F_USED) \
> > +	<< FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > +	FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > +				 struct vhost_virtqueue *vq,
> > +				 struct rte_mempool *mbuf_pool,
> > +				 struct rte_mbuf **pkts,
> > +				 uint16_t avail_idx,
> > +				 uintptr_t *desc_addrs,
> > +				 uint16_t *ids)
> > +{
> > +	struct vring_packed_desc *descs = vq->desc_packed;
> > +	uint32_t descs_status;
> > +	void *desc_addr;
> > +	uint16_t i;
> > +	uint8_t cmp_low, cmp_high, cmp_result;
> > +	uint64_t lens[PACKED_BATCH_SIZE];
> > +
> > +	if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > +		return -1;
> > +
> > +	/* load 4 descs */
> > +	desc_addr = &vq->desc_packed[avail_idx];
> > +	__m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > +	/* burst check four status */
> > +	__m512i avail_flag_vec;
> > +	if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> 0x0,
> > +					PACKED_FLAGS_MASK, 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > +	else
> > +#if defined(RTE_ARCH_I686)
> > +		avail_flag_vec =
> _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > +					0x0, PACKED_AVAIL_FLAG_WRAP,
> 0x0);
> > +#else
> > +		avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > +					PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > +	descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > +		_MM_CMPINT_NE);
> > +	if (descs_status & BATCH_FLAGS_MASK)
> > +		return -1;
> > +
> > +	/* check buffer fit into one region & translate address */
> > +	__m512i regions_low_addrs =
> > +		_mm512_loadu_si512((void *)&dev->regions_low_addrs);
> > +	__m512i regions_high_addrs =
> > +		_mm512_loadu_si512((void *)&dev->regions_high_addrs);
> > +	vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > +		uint64_t addr_low = descs[avail_idx + i].addr;
> > +		uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> > +		__m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> > +		__m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> > +
> > +		cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> > +				regions_low_addrs, _MM_CMPINT_NLT);
> > +		cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> > +				regions_high_addrs, _MM_CMPINT_LT);
> > +		cmp_result = cmp_low & cmp_high;
> > +		int index = __builtin_ctz(cmp_result);
> > +		if (unlikely((uint32_t)index >= dev->mem->nregions))
> > +			goto free_buf;
> > +
> > +		desc_addrs[i] = addr_low +
> > +			dev->mem->regions[index].host_user_addr -
> > +			dev->mem->regions[index].guest_phys_addr;
> > +		lens[i] = descs[avail_idx + i].len;
> > +		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > +		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> > +		if (!pkts[i])
> > +			goto free_buf;
> > +	}
> 
> The above does not support vIOMMU, isn't it?
> 
> The more the packed datapath evolves, the more it gets optimized for a
> very specific configuration.
> 
> In v19.11, indirect descriptors and chained buffers are handled as a
> fallback. And now vIOMMU support is handled as a fallback.
> 

Hi Maxime,
Thanks for figuring out the feature miss. First version patch is lack of vIOMMU supporting.
V2 patch will fix the feature gap between vectorized function and original batch function.
So there will be no additional fallback introduced in vectorized patch set.

IMHO, current packed optimization introduced complexity is for handling that gap between performance aimed frontend (like PMD) and normal network traffic (like TCP).
Vectorized datapath is focusing in enhancing the performance of batched function.  From function point of view, there will no difference between vectorized batched function and original batched function. 
Current packed ring path will remain the same if vectorized option is not enable. So I think the complexity won't increase too much. If there's any concern, please let me known. 

BTW, vectorized path can help performance a lot when vIOMMU enabled. 

Regards,
Marvin

> I personnally don't like the path it is taking as it is adding a lot of
> complexity on top of that.
>
Previous message: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
Next message: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list