[dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization

Liu, Yong yong.liu at intel.com
Thu Oct 24 10:29:04 CEST 2019


Thanks, Maxime. Just sent out v9. 

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Thursday, October 24, 2019 4:25 PM
> To: Liu, Yong <yong.liu at intel.com>; Bie, Tiwei <tiwei.bie at intel.com>; Wang,
> Zhihong <zhihong.wang at intel.com>; stephen at networkplumber.org;
> gavin.hu at arm.com
> Cc: dev at dpdk.org
> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
> 
> 
> 
> On 10/24/19 9:18 AM, Liu, Yong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >> Sent: Thursday, October 24, 2019 2:50 PM
> >> To: Liu, Yong <yong.liu at intel.com>; Bie, Tiwei <tiwei.bie at intel.com>;
> Wang,
> >> Zhihong <zhihong.wang at intel.com>; stephen at networkplumber.org;
> >> gavin.hu at arm.com
> >> Cc: dev at dpdk.org
> >> Subject: Re: [PATCH v8 00/13] vhost packed ring performance optimization
> >>
> >> I get some checkpatch warnings, and build fails with clang.
> >> Could you please fix these issues and send v9?
> >>
> >
> >
> > Hi Maxime,
> > Clang build fails will be fixed in v9. For checkpatch warning, it was due
> to pragma string inside.
> > Previous version can avoid such warning, while format is a little messy
> as below.
> > I prefer to keep code clean and more readable. How about your idea?
> >
> > +#ifdef UNROLL_PRAGMA_PARAM
> > +#define VHOST_UNROLL_PRAGMA(param) _Pragma(param)
> > +#else
> > +#define VHOST_UNROLL_PRAGMA(param) do {} while (0);
> > +#endif
> >
> > +       VHOST_UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
> > +       for (i = 0; i < PACKED_BATCH_SIZE; i++)
> 
> That's less clean indeed. I agree to waive the checkpatch errors.
> just fix the Clang build for patch 8 and we're good.
> 
> Thanks,
> Maxime
> 
> > Regards,
> > Marvin
> >
> >> Thanks,
> >> Maxime
> >>
> >> ### [PATCH] vhost: try to unroll for each loop
> >>
> >> WARNING:CAMELCASE: Avoid CamelCase: <_Pragma>
> >> #78: FILE: lib/librte_vhost/vhost.h:47:
> >> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
> >> 4") \
> >>
> >> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
> >> parenthesis
> >> #78: FILE: lib/librte_vhost/vhost.h:47:
> >> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
> >> 4") \
> >> +	for (iter = val; iter < size; iter++)
> >>
> >> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
> >> parenthesis
> >> #83: FILE: lib/librte_vhost/vhost.h:52:
> >> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4")
> \
> >> +	for (iter = val; iter < size; iter++)
> >>
> >> ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
> >> parenthesis
> >> #88: FILE: lib/librte_vhost/vhost.h:57:
> >> +#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)")
> \
> >> +	for (iter = val; iter < size; iter++)
> >>
> >> total: 3 errors, 1 warnings, 67 lines checked
> >>
> >> 0/1 valid patch/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1:
> >> error: unused function 'free_zmbuf' [-Werror,-Wunused-function]
> >> free_zmbuf(struct vhost_virtqueue *vq)
> >> ^
> >> 1 error generated.
> >> make[5]: *** [virtio_net.o] Error 1
> >> make[4]: *** [librte_vhost] Error 2
> >> make[4]: *** Waiting for unfinished jobs....
> >> make[3]: *** [lib] Error 2
> >> make[2]: *** [all] Error 2
> >> make[1]: *** [pre_install] Error 2
> >> make: *** [install] Error 2
> >>
> >>
> >> On 10/22/19 12:08 AM, Marvin Liu wrote:
> >>> Packed ring has more compact ring format and thus can significantly
> >>> reduce the number of cache miss. It can lead to better performance.
> >>> This has been approved in virtio user driver, on normal E5 Xeon cpu
> >>> single core performance can raise 12%.
> >>>
> >>> http://mails.dpdk.org/archives/dev/2018-April/095470.html
> >>>
> >>> However vhost performance with packed ring performance was decreased.
> >>> Through analysis, mostly extra cost was from the calculating of each
> >>> descriptor flag which depended on ring wrap counter. Moreover, both
> >>> frontend and backend need to write same descriptors which will cause
> >>> cache contention. Especially when doing vhost enqueue function, virtio
> >>> refill packed ring function may write same cache line when vhost doing
> >>> enqueue function. This kind of extra cache cost will reduce the benefit
> >>> of reducing cache misses.
> >>>
> >>> For optimizing vhost packed ring performance, vhost enqueue and dequeue
> >>> function will be splitted into fast and normal path.
> >>>
> >>> Several methods will be taken in fast path:
> >>>   Handle descriptors in one cache line by batch.
> >>>   Split loop function into more pieces and unroll them.
> >>>   Prerequisite check that whether I/O space can copy directly into mbuf
> >>>     space and vice versa.
> >>>   Prerequisite check that whether descriptor mapping is successful.
> >>>   Distinguish vhost used ring update function by enqueue and dequeue
> >>>     function.
> >>>   Buffer dequeue used descriptors as many as possible.
> >>>   Update enqueue used descriptors by cache line.
> >>>
> >>> After all these methods done, single core vhost PvP performance with
> 64B
> >>> packet on Xeon 8180 can boost 35%.
> >>>
> >>> v8:
> >>> - Allocate mbuf by virtio_dev_pktmbuf_alloc
> >>>
> >>> v7:
> >>> - Rebase code
> >>> - Rename unroll macro and definitions
> >>> - Calculate flags when doing single dequeue
> >>>
> >>> v6:
> >>> - Fix dequeue zcopy result check
> >>>
> >>> v5:
> >>> - Remove disable sw prefetch as performance impact is small
> >>> - Change unroll pragma macro format
> >>> - Rename shadow counter elements names
> >>> - Clean dequeue update check condition
> >>> - Add inline functions replace of duplicated code
> >>> - Unify code style
> >>>
> >>> v4:
> >>> - Support meson build
> >>> - Remove memory region cache for no clear performance gain and ABI
> break
> >>> - Not assume ring size is power of two
> >>>
> >>> v3:
> >>> - Check available index overflow
> >>> - Remove dequeue remained descs number check
> >>> - Remove changes in split ring datapath
> >>> - Call memory write barriers once when updating used flags
> >>> - Rename some functions and macros
> >>> - Code style optimization
> >>>
> >>> v2:
> >>> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> >>> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> >>> - Optimize dequeue used ring update when in_order negotiated
> >>>
> >>>
> >>> Marvin Liu (13):
> >>>   vhost: add packed ring indexes increasing function
> >>>   vhost: add packed ring single enqueue
> >>>   vhost: try to unroll for each loop
> >>>   vhost: add packed ring batch enqueue
> >>>   vhost: add packed ring single dequeue
> >>>   vhost: add packed ring batch dequeue
> >>>   vhost: flush enqueue updates by cacheline
> >>>   vhost: flush batched enqueue descs directly
> >>>   vhost: buffer packed ring dequeue updates
> >>>   vhost: optimize packed ring enqueue
> >>>   vhost: add packed ring zcopy batch and single dequeue
> >>>   vhost: optimize packed ring dequeue
> >>>   vhost: optimize packed ring dequeue when in-order
> >>>
> >>>  lib/librte_vhost/Makefile     |  18 +
> >>>  lib/librte_vhost/meson.build  |   7 +
> >>>  lib/librte_vhost/vhost.h      |  57 ++
> >>>  lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
> >>>  4 files changed, 837 insertions(+), 193 deletions(-)
> >>>
> >



More information about the dev mailing list