[dpdk-dev] [PATCH v6 00/13] vhost packed ring performance optimization

Liu, Yong yong.liu at intel.com
Thu Oct 17 09:32:52 CEST 2019



> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Thursday, October 17, 2019 3:31 PM
> To: Liu, Yong <yong.liu at intel.com>; Bie, Tiwei <tiwei.bie at intel.com>; Wang,
> Zhihong <zhihong.wang at intel.com>; stephen at networkplumber.org;
> gavin.hu at arm.com
> Cc: dev at dpdk.org
> Subject: Re: [PATCH v6 00/13] vhost packed ring performance optimization
> 
> Hi Marvin,
> 
> This is almost good, just fix the small comments I made.
> 
> Also, please rebase on top of the next-virtio branch, because I applied
> the patch below from Flavio, which you need to take into account:
> 
> http://patches.dpdk.org/patch/61284/

Thanks, Maxime. I will start rebasing work.

> 
> Regards,
> Maxime
> 
> On 10/15/19 6:07 PM, Marvin Liu wrote:
> > The packed ring has a more compact ring format and can therefore
> > significantly reduce the number of cache misses, which can lead to
> > better performance. This has been proven in the virtio user driver: on
> > a normal E5 Xeon CPU, single-core performance can rise by 12%.
> >
> > http://mails.dpdk.org/archives/dev/2018-April/095470.html
> >
> > However, vhost performance decreased with the packed ring. Analysis
> > showed that most of the extra cost came from calculating each
> > descriptor's flags, which depend on the ring wrap counter. Moreover,
> > both frontend and backend need to write the same descriptors, which
> > causes cache contention. In particular, while vhost is running its
> > enqueue function, the virtio packed ring refill function may write the
> > same cache line. This extra cache cost offsets the benefit of the
> > reduced cache misses.
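
To illustrate the per-descriptor flag calculation described above, here is a
minimal sketch in C. The bit positions follow the virtio 1.1 packed ring
layout. The names DESC_F_AVAIL, DESC_F_USED, desc_is_avail and
desc_used_flags are picked for this example and are not taken from the
patch set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits per the virtio 1.1 packed ring layout (names are illustrative). */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/*
 * A descriptor is available to the device when its AVAIL bit matches the
 * wrap counter and its USED bit does not.
 */
static bool
desc_is_avail(uint16_t flags, bool wrap_counter)
{
	bool avail = (flags & DESC_F_AVAIL) != 0;
	bool used = (flags & DESC_F_USED) != 0;

	return avail == wrap_counter && used != wrap_counter;
}

/*
 * Flags the device writes to mark a descriptor used: both bits are set
 * equal to the device's wrap counter.
 */
static uint16_t
desc_used_flags(bool wrap_counter)
{
	return wrap_counter ? (uint16_t)(DESC_F_AVAIL | DESC_F_USED) : 0;
}
```

Both helpers depend on the wrap counter, so doing this per descriptor in
the datapath adds work that the split ring does not have.
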
> >
> > To optimize vhost packed ring performance, the vhost enqueue and
> > dequeue functions are split into a fast path and a normal path.
> >
> > Several methods are used in the fast path:
> >   Handle the descriptors in one cache line as a batch.
> >   Split loop functions into smaller pieces and unroll them.
> >   Check up front whether I/O space can be copied directly into mbuf
> >     space and vice versa.
> >   Check up front whether descriptor mapping is successful.
> >   Use separate vhost used ring update functions for enqueue and
> >     dequeue.
> >   Buffer as many dequeue used descriptors as possible.
> >   Update enqueue used descriptors one cache line at a time.
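
The first item in the list above can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code from the
series: a 64-byte cache line, a cache-line-aligned ring base, and the
16-byte descriptor layout from virtio 1.1. CACHE_LINE_SIZE, BATCH_SIZE,
batch_is_ready and the other names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption for this sketch */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* 16-byte packed descriptor, following the virtio 1.1 layout. */
struct packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

/* Number of descriptors sharing one cache line: 64 / 16 = 4. */
#define BATCH_SIZE (CACHE_LINE_SIZE / sizeof(struct packed_desc))
#define BATCH_MASK (BATCH_SIZE - 1)

static bool
desc_is_avail(uint16_t flags, bool wrap)
{
	return ((flags & DESC_F_AVAIL) != 0) == wrap &&
	       ((flags & DESC_F_USED) != 0) != wrap;
}

/*
 * The fast path is taken only when the batch starts on a cache line
 * boundary, does not run past the end of the ring, and every descriptor
 * in it is available. Otherwise the caller falls back to the
 * single-descriptor (normal) path.
 */
static bool
batch_is_ready(const struct packed_desc *ring, uint16_t size,
	       uint16_t avail_idx, bool wrap)
{
	size_t i;

	if (avail_idx & BATCH_MASK)
		return false;
	if ((size_t)avail_idx + BATCH_SIZE > size)
		return false;

	for (i = 0; i < BATCH_SIZE; i++) {
		if (!desc_is_avail(ring[avail_idx + i].flags, wrap))
			return false;
	}
	return true;
}
```

Doing the checks once per batch keeps the per-descriptor loops short and
branch-free, which is what makes them candidates for unrolling.
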
> >
> > With all these methods applied, single-core vhost PvP performance with
> > 64B packets on a Xeon 8180 can improve by 35%.
> >
> > v6:
> > - Fix dequeue zcopy result check
> >
> > v5:
> > - Remove disable sw prefetch as performance impact is small
> > - Change unroll pragma macro format
> > - Rename shadow counter elements
> > - Clean up dequeue update check condition
> > - Add inline functions to replace duplicated code
> > - Unify code style
> >
> > v4:
> > - Support meson build
> > - Remove memory region cache: no clear performance gain, and it breaks ABI
> > - Do not assume ring size is a power of two
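
For the last v4 item, a ring index can be advanced without masking, so
the ring size does not need to be a power of two. A minimal sketch,
assuming num never exceeds size. struct ring_pos and ring_pos_advance are
names made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

struct ring_pos {
	uint16_t idx; /* next descriptor index, always < size */
	bool wrap;    /* wrap counter, toggled on each pass over the ring */
};

/*
 * Advance by num descriptors (assumed num <= size). Uses compare and
 * subtract instead of "idx & (size - 1)", so any ring size works.
 */
static void
ring_pos_advance(struct ring_pos *pos, uint16_t num, uint16_t size)
{
	uint32_t next = (uint32_t)pos->idx + num;

	if (next >= size) {
		next -= size;
		pos->wrap = !pos->wrap;
	}
	pos->idx = (uint16_t)next;
}
```
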
> >
> > v3:
> > - Check available index overflow
> > - Remove check on the number of remaining dequeue descs
> > - Remove changes in split ring datapath
> > - Call memory write barriers once when updating used flags
> > - Rename some functions and macros
> > - Code style optimization
> >
> > v2:
> > - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> > - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> > - Optimize dequeue used ring update when in_order negotiated
> >
> >
> > Marvin Liu (13):
> >   vhost: add packed ring indexes increasing function
> >   vhost: add packed ring single enqueue
> >   vhost: try to unroll for each loop
> >   vhost: add packed ring batch enqueue
> >   vhost: add packed ring single dequeue
> >   vhost: add packed ring batch dequeue
> >   vhost: flush enqueue updates by batch
> >   vhost: flush batched enqueue descs directly
> >   vhost: buffer packed ring dequeue updates
> >   vhost: optimize packed ring enqueue
> >   vhost: add packed ring zcopy batch and single dequeue
> >   vhost: optimize packed ring dequeue
> >   vhost: optimize packed ring dequeue when in-order
> >
> >  lib/librte_vhost/Makefile     |  18 +
> >  lib/librte_vhost/meson.build  |   7 +
> >  lib/librte_vhost/vhost.h      |  57 +++
> >  lib/librte_vhost/virtio_net.c | 924 +++++++++++++++++++++++++++-------
> >  4 files changed, 812 insertions(+), 194 deletions(-)
> >

