[dpdk-dev] [PATCH v8 00/13] vhost packed ring performance optimization

Maxime Coquelin maxime.coquelin at redhat.com
Thu Oct 24 08:49:39 CEST 2019


I get some checkpatch warnings, and build fails with clang.
Could you please fix these issues and send v9?

Thanks,
Maxime

### [PATCH] vhost: try to unroll for each loop

WARNING:CAMELCASE: Avoid CamelCase: <_Pragma>
#78: FILE: lib/librte_vhost/vhost.h:47:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
4") \

ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
parenthesis
#78: FILE: lib/librte_vhost/vhost.h:47:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("GCC unroll
4") \
+	for (iter = val; iter < size; iter++)

ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
parenthesis
#83: FILE: lib/librte_vhost/vhost.h:52:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll 4") \
+	for (iter = val; iter < size; iter++)

ERROR:COMPLEX_MACRO: Macros with complex values should be enclosed in
parenthesis
#88: FILE: lib/librte_vhost/vhost.h:57:
+#define vhost_for_each_try_unroll(iter, val, size) _Pragma("unroll (4)") \
+	for (iter = val; iter < size; iter++)

total: 3 errors, 1 warnings, 67 lines checked

0/1 valid patch/tmp/dpdk_build/lib/librte_vhost/virtio_net.c:2065:1:
error: unused function 'free_zmbuf' [-Werror,-Wunused-function]
free_zmbuf(struct vhost_virtqueue *vq)
^
1 error generated.
make[5]: *** [virtio_net.o] Error 1
make[4]: *** [librte_vhost] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [lib] Error 2
make[2]: *** [all] Error 2
make[1]: *** [pre_install] Error 2
make: *** [install] Error 2


On 10/22/19 12:08 AM, Marvin Liu wrote:
> Packed ring has more compact ring format and thus can significantly
> reduce the number of cache miss. It can lead to better performance.
> This has been approved in virtio user driver, on normal E5 Xeon cpu
> single core performance can raise 12%.
> 
> http://mails.dpdk.org/archives/dev/2018-April/095470.html
> 
> However vhost performance with packed ring performance was decreased.
> Through analysis, mostly extra cost was from the calculating of each
> descriptor flag which depended on ring wrap counter. Moreover, both
> frontend and backend need to write same descriptors which will cause
> cache contention. Especially when doing vhost enqueue function, virtio
> refill packed ring function may write same cache line when vhost doing
> enqueue function. This kind of extra cache cost will reduce the benefit
> of reducing cache misses. 
> 
> For optimizing vhost packed ring performance, vhost enqueue and dequeue
> function will be splitted into fast and normal path.
> 
> Several methods will be taken in fast path:
>   Handle descriptors in one cache line by batch.
>   Split loop function into more pieces and unroll them.
>   Prerequisite check that whether I/O space can copy directly into mbuf
>     space and vice versa. 
>   Prerequisite check that whether descriptor mapping is successful.
>   Distinguish vhost used ring update function by enqueue and dequeue
>     function.
>   Buffer dequeue used descriptors as many as possible.
>   Update enqueue used descriptors by cache line.
> 
> After all these methods done, single core vhost PvP performance with 64B
> packet on Xeon 8180 can boost 35%.
> 
> v8:
> - Allocate mbuf by virtio_dev_pktmbuf_alloc
> 
> v7:
> - Rebase code
> - Rename unroll macro and definitions
> - Calculate flags when doing single dequeue
> 
> v6:
> - Fix dequeue zcopy result check
> 
> v5:
> - Remove disable sw prefetch as performance impact is small
> - Change unroll pragma macro format
> - Rename shadow counter elements names
> - Clean dequeue update check condition
> - Add inline functions replace of duplicated code
> - Unify code style
> 
> v4:
> - Support meson build
> - Remove memory region cache for no clear performance gain and ABI break
> - Not assume ring size is power of two
> 
> v3:
> - Check available index overflow
> - Remove dequeue remained descs number check
> - Remove changes in split ring datapath
> - Call memory write barriers once when updating used flags
> - Rename some functions and macros
> - Code style optimization
> 
> v2:
> - Utilize compiler's pragma to unroll loop, distinguish clang/icc/gcc
> - Buffered dequeue used desc number changed to (RING_SZ - PKT_BURST)
> - Optimize dequeue used ring update when in_order negotiated
> 
> 
> Marvin Liu (13):
>   vhost: add packed ring indexes increasing function
>   vhost: add packed ring single enqueue
>   vhost: try to unroll for each loop
>   vhost: add packed ring batch enqueue
>   vhost: add packed ring single dequeue
>   vhost: add packed ring batch dequeue
>   vhost: flush enqueue updates by cacheline
>   vhost: flush batched enqueue descs directly
>   vhost: buffer packed ring dequeue updates
>   vhost: optimize packed ring enqueue
>   vhost: add packed ring zcopy batch and single dequeue
>   vhost: optimize packed ring dequeue
>   vhost: optimize packed ring dequeue when in-order
> 
>  lib/librte_vhost/Makefile     |  18 +
>  lib/librte_vhost/meson.build  |   7 +
>  lib/librte_vhost/vhost.h      |  57 ++
>  lib/librte_vhost/virtio_net.c | 948 +++++++++++++++++++++++++++-------
>  4 files changed, 837 insertions(+), 193 deletions(-)
> 



More information about the dev mailing list