[dpdk-dev] [PATCH v2 0/5] virtio support for container

Neil Horman nhorman at redhat.com
Wed Mar 23 20:17:43 CET 2016


On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
> v1->v2:
>  - Rebase on the patchset of virtio 1.0 support.
>  - Fix failure to create non-hugepage memory.
>  - Fix wrong size of memory region when "single-file" is used.
>  - Fix setting of offset in virtqueue to use virtual address.
>  - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
>  - Add mac option to specify the mac address of this virtual device.
>  - Update doc.
> 
> This patchset provides a high-performance networking interface (virtio)
> for container-based DPDK applications. Starting DPDK apps in containers
> with exclusive ownership of NIC devices is beyond the scope of this work.
> The basic idea here is to present a new virtual device (named eth_cvio),
> which can be discovered and initialized in container-based DPDK apps using
> rte_eal_init(). To minimize the changes, we reuse the already-existing
> virtio frontend driver code (drivers/net/virtio/).
>  
> Compared to the QEMU/VM case, the virtio device framework (which translates
> I/O port r/w operations into the unix socket/cuse protocol, and which QEMU
> originally provides) is integrated into the virtio frontend driver. So this
> converged driver actually plays both the role of the original frontend
> driver and the role of the QEMU device framework.
>  
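For illustration, here is a minimal sketch of that converged role (not code
from the patch; every name below is invented). The frontend's device accesses
are serviced by an in-process stub and become vhost requests, rather than I/O
port accesses trapped and emulated by QEMU:

/* Illustrative only -- not the patch's actual code; all names are made up. */
#include <stdint.h>

struct cvio_dev {
	int vhostfd;     /* unix socket (vhost-user) or /dev/vhost-net fd */
	uint8_t status;  /* mirrors the virtio device status "register" */
};

/* What would have been an I/O port write to the status register in a VM. */
static void
cvio_set_status(struct cvio_dev *dev, uint8_t status)
{
	dev->status = status;
	if (status & 0x04) {	/* DRIVER_OK */
		/* Here the embedded emulation would push the memory table and
		 * vring addresses to the backend: VHOST_USER_SET_MEM_TABLE /
		 * VHOST_USER_SET_VRING_ADDR messages over dev->vhostfd when it
		 * is a unix socket, or the corresponding VHOST_* ioctls when
		 * it is /dev/vhost-net. */
	}
}
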
> The major difference lies in how addresses are calculated for vhost.
> The principle of virtio is: based on one or more shared memory segments,
> vhost maintains a reference table with the base address and length of each
> segment, so that an address coming from the VM (usually a GPA, Guest
> Physical Address) can be translated into an address vhost can dereference
> (named VVA, Vhost Virtual Address). To decrease the overhead of address
> translation, we should maintain as few segments as possible. In the VM
> case, the GPA is always locally contiguous. In the container case, the CVA
> (Container Virtual Address) can be used instead. Specifically:
> a. when set_base_addr is called, the CVA is used;
> b. when preparing RX descriptors, the CVA is used;
> c. when transmitting packets, the CVA is filled into TX descriptors;
> d. in the TX and control queue (CQ) headers, the CVA is used.
>  
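To make the translation concrete, a minimal sketch of the lookup a vhost
backend performs (illustrative only; the struct and function names are
invented, not taken from the patch):

#include <stdint.h>
#include <stddef.h>

/* One entry per shared memory segment registered with the backend. */
struct mem_region {
	uint64_t front_addr;  /* GPA in the VM case, CVA in the container case */
	uint64_t back_vaddr;  /* where the backend mapped the same memory (VVA) */
	uint64_t size;
};

/* Translate an address received from the frontend into one the backend can
 * dereference.  Fewer regions means fewer iterations here, which is why the
 * cover letter wants to keep the number of segments small. */
static void *
to_vhost_va(const struct mem_region *regs, int nregs, uint64_t addr)
{
	int i;

	for (i = 0; i < nregs; i++) {
		if (addr >= regs[i].front_addr &&
		    addr < regs[i].front_addr + regs[i].size)
			return (void *)(uintptr_t)(regs[i].back_vaddr +
						   (addr - regs[i].front_addr));
	}
	return NULL;  /* address is outside the shared regions */
}
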
> How to share memory? In the VM case, QEMU always shares the whole physical
> memory layout with the backend. But it's not feasible for a container, as
> an ordinary process, to share all of its virtual memory regions with the
> backend. So only designated virtual memory regions (mapped as shared) are
> sent to the backend. The limitation is that only addresses within these
> regions can be used to transmit or receive packets.
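
As a rough illustration of why the regions must be shareable (again, invented
names, not the patch's code): a region has to be file-backed and mapped
MAP_SHARED so its fd can be handed to the backend, which is what the
vhost-user VHOST_USER_SET_MEM_TABLE message carries.

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a file-backed, MAP_SHARED region that a vhost backend in another
 * process can also map.  Anonymous private memory cannot be handed over
 * like this, which is why only designated shared regions are usable for
 * packet buffers.  'path' would typically live on hugetlbfs, e.g. under
 * /dev/hugepages; the --single-file option keeps this to a single fd.
 */
static void *
map_shared_region(const char *path, size_t len, int *fd_out)
{
	void *va;
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, len) < 0) {
		close(fd);
		return NULL;
	}
	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (va == MAP_FAILED) {
		close(fd);
		return NULL;
	}
	*fd_out = fd;  /* this fd is what gets passed to the backend */
	return va;
}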
> 
> Known issues
> 
> a. When used with vhost-net, root privilege is required to create the tap
> device inside the container.
> b. Control queue and multi-queue are not supported yet.
> c. When the --single-file option is used, the socket_id of the memory may
> be wrong. (Use "numactl -N x -m x" to work around this for now.)
>  
> How to use?
> 
> a. Apply this patchset.
> 
> b. To compile container apps:
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> 
> c. To build a docker image, use the Dockerfile below:
> $: cat ./Dockerfile
> FROM ubuntu:latest
> WORKDIR /usr/src/dpdk
> COPY . /usr/src/dpdk
> ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> $: docker build -t dpdk-app-l2fwd .
> 
> d. To use with vhost-user:
> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
> 	--socket-mem 1024,1024 -- -p 0x1 --stats 1
> $: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
> 	-v /dev/hugepages:/dev/hugepages \
> 	dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> 	--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
> 
> e. To use with vhost-net:
> $: modprobe vhost
> $: modprobe vhost-net
> $: docker run -i -t --privileged \
> 	-v /dev/vhost-net:/dev/vhost-net \
> 	-v /dev/net/tun:/dev/net/tun \
> 	-v /dev/hugepages:/dev/hugepages \
> 	dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> 	--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
> 
> By the way, it's not necessary to run this in a container.
> 
> Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
> 
> Jianfeng Tan (5):
>   mem: add --single-file to create single mem-backed file
>   mem: add API to obtain memory-backed file info
>   virtio/vdev: add embeded device emulation
>   virtio/vdev: add a new vdev named eth_cvio
>   docs: add release note for virtio for container
> 
>  config/common_linuxapp                     |   5 +
>  doc/guides/rel_notes/release_2_3.rst       |   4 +
>  drivers/net/virtio/Makefile                |   4 +
>  drivers/net/virtio/vhost.h                 | 194 +++++++
>  drivers/net/virtio/vhost_embedded.c        | 809 +++++++++++++++++++++++++++++
>  drivers/net/virtio/virtio_ethdev.c         | 329 +++++++++---
>  drivers/net/virtio/virtio_ethdev.h         |   6 +-
>  drivers/net/virtio/virtio_pci.h            |  15 +-
>  drivers/net/virtio/virtio_rxtx.c           |   6 +-
>  drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
>  drivers/net/virtio/virtqueue.h             |  15 +-
>  lib/librte_eal/common/eal_common_options.c |  17 +
>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
>  lib/librte_eal/common/eal_options.h        |   2 +
>  lib/librte_eal/common/include/rte_memory.h |  16 +
>  lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
>  17 files changed, 1435 insertions(+), 93 deletions(-)
>  create mode 100644 drivers/net/virtio/vhost.h
>  create mode 100644 drivers/net/virtio/vhost_embedded.c
> 
> -- 
> 2.1.4
> 
So, first off, apologies for being so late to review this patch; it's been on my
todo list forever, and I've just not gotten to it.

I've taken a cursory look at the code, and I can't find anything glaringly wrong
with it.

That said, I'm a bit confused about the overall purpose of this PMD.  I've read
the description several times now, and I _think_ I understand the purpose and
construction of the PMD. Please correct me if this is not the (admittedly very
generalized) overview:

1) You've created a vdev PMD that is generally named eth_cvio%n, which serves as
a virtual NIC suitable for use in a containerized space.

2) The PMD in (1) establishes a connection to the host via the vhost backend
(which is either a socket or a character device), which it uses to forward data
from the containerized dpdk application.

3) The system hosting the containerized dpdk application ties the other end of
the tun/tap interface established in (2) to some other forwarding mechanism
(ostensibly a host-based dpdk forwarder) to send the frame out on the physical
wire.

If I understand that, it seems reasonable, but I have to ask why?  It feels a
bit like a re-invention of the wheel to me.  That is to say, for whatever
optimization this PMD may have, the far larger bottleneck is the tun/tap
interface in step (2).  If that's the case, then why create a new PMD at all?
Why not instead just use a tun/tap interface into the container, along with
the af_packet PMD for communication?  That has the ability to memory-map an
interface for relatively fast packet writes, so I expect it would be just as
performant as this solution, and without the need to write and maintain a new
PMD's worth of code.

I feel like I'm missing something here, so please clarify if I am, but at the
moment I'm having a hard time seeing the advantage of a new PMD here.

Regards
Neil


