[dpdk-dev] [PATCH v2 0/5] virtio support for container

Neil Horman nhorman at tuxdriver.com
Thu Mar 24 14:45:40 CET 2016


On Thu, Mar 24, 2016 at 11:10:50AM +0800, Tan, Jianfeng wrote:
> Hi Neil,
> 
> On 3/24/2016 3:17 AM, Neil Horman wrote:
> >On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
> >>v1->v2:
> >>  - Rebase on the patchset of virtio 1.0 support.
> >>  - Fix failure to create non-hugepage memory.
> >>  - Fix wrong size of memory region when "single-file" is used.
> >>  - Fix setting of offset in virtqueue to use virtual address.
> >>  - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
> >>  - Add mac option to specify the mac address of this virtual device.
> >>  - Update doc.
> >>
> >>This patchset provides a high performance networking interface (virtio)
> >>for container-based DPDK applications. Starting DPDK apps in containers
> >>with exclusive ownership of NIC devices is beyond its scope.
> >>The basic idea here is to present a new virtual device (named eth_cvio),
> >>which can be discovered and initialized in container-based DPDK apps using
> >>rte_eal_init(). To minimize the change, we reuse the already-existing
> >>virtio frontend driver code (driver/net/virtio/).
> >>Compared to the QEMU/VM case, the virtio device framework (which translates
> >>I/O port r/w operations into the unix socket/cuse protocol, and which is
> >>originally provided by QEMU) is integrated into the virtio frontend driver.
> >>So this converged driver actually plays both the role of the original
> >>frontend driver and the role of the QEMU device framework.
> >>The major difference lies in how to calculate the address handed to vhost.
> >>The principle of virtio is that, based on one or multiple shared memory
> >>segments, vhost maintains a reference table with the base address and
> >>length of each segment, so that an address coming from the VM (usually a
> >>GPA, Guest Physical Address) can be translated into a vhost-recognizable
> >>address (called a VVA, Vhost Virtual Address). To decrease the overhead of
> >>address translation, we should maintain as few segments as possible. In the
> >>VM case, the GPA is always locally contiguous. In the container case, the
> >>CVA (Container Virtual Address) can be used. Specifically (a small sketch
> >>follows this list):
> >>a. when set_base_addr is performed, the CVA is used;
> >>b. when preparing RX descriptors, the CVA is used;
> >>c. when transmitting packets, the CVA is filled into TX descriptors;
> >>d. in the TX and CQ headers, the CVA is used.
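> >>
> >>For illustration only, a rough sketch of point c. above; the helper name
> >>and exact headers are hypothetical/simplified, not this series' code:
> >>
> >>#include <stdint.h>
> >>#include <rte_mbuf.h>
> >>#include "virtio_ring.h"   /* struct vring_desc: addr/len/flags/next */
> >>
> >>/*
> >> * Sketch: fill a TX descriptor in the container case. The address field
> >> * carries the CVA (the mbuf's virtual address) rather than a translated
> >> * physical address, so vhost can use it directly once the covering
> >> * memory region has been shared.
> >> */
> >>static void
> >>cvio_fill_tx_desc(struct vring_desc *desc, struct rte_mbuf *m)
> >>{
> >>	desc->addr  = (uintptr_t)rte_pktmbuf_mtod(m, void *); /* CVA */
> >>	desc->len   = rte_pktmbuf_data_len(m);
> >>	desc->flags = 0;                /* one device-readable buffer */
> >>	desc->next  = 0;
> >>}
> >>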
> >>How to share memory? In the VM case, qemu always shares the whole physical
> >>memory layout with the backend. But it's not feasible for a container, as
> >>a process, to share all of its virtual memory regions with the backend. So
> >>only specified virtual memory regions (mapped as shared) are sent to the
> >>backend, with the limitation that only addresses inside these regions can
> >>be used to transmit or receive packets.
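> >>
> >>As a minimal sketch of the idea, assuming hypothetical struct and field
> >>names (this is not the actual vhost-user protocol layout):
> >>
> >>#include <stdint.h>
> >>
> >>/* One shared region as the backend could record it. */
> >>struct shared_mem_region {
> >>	uint64_t frontend_addr;  /* CVA of the region in the container */
> >>	uint64_t size;           /* length of the region */
> >>	uint64_t mmap_offset;    /* offset into the fd the backend mmaps */
> >>	int      fd;             /* hugepage-backed fd passed over the socket */
> >>};
> >>
> >>/* Backend side: translate a frontend address (CVA) into the backend's
> >> * own mapping of the same shared memory. */
> >>static void *
> >>translate(const struct shared_mem_region *r, uint64_t nregions,
> >>	  uint64_t addr, void *const *backend_base)
> >>{
> >>	for (uint64_t i = 0; i < nregions; i++)
> >>		if (addr >= r[i].frontend_addr &&
> >>		    addr < r[i].frontend_addr + r[i].size)
> >>			return (char *)backend_base[i] +
> >>			       (addr - r[i].frontend_addr);
> >>	return NULL;  /* outside the shared regions: cannot be used */
> >>}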
> >>
> >>Known issues
> >>
> >>a. When used with vhost-net, root privilege is required to create the tap
> >>device inside the container.
> >>b. Control queue and multi-queue are not supported yet.
> >>c. When the --single-file option is used, the socket_id of the memory may
> >>be wrong. (Use "numactl -N x -m x" to work around this for now.)
> >>
> >>How to use?
> >>
> >>a. Apply this patchset.
> >>
> >>b. To compile container apps:
> >>$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>
> >>c. To build a docker image using Dockerfile below.
> >>$: cat ./Dockerfile
> >>FROM ubuntu:latest
> >>WORKDIR /usr/src/dpdk
> >>COPY . /usr/src/dpdk
> >>ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> >>$: docker build -t dpdk-app-l2fwd .
> >>
> >>d. Used with vhost-user
> >>$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
> >>	--socket-mem 1024,1024 -- -p 0x1 --stats 1
> >>$: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
> >>	-v /dev/hugepages:/dev/hugepages \
> >>	dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> >>	--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
> >>
> >>e. Used with vhost-net
> >>$: modprobe vhost
> >>$: modprobe vhost-net
> >>$: docker run -i -t --privileged \
> >>	-v /dev/vhost-net:/dev/vhost-net \
> >>	-v /dev/net/tun:/dev/net/tun \
> >>	-v /dev/hugepages:/dev/hugepages \
> >>	dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> >>	--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
> >>
> >>By the way, it's not necessary to run in a container.
> >>
> >>Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> >>Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
> >>
> >>Jianfeng Tan (5):
> >>   mem: add --single-file to create single mem-backed file
> >>   mem: add API to obtain memory-backed file info
> >>   virtio/vdev: add embedded device emulation
> >>   virtio/vdev: add a new vdev named eth_cvio
> >>   docs: add release note for virtio for container
> >>
> >>  config/common_linuxapp                     |   5 +
> >>  doc/guides/rel_notes/release_2_3.rst       |   4 +
> >>  drivers/net/virtio/Makefile                |   4 +
> >>  drivers/net/virtio/vhost.h                 | 194 +++++++
> >>  drivers/net/virtio/vhost_embedded.c        | 809 +++++++++++++++++++++++++++++
> >>  drivers/net/virtio/virtio_ethdev.c         | 329 +++++++++---
> >>  drivers/net/virtio/virtio_ethdev.h         |   6 +-
> >>  drivers/net/virtio/virtio_pci.h            |  15 +-
> >>  drivers/net/virtio/virtio_rxtx.c           |   6 +-
> >>  drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
> >>  drivers/net/virtio/virtqueue.h             |  15 +-
> >>  lib/librte_eal/common/eal_common_options.c |  17 +
> >>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
> >>  lib/librte_eal/common/eal_options.h        |   2 +
> >>  lib/librte_eal/common/include/rte_memory.h |  16 +
> >>  lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
> >>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
> >>  17 files changed, 1435 insertions(+), 93 deletions(-)
> >>  create mode 100644 drivers/net/virtio/vhost.h
> >>  create mode 100644 drivers/net/virtio/vhost_embedded.c
> >>
> >>-- 
> >>2.1.4
> >>
> >So, first off, apologies for being so late to review this patch; it's been on
> >my todo list forever, and I've just not gotten to it.
> >
> >I've taken a cursory look at the code, and I can't find anything glaringly wrong
> >with it.
> 
> Thanks very much for reviewing this series.
> 
> >
> >That said, I'm a bit confused about the overall purpose of this PMD.  I've read
> >the description several times now, and I _think_ I understand the purpose and
> >construction of the PMD. Please correct me if this is not the (admittedly very
> >generalized) overview:
> >
> >1) You've created a vdev PMD that is generally named eth_cvio%n, which serves as
> >a virtual NIC suitable for use in a containerized space
> >
> >2) The PMD in (1) establishes a connection to the host via the vhost backend
> >(which is either a socket or a character device), which it uses to forward data
> >from the containerized dpdk application
> 
> The socket or the character device is used just for control plane messages
> to set up the datapath. The data does not go through the socket or the
> character device.
> 
> >
> >3) The system hosting the containerized dpdk application ties the other end of
> >the tun/tap interface established in (2) to some other forwarding mechanism
> >(ostensibly a host based dpdk forwarder) to send the frame out on the physical
> >wire.
> 
> There are two kinds of vhost backend:
> (1) vhost-user: no need for a tun/tap. The cvio PMD connects to the backend
> socket and communicates memory region information with the vhost-user
> backend (the backend is another DPDK application using the vhost PMD by
> Tetsuya, or using the vhost library like the vhost example does).
> (2) vhost-net: here we need a tun/tap. When we open the /dev/vhost-net char
> device and issue some ioctls on it, it starts a kthread (the backend). We
> need an interface (tun/tap) as an agent to blend into kernel networking, so
> that the kthread knows where to send the packets coming from the frontend,
> and where to receive packets to deliver to the frontend. A rough sketch of
> this setup sequence follows.
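> 
> As a minimal sketch, assuming only the standard kernel ioctls from
> <linux/if_tun.h> and <linux/vhost.h> (error handling omitted; this is not
> the exact code in the series):
> 
> #include <fcntl.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <net/if.h>
> #include <linux/if_tun.h>
> #include <linux/vhost.h>
> 
> /* Create a tap device and hand it to the vhost-net kthread as the
>  * backend of a virtqueue. */
> int setup_vhost_net(const char *tap_name)
> {
> 	struct ifreq ifr;
> 	int tap_fd = open("/dev/net/tun", O_RDWR);
> 	int vhost_fd = open("/dev/vhost-net", O_RDWR);
> 
> 	memset(&ifr, 0, sizeof(ifr));
> 	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> 	strncpy(ifr.ifr_name, tap_name, IFNAMSIZ - 1);
> 	ioctl(tap_fd, TUNSETIFF, &ifr);        /* needs root / CAP_NET_ADMIN */
> 
> 	ioctl(vhost_fd, VHOST_SET_OWNER);      /* starts the vhost kthread */
> 	/* ... VHOST_SET_FEATURES, VHOST_SET_MEM_TABLE and the
> 	 * VHOST_SET_VRING_* calls would go here ... */
> 	struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
> 	ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend); /* per virtqueue */
> 	return vhost_fd;
> }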
> 
> To be honest, vhost-user is the preferred way to achieve high performance.
> As far as vhost-net is concerned, it goes through the kernel network stack,
> which is the performance bottleneck.
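> 
> For completeness, a rough sketch of where the data actually flows in the
> vhost-user case: packets sit in shared (hugepage-backed) memory, the rings
> are updated in user space, and the only kernel involvement on the nominal
> path is an eventfd write used as the kick (illustrative code, not this
> series'):
> 
> #include <stdint.h>
> #include <unistd.h>
> 
> /* After a descriptor carrying a CVA is placed in the avail ring, the
>  * frontend notifies the backend by writing to the kickfd eventfd that
>  * was handed to vhost during setup. The payload itself never crosses
>  * the socket; both sides read/write the same shared memory. */
> static void
> notify_backend(int kickfd)
> {
> 	uint64_t one = 1;
> 
> 	/* This write() can be skipped entirely when the backend polls. */
> 	(void)write(kickfd, &one, sizeof(one));
> }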
> 
Sure, that makes sense.  So in the vhost-user case, we just read/write to a
shared memory region?  I.e. no user/kernel space transition for the nominal data
path?  If that's the case, then that's the piece I'm missing.
Neil


