[dpdk-dev] [PATCH v2 0/5] virtio support for container

Neil Horman nhorman at tuxdriver.com
Fri Mar 25 12:06:06 CET 2016


On Fri, Mar 25, 2016 at 09:25:49AM +0800, Tan, Jianfeng wrote:
> 
> 
> On 3/24/2016 9:45 PM, Neil Horman wrote:
> >On Thu, Mar 24, 2016 at 11:10:50AM +0800, Tan, Jianfeng wrote:
> >>Hi Neil,
> >>
> >>On 3/24/2016 3:17 AM, Neil Horman wrote:
> >>>On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
> >>>>v1->v2:
> >>>>  - Rebase on the patchset of virtio 1.0 support.
> >>>>  - Fix inability to create non-hugepage memory.
> >>>>  - Fix wrong size of memory region when "single-file" is used.
> >>>>  - Fix setting of offset in virtqueue to use virtual address.
> >>>>  - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
> >>>>  - Add mac option to specify the mac address of this virtual device.
> >>>>  - Update doc.
> >>>>
> >>>>This patchset provides a high performance networking interface (virtio)
> >>>>for container-based DPDK applications. Starting DPDK apps in containers
> >>>>with exclusive ownership of NIC devices is beyond its scope.
> >>>>The basic idea here is to present a new virtual device (named eth_cvio),
> >>>>which can be discovered and initialized in container-based DPDK apps using
> >>>>rte_eal_init(). To minimize the change, we reuse already-existing virtio
> >>>>frontend driver code (driver/net/virtio/).
> >>>>Compared to the QEMU/VM case, the virtio device framework (which translates
> >>>>I/O port r/w operations into the unix socket/cuse protocol and is originally
> >>>>provided by QEMU) is integrated into the virtio frontend driver. So this
> >>>>converged driver actually plays both the role of the original frontend driver
> >>>>and the role of the QEMU device framework.
> >>>>The major difference lies in how to calculate relative addresses for vhost.
> >>>>The principle is that, based on one or multiple shared memory segments, vhost
> >>>>maintains a reference system with the base address and length of each segment,
> >>>>so that an address coming from the VM (usually a GPA, Guest Physical Address)
> >>>>can be translated into a vhost-recognizable address (named VVA, Vhost Virtual
> >>>>Address). To decrease the overhead of address translation, we should maintain
> >>>>as few segments as possible (a minimal sketch of this translation follows the
> >>>>list below). In the VM case, GPAs are locally contiguous. In the container
> >>>>case, the CVA (Container Virtual Address) can be used. Specifically:
> >>>>a. when set_base_addr is called, the CVA is used;
> >>>>b. when preparing RX descriptors, the CVA is used;
> >>>>c. when transmitting packets, the CVA is filled into TX descriptors;
> >>>>d. in TX and CQ headers, the CVA is used.
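> >>>>
> >>>>For illustration only, here is a minimal sketch (plain C, illustrative names,
> >>>>not the actual vhost/DPDK code) of the region-based lookup that turns a GPA
> >>>>or CVA into a VVA:
> >>>>
> >>>>#include <stdint.h>
> >>>>#include <stddef.h>
> >>>>
> >>>>/* One entry per shared memory segment registered by the frontend. */
> >>>>struct mem_region {
> >>>>	uint64_t guest_addr;  /* GPA (VM case) or CVA (container case) */
> >>>>	uint64_t host_addr;   /* VVA: where vhost mapped this segment */
> >>>>	uint64_t size;
> >>>>};
> >>>>
> >>>>/* Translate an address found in a descriptor into a VVA. */
> >>>>static void *to_vva(const struct mem_region *regs, int nregs, uint64_t addr)
> >>>>{
> >>>>	for (int i = 0; i < nregs; i++) {
> >>>>		if (addr >= regs[i].guest_addr &&
> >>>>		    addr < regs[i].guest_addr + regs[i].size)
> >>>>			return (void *)(uintptr_t)
> >>>>			       (regs[i].host_addr + (addr - regs[i].guest_addr));
> >>>>	}
> >>>>	return NULL;  /* address outside the shared regions */
> >>>>}
> >>>>
> >>>>The fewer regions there are, the cheaper this lookup, which is why keeping
> >>>>the number of segments small matters.
> >>>>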
> >>>>How to share memory? In the VM case, QEMU always shares the whole physical
> >>>>memory layout with the backend. But it's not feasible for a container, which
> >>>>is just a process, to share all of its virtual memory regions with the
> >>>>backend. So only specified virtual memory regions (of shared type) are sent
> >>>>to the backend. This brings the limitation that only addresses in these
> >>>>regions can be used to transmit or receive packets.
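> >>>>
> >>>>As a rough illustration (the path and size below are made-up examples, not
> >>>>what the EAL or the --single-file option actually uses), such a region boils
> >>>>down to a MAP_SHARED, file-backed mapping whose fd can later be passed to the
> >>>>backend:
> >>>>
> >>>>#include <fcntl.h>
> >>>>#include <sys/mman.h>
> >>>>#include <unistd.h>
> >>>>
> >>>>/* Map one shared, file-backed region; only mappings like this, whose fd
> >>>> * can be handed to the backend, are visible to vhost. */
> >>>>static void *map_shared_region(const char *path, size_t len, int *fd_out)
> >>>>{
> >>>>	int fd = open(path, O_CREAT | O_RDWR, 0600);
> >>>>	if (fd < 0)
> >>>>		return NULL;
> >>>>	if (ftruncate(fd, (off_t)len) < 0) {
> >>>>		close(fd);
> >>>>		return NULL;
> >>>>	}
> >>>>	*fd_out = fd;
> >>>>	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> >>>>}
> >>>>
> >>>>e.g. map_shared_region("/dev/hugepages/cvio0", 1UL << 30, &fd) for a 1 GB
> >>>>hugepage-backed region.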
> >>>>
> >>>>Known issues
> >>>>
> >>>>a. When used with vhost-net, root privilege is required to create the tap
> >>>>device inside the container.
> >>>>b. Control queue and multi-queue are not supported yet.
> >>>>c. When the --single-file option is used, the socket_id of the memory may be
> >>>>wrong. (Use "numactl -N x -m x" to work around this for now.)
> >>>>How to use?
> >>>>
> >>>>a. Apply this patchset.
> >>>>
> >>>>b. To compile container apps:
> >>>>$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>
> >>>>c. To build a docker image, use the Dockerfile below:
> >>>>$: cat ./Dockerfile
> >>>>FROM ubuntu:latest
> >>>>WORKDIR /usr/src/dpdk
> >>>>COPY . /usr/src/dpdk
> >>>>ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> >>>>$: docker build -t dpdk-app-l2fwd .
> >>>>
> >>>>d. To use with vhost-user:
> >>>>$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
> >>>>	--socket-mem 1024,1024 -- -p 0x1 --stats 1
> >>>>$: docker run -i -t -v <path_to_vhost_unix_socket>:/var/run/usvhost \
> >>>>	-v /dev/hugepages:/dev/hugepages \
> >>>>	dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> >>>>	--vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
> >>>>
> >>>>e. To use with vhost-net:
> >>>>$: modprobe vhost
> >>>>$: modprobe vhost-net
> >>>>$: docker run -i -t --privileged \
> >>>>	-v /dev/vhost-net:/dev/vhost-net \
> >>>>	-v /dev/net/tun:/dev/net/tun \
> >>>>	-v /dev/hugepages:/dev/hugepages \
> >>>>	dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> >>>>	--vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
> >>>>
> >>>>By the way, it's not necessary to run in a container.
> >>>>
> >>>>Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> >>>>Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
> >>>>
> >>>>Jianfeng Tan (5):
> >>>>   mem: add --single-file to create single mem-backed file
> >>>>   mem: add API to obtain memory-backed file info
> >>>>   virtio/vdev: add embeded device emulation
> >>>>   virtio/vdev: add a new vdev named eth_cvio
> >>>>   docs: add release note for virtio for container
> >>>>
> >>>>  config/common_linuxapp                     |   5 +
> >>>>  doc/guides/rel_notes/release_2_3.rst       |   4 +
> >>>>  drivers/net/virtio/Makefile                |   4 +
> >>>>  drivers/net/virtio/vhost.h                 | 194 +++++++
> >>>>  drivers/net/virtio/vhost_embedded.c        | 809 +++++++++++++++++++++++++++++
> >>>>  drivers/net/virtio/virtio_ethdev.c         | 329 +++++++++---
> >>>>  drivers/net/virtio/virtio_ethdev.h         |   6 +-
> >>>>  drivers/net/virtio/virtio_pci.h            |  15 +-
> >>>>  drivers/net/virtio/virtio_rxtx.c           |   6 +-
> >>>>  drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
> >>>>  drivers/net/virtio/virtqueue.h             |  15 +-
> >>>>  lib/librte_eal/common/eal_common_options.c |  17 +
> >>>>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
> >>>>  lib/librte_eal/common/eal_options.h        |   2 +
> >>>>  lib/librte_eal/common/include/rte_memory.h |  16 +
> >>>>  lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
> >>>>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
> >>>>  17 files changed, 1435 insertions(+), 93 deletions(-)
> >>>>  create mode 100644 drivers/net/virtio/vhost.h
> >>>>  create mode 100644 drivers/net/virtio/vhost_embedded.c
> >>>>
> >>>>-- 
> >>>>2.1.4
> >>>>
> >>>So, first off, apologies for being so late to review this patch; it's been on my
> >>>todo list forever, and I've just not gotten to it.
> >>>
> >>>I've taken a cursory look at the code, and I can't find anything glaringly wrong
> >>>with it.
> >>Thanks very much for reviewing this series.
> >>
> >>>That said, I'm a bit confused about the overall purpose of this PMD.  I've read
> >>>the description several times now, and I _think_ I understand the purpose and
> >>>construction of the PMD. Please correct me if this is not the (admittedly very
> >>>generalized) overview:
> >>>
> >>>1) You've created a vdev PMD that is generally named eth_cvio%n, which serves as
> >>>a virtual NIC suitable for use in a containerized space
> >>>
> >>>2) The PMD in (1) establishes a connection to the host via the vhost backend
> >>>(which is either a socket or a character device), which it uses to forward data
> >>>from the containerized dpdk application
> >>
> >>The socket or the character device is used just for control plane messages
> >>to set up the datapath. The data does not go through the socket or the
> >>character device.
> >>
> >>>3) The system hosting the containerized dpdk application ties the other end of
> >>>the tun/tap interface established in (2) to some other forwarding mechanism
> >>>(ostensibly a host based dpdk forwarder) to send the frame out on the physical
> >>>wire.
> >>There are two kinds of vhost backend:
> >>(1) vhost-user: no tun/tap is needed. The cvio PMD connects to the backend
> >>socket and communicates memory region information with the vhost-user backend
> >>(the backend is another DPDK application using the vhost PMD by Tetsuya, or
> >>using the vhost library like the vhost example does).
> >>(2) vhost-net: here we need a tun/tap. When we open the /dev/vhost-net char
> >>device and issue some ioctls on it, it just starts a kthread (the backend). We
> >>need an interface (tun/tap) as an agent to blend into kernel networking, so
> >>that the kthread knows where to send the packets coming from the frontend, and
> >>where to receive packets destined for the frontend (a rough sketch of this
> >>control path follows below).
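> >>
> >>Just to make that concrete, a rough sketch of the vhost-net control path in C
> >>(error handling and the rest of the setup, e.g. VHOST_SET_MEM_TABLE and the
> >>vring ioctls, are omitted; this is not the patch code itself):
> >>
> >>#include <fcntl.h>
> >>#include <string.h>
> >>#include <sys/ioctl.h>
> >>#include <net/if.h>
> >>#include <linux/if_tun.h>
> >>#include <linux/vhost.h>
> >>
> >>static int setup_vhost_net(void)
> >>{
> >>	/* Opening /dev/vhost-net and issuing VHOST_SET_OWNER starts the
> >>	 * backend kthread for this fd. */
> >>	int vhost_fd = open("/dev/vhost-net", O_RDWR);
> >>	ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
> >>
> >>	/* The tap device is the agent that blends into kernel networking:
> >>	 * frontend packets go out of it, packets arriving on it are
> >>	 * delivered back to the frontend. */
> >>	int tap_fd = open("/dev/net/tun", O_RDWR);
> >>	struct ifreq ifr;
> >>	memset(&ifr, 0, sizeof(ifr));
> >>	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> >>	ioctl(tap_fd, TUNSETIFF, &ifr);
> >>
> >>	/* Attach the tap fd as the backend of vring 0. */
> >>	struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
> >>	ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
> >>	return vhost_fd;
> >>}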
> >>
> >>To be honest, vhost-user is the preferred way to achieve high performance.
> >>As far as vhost-net is concerned, traffic goes through the kernel network
> >>stack, which is the performance bottleneck.
> >>
> >Sure, that makes sense.  So in the vhost-user case, we just read/write to a
> >shared memory region?  I.e. no user/kernel space transition for the nominal data
> >path?  If that's the case, then that's the piece I'm missing.
> >Neil
> 
> Yes, exactly, for now (both sides are in polling mode). Plus, we are trying to
> add an interrupt mode so that a large number of containers can run with this new
> PMD. In interrupt mode, the user/kernel transition would be handled smartly: it
> is the other side's responsibility to tell this side whether it needs to be
> woken up, so a user/kernel space transition happens only when a wakeup is
> necessary.
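> 
> Conceptually this resembles the usual eventfd kick/call scheme; a tiny sketch
> under that assumption (not the actual patch code; kickfd would come from
> eventfd(0, 0)):
> 
> #include <stdint.h>
> #include <unistd.h>
> #include <sys/eventfd.h>
> 
> /* The sleeping side blocks on an eventfd until the peer writes to it. */
> static void wait_for_kick(int kickfd)
> {
> 	uint64_t n;
> 	read(kickfd, &n, sizeof(n));  /* sleeps in the kernel until kicked */
> }
> 
> /* The other side writes to the eventfd only when the shared ring flags
>  * say the peer asked to be woken, so the syscall is paid only for a
>  * real wakeup. */
> static void kick_if_needed(int kickfd, int peer_wants_wakeup)
> {
> 	uint64_t one = 1;
> 	if (peer_wants_wakeup)
> 		write(kickfd, &one, sizeof(one));
> }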
> 
> Thanks,
> Jianfeng
> 

Ok, thank you for the clarification

Acked-By: Neil Horman <nhorman at tuxdriver.com>

