[dpdk-dev] [RFC 0/5] virtio support for container

Zhuangyanying ann.zhuangyanying at huawei.com
Tue Nov 24 04:53:00 CET 2015



> -----Original Message-----
> From: Jianfeng Tan [mailto:jianfeng.tan at intel.com]
> Sent: Friday, November 06, 2015 2:31 AM
> To: dev at dpdk.org
> Cc: mst at redhat.com; mukawa at igel.co.jp; nakajima.yoshihiro at lab.ntt.co.jp;
> michael.qiu at intel.com; Guohongzhen; Zhoujingbin; Zhuangyanying; Zhangbo
> (Oscar); gaoxiaoqiu; Zhbzg; huawei.xie at intel.com; Jianfeng Tan
> Subject: [RFC 0/5] virtio support for container
> 
> This patchset only acts as a PoC to request comments from the community.
> 
> This patchset provides a high performance networking interface (virtio)
> for container-based DPDK applications. How to start DPDK applications in
> containers with exclusive ownership of NIC devices is beyond its scope.
> The basic idea here is to present a new virtual device (named eth_cvio),
> which can be discovered and initialized by rte_eal_init() in
> container-based DPDK applications. To minimize the change, we reuse the
> already-existing virtio frontend driver code (driver/net/virtio/).
> 
> Compared to the QEMU/VM case, the virtio device framework (which
> translates I/O port r/w operations into the unix socket/cuse protocol,
> and is originally provided by QEMU) is integrated into the virtio
> frontend driver. In other words, this new converged driver plays both
> the role of the original frontend driver and the role of QEMU's device
> framework.
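
A minimal sketch of that idea (not the patch's actual code; cvio_io_write and the message framing below are hypothetical): in the QEMU/VM case a virtio register access traps into QEMU's device model, while in the converged driver the same access is simply forwarded as a message over the unix socket to the backend.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical register-write hook: instead of issuing real port I/O,
 * forward the (offset, value) pair to the backend over the already
 * connected unix socket. */
struct cvio_reg_write {
	uint64_t offset;
	uint32_t value;
};

static void
cvio_io_write(int backend_fd, uint64_t offset, uint32_t value)
{
	struct cvio_reg_write msg = { .offset = offset, .value = value };

	if (write(backend_fd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		perror("cvio_io_write");
}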
> 
> The biggest difference here lies in how the address handed to the
> backend is calculated. The principle of virtio is that, based on one or
> multiple shared memory segments, vhost maintains a reference system with
> the base addresses and lengths of these segments, so that when an
> address arrives from the VM (usually a GPA, Guest Physical Address),
> vhost can translate it into an address it can dereference itself (a VVA,
> Vhost Virtual Address). To decrease the overhead of address translation,
> we should maintain as few segments as possible. In the context of
> virtual machines, GPA is always locally contiguous, so it is a good
> choice. In the container case, the CVA (Container Virtual Address) can
> be used instead. This means that:
> a. when set_base_addr is handled, the CVA is used;
> b. when preparing RX descriptors, the CVA is used;
> c. when transmitting packets, the CVA is filled into TX descriptors;
> d. in the TX and CQ headers, the CVA is used (see the sketch below).
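
To illustrate point c above (a hedged sketch; struct vring_desc follows the virtio spec layout, and fill_desc_with_cva is a hypothetical helper, not a function from this patchset): in the container case the buffer's process virtual address goes into the descriptor as-is, with no GPA-style translation on the frontend side.

#include <stdint.h>

/* Descriptor layout as defined by the virtio spec. */
struct vring_desc {
	uint64_t addr;	/* GPA in the VM case, CVA in the container case */
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Container case: write the buffer's container virtual address directly. */
static void
fill_desc_with_cva(struct vring_desc *desc, void *buf, uint32_t len)
{
	desc->addr = (uint64_t)(uintptr_t)buf;
	desc->len = len;
}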
> 
> How is memory shared? In the VM case, QEMU always shares the whole
> physical memory layout with the backend. It is not feasible, however,
> for a container, as a process, to share all of its virtual memory
> regions with the backend, so only specified virtual memory regions (of
> type shared) are sent to the backend. This leads to the limitation that
> only addresses within these areas can be used to transmit or receive
> packets. For now, the shared memory is created in /dev/shm using
> shm_open() during the memory initialization process, as sketched below.
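
A minimal sketch of that allocation path (the segment name, flags and error handling are illustrative, not the patchset's actual code): the segment is created with shm_open()/ftruncate()/mmap(), and the fd is kept so it can later be handed to the vhost backend.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Back a memory segment with a named object under /dev/shm so that the
 * resulting fd can be shared with the vhost backend. */
static void *
alloc_shared_segment(const char *name, size_t len, int *out_fd)
{
	int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
	void *va;

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}
	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (va == MAP_FAILED) {
		close(fd);
		return NULL;
	}
	*out_fd = fd;	/* later passed to the backend, e.g. over the unix socket */
	return va;
}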
> 
> How to use?
> 
> a. Apply the virtio-for-container patchset. We need two copies of the
> patched code (referred to as dpdk-app/ and dpdk-vhost/).
> 
> b. To compile container apps:
> $: cd dpdk-app
> $: vim config/common_linuxapp (uncomment "CONFIG_RTE_VIRTIO_VDEV=y")
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> 
> c. To build a docker image, use the Dockerfile below:
> $: cat ./Dockerfile
> FROM ubuntu:latest
> WORKDIR /usr/src/dpdk
> COPY . /usr/src/dpdk
> CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0xc", "-n", "4",
> "--no-huge", "--no-pci",
> "--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost",
> "--", "-p", "0x1"]
> $: docker build -t dpdk-app-l2fwd .
> 
> d. To compile vhost:
> $: cd dpdk-vhost
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> 
> e. Start vhost-switch:
> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 --socket-mem 1024,1024 -- -p 0x1 --stats 1
> 
> f. Start the docker container:
> $: docker run -i -t -v <path to vhost unix socket>:/var/run/usvhost dpdk-app-l2fwd
> 
> Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
> 
> Jianfeng Tan (5):
>   virtio/container: add handler for ioport rd/wr
>   virtio/container: add a new virtual device named eth_cvio
>   virtio/container: unify desc->addr assignment
>   virtio/container: adjust memory initialization process
>   vhost/container: change mode of vhost listening socket
> 
>  config/common_linuxapp                       |   5 +
>  drivers/net/virtio/Makefile                  |   4 +
>  drivers/net/virtio/vhost-user.c              | 433 +++++++++++++++++++++++++++
>  drivers/net/virtio/vhost-user.h              | 137 +++++++++
>  drivers/net/virtio/virtio_ethdev.c           | 319 +++++++++++++++-----
>  drivers/net/virtio/virtio_ethdev.h           |  16 +
>  drivers/net/virtio/virtio_pci.h              |  32 +-
>  drivers/net/virtio/virtio_rxtx.c             |   9 +-
>  drivers/net/virtio/virtio_rxtx_simple.c      |   9 +-
>  drivers/net/virtio/virtqueue.h               |   9 +-
>  lib/librte_eal/common/include/rte_memory.h   |   5 +
>  lib/librte_eal/linuxapp/eal/eal_memory.c     |  58 +++-
>  lib/librte_mempool/rte_mempool.c             |  16 +-
>  lib/librte_vhost/vhost_user/vhost-net-user.c |   5 +
>  14 files changed, 967 insertions(+), 90 deletions(-)
>  create mode 100644 drivers/net/virtio/vhost-user.c
>  create mode 100644 drivers/net/virtio/vhost-user.h
> 
> --
> 2.1.4

This patch raises a good idea: adding an extra abstracted I/O layer, which would make it simple to extend the function to a kernel-mode switch (such as OVS). That's great.
But I have one question here:
    It concerns VHOST_USER_SET_MEM_TABLE. You allocate memory from a tmpfs filesystem with just one fd, so rte_memseg_info_get() could be used to directly get the memory topology. However, things change in the kernel-space case, because the mempool should be created on each container's hugetlbfs (rather than tmpfs), which is separate for each container, and finally the ioctl's parameters have to be taken into account.
    My solution is as follows, for your reference:
	/* Region 0 covers the RX mempool's element area. */
	reg = mem->regions;
	reg->guest_phys_addr = (__u64)((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_start;
	reg->userspace_addr = reg->guest_phys_addr;
	reg->memory_size = ((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_end - reg->guest_phys_addr;

	/* Region 1 covers the TX queue's virtio-net header area. */
	reg = mem->regions + 1;
	reg->guest_phys_addr = (__u64)(((struct virtqueue *)(dev->data->tx_queues[0]))->virtio_net_hdr_mem);
	reg->userspace_addr = reg->guest_phys_addr;
	reg->memory_size = vq_size * internals->vtnet_hdr_size;
    But it's a little ugly. Any better idea?


