[dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

Burakov, Anatoly anatoly.burakov at intel.com
Mon Jul 5 14:16:03 CEST 2021


On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> The use of IOMMU has many advantages, such as isolation and address
> translation. This patch extends the capability of the DMA engine to use
> the IOMMU if the DMA device is bound to vfio.
> 
> When the memory table is set, the guest memory will be mapped
> into the default container of DPDK.
> 
> Signed-off-by: Xuan Ding <xuan.ding at intel.com>
> ---
>   doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
>   lib/vhost/rte_vhost.h               |  1 +
>   lib/vhost/socket.c                  |  9 ++++++
>   lib/vhost/vhost.h                   |  1 +
>   lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
>   5 files changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> index 05c42c9b11..c3beda23d9 100644
> --- a/doc/guides/prog_guide/vhost_lib.rst
> +++ b/doc/guides/prog_guide/vhost_lib.rst
> @@ -118,6 +118,15 @@ The following is an overview of some key Vhost API functions:
>   
>       It is disabled by default.
>   
> +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> +
> +    In the asynchronous data path, the vhost library is not aware of which
> +    driver (igb_uio/vfio) the DMA device is bound to. The application should
> +    pass this flag to tell the vhost library whether the IOMMU should be
> +    programmed for guest memory.
> +
> +    It is disabled by default.
> +
>     - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
>   
>       Since v16.04, the vhost library forwards checksum and gso requests for
> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> index 8d875e9322..a766ea7b6b 100644
> --- a/lib/vhost/rte_vhost.h
> +++ b/lib/vhost/rte_vhost.h
> @@ -37,6 +37,7 @@ extern "C" {
>   #define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
>   #define RTE_VHOST_USER_ASYNC_COPY	(1ULL << 7)
>   #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS	(1ULL << 8)
> +#define RTE_VHOST_USER_ASYNC_USE_VFIO	(1ULL << 9)
>   
>   /* Features. */
>   #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> index 5d0d728d52..77c722c86b 100644
> --- a/lib/vhost/socket.c
> +++ b/lib/vhost/socket.c
> @@ -42,6 +42,7 @@ struct vhost_user_socket {
>   	bool extbuf;
>   	bool linearbuf;
>   	bool async_copy;
> +	bool async_use_vfio;
>   	bool net_compliant_ol_flags;
>   
>   	/*
> @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
>   			dev->async_copy = 1;
>   	}
>   
> +	if (vsocket->async_use_vfio) {
> +		dev = get_device(vid);
> +
> +		if (dev)
> +			dev->async_use_vfio = 1;
> +	}
> +
>   	VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
>   
>   	if (vsocket->notify_ops->new_connection) {
> @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
>   	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
>   	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
>   	vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> +	vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
>   	vsocket->net_compliant_ol_flags = flags & RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
>   
>   	if (vsocket->async_copy &&
> diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> index 8078ddff79..fb775ce4ed 100644
> --- a/lib/vhost/vhost.h
> +++ b/lib/vhost/vhost.h
> @@ -370,6 +370,7 @@ struct virtio_net {
>   	int16_t			broadcast_rarp;
>   	uint32_t		nr_vring;
>   	int			async_copy;
> +	int			async_use_vfio;
>   	int			extbuf;
>   	int			linearbuf;
>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> index 8f0eba6412..f3703f2e72 100644
> --- a/lib/vhost/vhost_user.c
> +++ b/lib/vhost/vhost_user.c
> @@ -45,6 +45,7 @@
>   #include <rte_common.h>
>   #include <rte_malloc.h>
>   #include <rte_log.h>
> +#include <rte_vfio.h>
>   
>   #include "iotlb.h"
>   #include "vhost.h"
> @@ -141,6 +142,36 @@ get_blk_size(int fd)
>   	return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
>   }
>   
> +static int
> +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> +{
> +	int ret = 0;
> +	uint64_t host_iova;
> +	host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
> +	if (do_map) {
> +		/* Add mapped region into the default container of DPDK. */
> +		ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +						 region->host_user_addr,
> +						 host_iova,
> +						 region->size);
> +		if (ret) {
> +			VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> +			return ret;
> +		}
> +	} else {
> +		/* Remove mapped region from the default container of DPDK. */
> +		ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> +						   region->host_user_addr,
> +						   host_iova,
> +						   region->size);
> +		if (ret) {
> +			VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> +			return ret;
> +		}
> +	}
> +	return ret;
> +}

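For reference, the way an application would opt in to this is by passing
the new flag when registering the vhost-user socket - a minimal sketch,
socket path made up:

#include <rte_vhost.h>

/* illustrative only: register a vhost-user socket with async copy and
 * VFIO/IOMMU programming of guest memory enabled */
static int
register_async_vfio_socket(void)
{
	uint64_t flags = RTE_VHOST_USER_ASYNC_COPY |
			 RTE_VHOST_USER_ASYNC_USE_VFIO;

	return rte_vhost_driver_register("/tmp/vhost-user0.sock", flags);
}
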
We've been discussing this off list with Xuan, and unfortunately this is 
a blocker for now.

Currently, the x86 IOMMU does not support partial unmap - segments have
to be unmapped with exactly the same addr/len they were mapped with. We
also concatenate adjacent mappings to prevent filling up the DMA mapping
entry table with superfluous entries.

This means that, when two unrelated mappings are contiguous in memory 
(e.g. if you map regions 1 and 2 independently, but they happen to be 
sitting right next to each other in virtual memory), we cannot later 
unmap one of them because, even though these are two separate mappings 
as far as the kernel VFIO infrastructure is concerned, the mapping gets 
compacted and looks like one single mapping to VFIO, so the DPDK API will 
not let us unmap region 1 without also unmapping region 2.
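
To make the failure mode concrete - a rough sketch with made-up addresses
and lengths:

#include <rte_vfio.h>

static void
partial_unmap_example(void)
{
	uint64_t vaddr = 0x100000000ULL, iova = vaddr;
	uint64_t len1 = 0x200000, len2 = 0x200000;

	/* two independent mappings that happen to be virtually contiguous
	 * are concatenated into a single user mem map entry internally */
	rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
			vaddr, iova, len1);
	rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
			vaddr + len1, iova + len1, len2);

	/* fails: this covers only part of the concatenated mapping */
	rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
			vaddr, iova, len1);
}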

The proper fix for this problem would be to always map memory 
page-by-page regardless of where it comes from (we already do that for 
internal memory, but not for external). However, the reason this works 
for internal memory is because when mapping internal memory segments, 
*we know the page size*. For external memory segments, there is no such 
guarantee, so we cannot deduce the page size for a given memory segment, 
and thus can't map things page-by-page.
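
The page-by-page approach is only possible when the page size is known -
roughly something like this (a sketch, not the actual EAL code):

#include <rte_vfio.h>

/* only usable when pg_sz is actually known for the region */
static int
map_region_by_page(uint64_t vaddr, uint64_t iova, uint64_t len,
		uint64_t pg_sz)
{
	uint64_t off;

	for (off = 0; off < len; off += pg_sz) {
		if (rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
				vaddr + off, iova + off, pg_sz) < 0)
			return -1;
	}

	return 0;
}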

So, the proper fix for it would be to add page size to the VFIO DMA API. 
Unfortunately, it probably has to wait until 21.11 because it is an API 
change.
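
Purely to illustrate the shape of such a change - a hypothetical
signature, not an existing API:

/* hypothetical: let the caller state the page size of the region */
int rte_vfio_container_dma_map_pagesz(int container_fd, uint64_t vaddr,
		uint64_t iova, uint64_t len, uint64_t pg_sz);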

The slightly hacky fix for this would be to forgo user mem map 
concatenation and trust that the user is not going to do anything stupid, 
and will not spam the VFIO DMA API without reason. I would rather not go 
down this road, but this could be an option in this case.

Thoughts?

-- 
Thanks,
Anatoly
