[dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

Bruce Richardson bruce.richardson at intel.com
Wed Jul 7 17:00:07 CEST 2021


On Wed, Jul 07, 2021 at 03:33:44PM +0100, Burakov, Anatoly wrote:
> On 07-Jul-21 1:54 PM, Ding, Xuan wrote:
> > Hi Anatoly,
> > 
> > > -----Original Message-----
> > > From: Burakov, Anatoly <anatoly.burakov at intel.com>
> > > Sent: Wednesday, July 7, 2021 8:18 PM
> > > To: Ding, Xuan <xuan.ding at intel.com>; Maxime Coquelin
> > > <maxime.coquelin at redhat.com>; Xia, Chenbo <chenbo.xia at intel.com>;
> > > Thomas Monjalon <thomas at monjalon.net>; David Marchand
> > > <david.marchand at redhat.com>
> > > Cc: dev at dpdk.org; Hu, Jiayu <jiayu.hu at intel.com>; Pai G, Sunil
> > > <sunil.pai.g at intel.com>; Richardson, Bruce <bruce.richardson at intel.com>; Van
> > > Haaren, Harry <harry.van.haaren at intel.com>; Liu, Yong <yong.liu at intel.com>;
> > > Ma, WenwuX <wenwux.ma at intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> > > 
> > > On 07-Jul-21 7:25 AM, Ding, Xuan wrote:
> > > > Hi,
> > > > 
> > > > > -----Original Message-----
> > > > > From: Maxime Coquelin <maxime.coquelin at redhat.com>
> > > > > Sent: Tuesday, July 6, 2021 5:32 PM
> > > > > To: Burakov, Anatoly <anatoly.burakov at intel.com>; Ding, Xuan
> > > > > <xuan.ding at intel.com>; Xia, Chenbo <chenbo.xia at intel.com>; Thomas
> > > > > Monjalon <thomas at monjalon.net>; David Marchand
> > > > > <david.marchand at redhat.com>
> > > > > Cc: dev at dpdk.org; Hu, Jiayu <jiayu.hu at intel.com>; Pai G, Sunil
> > > > > <sunil.pai.g at intel.com>; Richardson, Bruce <bruce.richardson at intel.com>; Van
> > > > > Haaren, Harry <harry.van.haaren at intel.com>; Liu, Yong <yong.liu at intel.com>;
> > > > > Ma, WenwuX <wenwux.ma at intel.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
> > > > > 
> > > > > 
> > > > > 
> > > > > On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
> > > > > > On 06-Jul-21 9:31 AM, Ding, Xuan wrote:
> > > > > > > Hi Maxime,
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Maxime Coquelin <maxime.coquelin at redhat.com>
> > > > > > > > Sent: Monday, July 5, 2021 8:46 PM
> > > > > > > > To: Burakov, Anatoly <anatoly.burakov at intel.com>; Ding, Xuan
> > > > > > > > <xuan.ding at intel.com>; Xia, Chenbo <chenbo.xia at intel.com>; Thomas
> > > > > > > > Monjalon <thomas at monjalon.net>; David Marchand
> > > > > > > > <david.marchand at redhat.com>
> > > > > > > > Cc: dev at dpdk.org; Hu, Jiayu <jiayu.hu at intel.com>; Pai G, Sunil
> > > > > > > > <sunil.pai.g at intel.com>; Richardson, Bruce <bruce.richardson at intel.com>;
> > > > > > > > Van Haaren, Harry <harry.van.haaren at intel.com>; Liu, Yong
> > > > > > > > <yong.liu at intel.com>; Ma, WenwuX <wenwux.ma at intel.com>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
> > > > > > > > vhost
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
> > > > > > > > > On 05-Jul-21 9:40 AM, Xuan Ding wrote:
> > > > > > > > > > The use of IOMMU has many advantages, such as isolation and address
> > > > > > > > > > translation. This patch extends the capability of the DMA engine to
> > > > > > > > > > use the IOMMU if the DMA device is bound to vfio.
> > > > > > > > > > 
> > > > > > > > > > When the memory table is set, the guest memory will be mapped
> > > > > > > > > > into the default container of DPDK.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xuan Ding <xuan.ding at intel.com>
> > > > > > > > > > ---
> > > > > > > > > >      doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
> > > > > > > > > >      lib/vhost/rte_vhost.h               |  1 +
> > > > > > > > > >      lib/vhost/socket.c                  |  9 ++++++
> > > > > > > > > >      lib/vhost/vhost.h                   |  1 +
> > > > > > > > > >      lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
> > > > > > > > > >      5 files changed, 65 insertions(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > index 05c42c9b11..c3beda23d9 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/vhost_lib.rst
> > > > > > > > > > @@ -118,6 +118,15 @@ The following is an overview of some key Vhost API functions:
> > > > > > > > > >            It is disabled by default.
> > > > > > > > > >      +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
> > > > > > > > > > +
> > > > > > > > > > +    In the asynchronous data path, the vhost library is not aware of
> > > > > > > > > > +    which driver (igb_uio/vfio) the DMA device is bound to. The
> > > > > > > > > > +    application should pass this flag to tell the vhost library
> > > > > > > > > > +    whether the IOMMU should be programmed for guest memory.
> > > > > > > > > > +
> > > > > > > > > > +    It is disabled by default.
> > > > > > > > > > +
> > > > > > > > > >        - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
> > > > > > > > > >            Since v16.04, the vhost library forwards checksum and gso requests for
> > > > > > > > > > diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> > > > > > > > > > index 8d875e9322..a766ea7b6b 100644
> > > > > > > > > > --- a/lib/vhost/rte_vhost.h
> > > > > > > > > > +++ b/lib/vhost/rte_vhost.h
> > > > > > > > > > @@ -37,6 +37,7 @@ extern "C" {
> > > > > > > > > >      #define RTE_VHOST_USER_LINEARBUF_SUPPORT    (1ULL << 6)
> > > > > > > > > >      #define RTE_VHOST_USER_ASYNC_COPY    (1ULL << 7)
> > > > > > > > > >      #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS    (1ULL << 8)
> > > > > > > > > > +#define RTE_VHOST_USER_ASYNC_USE_VFIO    (1ULL << 9)
> > > > > > > > > >        /* Features. */
> > > > > > > > > >      #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
> > > > > > > > > > diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
> > > > > > > > > > index 5d0d728d52..77c722c86b 100644
> > > > > > > > > > --- a/lib/vhost/socket.c
> > > > > > > > > > +++ b/lib/vhost/socket.c
> > > > > > > > > > @@ -42,6 +42,7 @@ struct vhost_user_socket {
> > > > > > > > > >          bool extbuf;
> > > > > > > > > >          bool linearbuf;
> > > > > > > > > >          bool async_copy;
> > > > > > > > > > +    bool async_use_vfio;
> > > > > > > > > >          bool net_compliant_ol_flags;
> > > > > > > > > >            /*
> > > > > > > > > > @@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
> > > > > > > > > >                  dev->async_copy = 1;
> > > > > > > > > >          }
> > > > > > > > > >      +    if (vsocket->async_use_vfio) {
> > > > > > > > > > +        dev = get_device(vid);
> > > > > > > > > > +
> > > > > > > > > > +        if (dev)
> > > > > > > > > > +            dev->async_use_vfio = 1;
> > > > > > > > > > +    }
> > > > > > > > > > +
> > > > > > > > > >          VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
> > > > > > > > > >            if (vsocket->notify_ops->new_connection) {
> > > > > > > > > > @@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
> > > > > > > > > >          vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> > > > > > > > > >          vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
> > > > > > > > > >          vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
> > > > > > > > > > +    vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
> > > > > > > > > >          vsocket->net_compliant_ol_flags = flags & RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
> > > > > > > > > >            if (vsocket->async_copy &&
> > > > > > > > > > diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
> > > > > > > > > > index 8078ddff79..fb775ce4ed 100644
> > > > > > > > > > --- a/lib/vhost/vhost.h
> > > > > > > > > > +++ b/lib/vhost/vhost.h
> > > > > > > > > > @@ -370,6 +370,7 @@ struct virtio_net {
> > > > > > > > > >          int16_t            broadcast_rarp;
> > > > > > > > > >          uint32_t        nr_vring;
> > > > > > > > > >          int            async_copy;
> > > > > > > > > > +    int            async_use_vfio;
> > > > > > > > > >          int            extbuf;
> > > > > > > > > >          int            linearbuf;
> > > > > > > > > >          struct vhost_virtqueue    *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> > > > > > > > > > diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> > > > > > > > > > index 8f0eba6412..f3703f2e72 100644
> > > > > > > > > > --- a/lib/vhost/vhost_user.c
> > > > > > > > > > +++ b/lib/vhost/vhost_user.c
> > > > > > > > > > @@ -45,6 +45,7 @@
> > > > > > > > > >      #include <rte_common.h>
> > > > > > > > > >      #include <rte_malloc.h>
> > > > > > > > > >      #include <rte_log.h>
> > > > > > > > > > +#include <rte_vfio.h>
> > > > > > > > > >        #include "iotlb.h"
> > > > > > > > > >      #include "vhost.h"
> > > > > > > > > > @@ -141,6 +142,36 @@ get_blk_size(int fd)
> > > > > > > > > >          return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
> > > > > > > > > >      }
> > > > > > > > > > +static int
> > > > > > > > > > +async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
> > > > > > > > > > +{
> > > > > > > > > > +    int ret = 0;
> > > > > > > > > > +    uint64_t host_iova;
> > > > > > > > > > +
> > > > > > > > > > +    host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
> > > > > > > > > > +    if (do_map) {
> > > > > > > > > > +        /* Add mapped region into the default container of DPDK. */
> > > > > > > > > > +        ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > > > > > > > > +                         region->host_user_addr,
> > > > > > > > > > +                         host_iova,
> > > > > > > > > > +                         region->size);
> > > > > > > > > > +        if (ret) {
> > > > > > > > > > +            VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
> > > > > > > > > > +            return ret;
> > > > > > > > > > +        }
> > > > > > > > > > +    } else {
> > > > > > > > > > +        /* Remove mapped region from the default container of DPDK. */
> > > > > > > > > > +        ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
> > > > > > > > > > +                           region->host_user_addr,
> > > > > > > > > > +                           host_iova,
> > > > > > > > > > +                           region->size);
> > > > > > > > > > +        if (ret) {
> > > > > > > > > > +            VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
> > > > > > > > > > +            return ret;
> > > > > > > > > > +        }
> > > > > > > > > > +    }
> > > > > > > > > > +
> > > > > > > > > > +    return ret;
> > > > > > > > > > +}
> > > > > > > > > 
> > > > > > > > > We've been discussing this off list with Xuan, and unfortunately
> > > > > > > > > this is
> > > > > > > > > a blocker for now.
> > > > > > > > > 
> > > > > > > > > Currently, the x86 IOMMU does not support partial unmap - the segments
> > > > > > > > > have to be unmapped exactly the same addr/len as they were mapped. We
> > > > > > > > > also concatenate adjacent mappings to prevent filling up the DMA mapping
> > > > > > > > > entry table with superfluous entries.
> > > > > > > > > 
> > > > > > > > > This means that, when two unrelated mappings are contiguous in memory
> > > > > > > > > (e.g. if you map regions 1 and 2 independently, but they happen to be
> > > > > > > > > sitting right next to each other in virtual memory), we cannot later
> > > > > > > > > unmap one of them because, even though these are two separate mappings
> > > > > > > > > as far as kernel VFIO infrastructure is concerned, the mapping gets
> > > > > > > > > compacted and looks like one single mapping to VFIO, so DPDK API will
> > > > > > > > > not let us unmap region 1 without also unmapping region 2.
> > > > > > > > > 
> > > > > > > > > The proper fix for this problem would be to always map memory
> > > > > > > > > page-by-page regardless of where it comes from (we already do that for
> > > > > > > > > internal memory, but not for external). However, the reason this works
> > > > > > > > > for internal memory is because when mapping internal memory segments,
> > > > > > > > > *we know the page size*. For external memory segments, there is no such
> > > > > > > > > guarantee, so we cannot deduce page size for a given memory segment, and
> > > > > > > > > thus can't map things page-by-page.
> > > > > > > > > 
> > > > > > > > > So, the proper fix for it would be to add page size to the VFIO DMA API.
> > > > > > > > > Unfortunately, it probably has to wait until 21.11 because it is an API
> > > > > > > > > change.
> > > > > > > > > 
> > > > > > > > > The slightly hacky fix for this would be to forego user mem map
> > > > > > > > > concatenation and trust that the user is not going to do anything
> > > > > > > > > stupid and will not spam the VFIO DMA API without reason. I would
> > > > > > > > > rather not go down this road, but it could be an option in this case.
> > > > > > > > > 
> > > > > > > > > Thoughts?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks Anatoly for the detailed description of the issue.
> > > > > > > > It may be possible to either create a versioned symbol for this API
> > > > > > > > change, or maybe even to have a temporary internal API.
> > > > > > > > 
> > > > > > > > But I think this series in its current form is not acceptable, so
> > > > > > > > waiting for v21.11 would be the best option (we may want to send the
> > > > > > > > deprecation notice in this release though).
> > > > > > > > 
> > > > > > > > In this series, I don't like that the user application has to pass a
> > > > > > > > flag to state whether the DMA engine uses VFIO or not. AFAICT, this new
> > > > > > > > revision does not implement what was discussed in the previous one, i.e.
> > > > > > > > supporting both IOVA_AS_VA and IOVA_AS_PA.
> > > > > > > 
> > > > > > > Thanks for your comments. Here I hope to explain some questions:
> > > > > > > 1. Are both IOVA_AS_VA and IOVA_AS_PA supported now?
> > > > > > > A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,
> > > > > > > the virtual address is replaced with the iova address of the mapped
> > > > > > > region, and the iova address is selected to program the IOMMU instead
> > > > > > > of the virtual address only.
> > > > > 
> > > > > Good!
> > > > > 
> > > > > > > 
> > > > > > > 2. Why is a flag chosen to be passed by the application?
> > > > > > > A: Yes, as we discussed before, the rte_eal_iova_mode() API can be used
> > > > > > > to get the IOVA mode, so as to determine whether the IOMMU should be
> > > > > > > programmed. However, in the implementation process, I found a problem:
> > > > > > > how to distinguish VFIO PA from IGB_UIO PA. For VFIO cases, we should
> > > > > > > always program the IOMMU, while in IGB_UIO cases it depends on the
> > > > > > > IOMMU capability of the platform.
> > > > > > 
> > > > > > How does one program the IOMMU with igb_uio? I was under the impression
> > > > > > that igb_uio (and uio_pci_generic, for that matter) does not provide
> > > > > > such facilities.
> > > > > 
> > > > > +1
> > > > 
> > > > Maybe there is some misunderstanding in this sentence.
> > > > In our design, if rte_eal_vfio_is_enabled("vfio") is true, the IOMMU will
> > > > be programmed. True means the vfio module is modprobed.
> > > > 
> > > > But there is an exception here: even if the vfio module is modprobed, the
> > > > DPDK user may still bind all the devices to igb_uio.
> > > > 
> > > > This situation can be distinguished in DPDK eal initialization, because
> > > > the resource mapping is done according to the driver loaded by each
> > > > device (rte_pci_map_device).
> > > > 
> > > > But in our scenario this judgment is somewhat weak, because we cannot get
> > > > the device driver info in the vhost library. I also think it is
> > > > unreasonable for vhost to do this; we can only trust that users will not
> > > > use it like this. Thoughts on this scenario?
> > > 
> > > I don't see how igb_uio would make any difference at all. If you are
> > > using igb_uio, you *don't have DMA mapping at all* and will use raw
> > > physical addresses. Assuming your code supports this, that's all you're
> > > ever going to get. The point of VFIO is to have memory regions that are
> > > mapped for DMA *because real physical addresses are assumed to be not
> > > available*. When you're using igb_uio, you effectively do have DMA
> > > access to the entire memory, and thus can bypass IOMMU altogether
> > > (assuming you're using passthrough mode).
> > 
> > My concern is exactly here.
> > In igb_uio cases, although devices are not added to the default container
> > in eal init, the "IOMMU programming" still happens when
> > rte_vfio_container_dma_map() is called.
> > It does no harm, but it is also unnecessary.
> 
> Yes, it is unnecessary, but it's also not actively harmful, which means you
> can still do it without any regard as to whether you do or don't have IOMMU
> :)
> 
> Think of a hybrid VFIO/igb_uio setup - some NICs will have VFIO, some will
> have igb_uio. The igb_uio-bound NICs will not care if you have mapped
> anything for DMA because they don't go through IOMMU, things will "just
> work". The VFIO-bound NICs will get the memory mapped, because they are the
> ones who actually need the DMA mapping.
> 
Do we even support a hybrid setup? I would have thought that was just
asking for problems and should be considered an unsupported configuration.

