[dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1

Nithin Dabilpuram nithind1988 at gmail.com
Wed Oct 28 17:31:27 CET 2020


On Wed, Oct 28, 2020 at 04:07:17PM +0000, Burakov, Anatoly wrote:
> On 28-Oct-20 2:17 PM, Nithin Dabilpuram wrote:
> > On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
> > > On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> > > > Ping.
> > > > 
> > > > On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> > > > > On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > > > > > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > > > > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > > > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > > > > > <anatoly.burakov at intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > > > > > by the kernel. Though the kernel returns zero, the unmapped
> > > > > > > > > > > > > size reported back will not be the same as requested. So check
> > > > > > > > > > > > > the returned unmap size and return an error.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For DMA map/unmap triggered by heap allocations, maintain the
> > > > > > > > > > > > > granularity of the memseg page size so that heap expansion and
> > > > > > > > > > > > > contraction do not hit this issue.
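
A minimal sketch of the heap-path idea described in the commit message above,
using the raw type1 ioctl directly; this is illustrative, not the actual patch
code, and assumes a container fd obtained elsewhere. Each memseg gets its own
DMA entry instead of being concatenated with its neighbours, so a later heap
contraction can unmap exactly the segments it releases.

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>
    #include <rte_memory.h>

    /* map each memseg as its own VFIO DMA entry, no concatenation */
    static int
    map_one_seg(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
                void *arg)
    {
            struct vfio_iommu_type1_dma_map dma_map;
            const int *container_fd = arg;

            (void)msl;
            memset(&dma_map, 0, sizeof(dma_map));
            dma_map.argsz = sizeof(dma_map);
            dma_map.vaddr = ms->addr_64;
            dma_map.iova = ms->iova;
            dma_map.size = ms->len;
            dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

            /* one entry per memseg preserves page-size unmap granularity */
            return ioctl(*container_fd, VFIO_IOMMU_MAP_DMA, &dma_map) ? -1 : 0;
    }

    /* usage sketch: rte_memseg_walk(map_one_seg, &container_fd); */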
> > > > > > > > > > > > 
> > > > > > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > > > > > with the kernel having a very limited number of mappings available [1], as a
> > > > > > > > > > > > result of which the page concatenation code was added.
> > > > > > > > > > > > 
> > > > > > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > > > > > be adjusted should the user run out of DMA entries.
> > > > > > > > > > > > 
> > > > > > > > > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > > > > > > > > 
> > > > > > > > > > <snip>
> > > > > > > > > > 
> > > > > > > > > > > > >                         RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > > > > > >                                         errno, strerror(errno));
> > > > > > > > > > > > >                         return -1;
> > > > > > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > > > > > +                   return -1;
> > > > > > > > > > > > >                 }
> > > > > > > > > > > > >         }
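
For reference, a self-contained sketch of the check the hunk above performs,
assuming a type1 container fd obtained elsewhere (the function name is
illustrative): VFIO_IOMMU_UNMAP_DMA writes the number of bytes it actually
unmapped back into the size field, so a partial unmap can be detected by
comparing that against the requested length.

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int
    unmap_and_check(int container_fd, uint64_t iova, uint64_t len)
    {
            struct vfio_iommu_type1_dma_unmap dma_unmap;

            memset(&dma_unmap, 0, sizeof(dma_unmap));
            dma_unmap.argsz = sizeof(dma_unmap);
            dma_unmap.iova = iova;
            dma_unmap.size = len;

            if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) != 0) {
                    fprintf(stderr, "cannot clear DMA remapping: %s\n",
                            strerror(errno));
                    return -1;
            }
            /* the kernel reports how much it really unmapped */
            if (dma_unmap.size != len) {
                    fprintf(stderr, "partial unmap: %llu of %llu bytes\n",
                            (unsigned long long)dma_unmap.size,
                            (unsigned long long)len);
                    return -1;
            }
            return 0;
    }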
> > > > > > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > > > > > >                 /* we're partially unmapping a previously mapped region, so we
> > > > > > > > > > > > >                  * need to split entry into two.
> > > > > > > > > > > > >                  */
> > > > > > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > > > > > +                   ret = -1;
> > > > > > > > > > > > > +                   goto out;
> > > > > > > > > > > > > +           }
> > > > > > > > > > > > 
> > > > > > > > > > > > How would we ever arrive here if we never map more than one page's worth of
> > > > > > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > > > > > 
> > > > > > > > > > > container_dma_unmap() is called by the user via rte_vfio_container_dma_unmap(),
> > > > > > > > > > > and when the user maps a region we don't split it, as we don't know about that memory.
> > > > > > > > > > > So if the user maps multiple pages and tries to unmap them partially, we should fail.
> > > > > > > > > > 
> > > > > > > > > > Should we map it at page granularity then, instead of adding this
> > > > > > > > > > discrepancy between EAL and user mappings? I.e. instead of adding a
> > > > > > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > > > > > 
> > > > > > > > > In heap mappings we map and unmap at hugepage granularity, since we always
> > > > > > > > > maintain that.
> > > > > > > > > 
> > > > > > > > > But here I think we don't know whether the user's allocation is a hugepage or a
> > > > > > > > > collection of system pages. The only thing we can do here is map it at system
> > > > > > > > > page granularity, which could waste entries if the user really is working with
> > > > > > > > > hugepages. Isn't it?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yeah, we do. The API mandates page granularity, and it will check against the
> > > > > > > > page size and the number of IOVA entries, so yes, we do enforce that the IOVA
> > > > > > > > addresses supplied by the user have to be page addresses.
> > > > > > > 
> > > > > > > Looking at rte_vfio_container_dma_map(), there is no mention of the hugepage
> > > > > > > size the user is providing, nor do we compute it. The user can call
> > > > > > > rte_vfio_container_dma_map() with a 1GB hugepage or a 4K system page.
> > > > > > > 
> > > > > > > Am I missing something?
> > > > > > 
> > > > > > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > > > > > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > > > > > mapped for DMA as a contiguous 4K-based block?
> > > > > 
> > > > > I'm not suggesting anything. My only thought is how to solve the problem below.
> > > > > Say the application does the following.
> > > > > 
> > > > > #1 Allocate 1GB of memory from a hugepage or some external memory.
> > > > > #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
> > > > >      In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1GB, as we
> > > > >      don't know where this memory comes from or what backs it.
> > > > > #3 After a while, call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
> > > > > Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as
> > > > > shown below, with the VFIO type1 IOMMU, #3 cannot be supported by the current
> > > > > kernel interface. So how can we allow #3?
> > > > > 
> > > > > 
> > > > > static int
> > > > > container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > >                   uint64_t len)
> > > > > {
> > > > >           struct user_mem_map *map, *new_map = NULL;
> > > > >           struct user_mem_maps *user_mem_maps;
> > > > >           int ret = 0;
> > > > > 
> > > > >           user_mem_maps = &vfio_cfg->mem_maps;
> > > > >           rte_spinlock_recursive_lock(&user_mem_maps->lock);
> > > > > 
> > > > >           /* find our mapping */
> > > > >           map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
> > > > >           if (!map) {
> > > > >                   RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
> > > > >                   rte_errno = EINVAL;
> > > > >                   ret = -1;
> > > > >                   goto out;
> > > > >           }
> > > > >           if (map->addr != vaddr || map->iova != iova || map->len != len) {
> > > > >                   /* we're partially unmapping a previously mapped region, so we
> > > > >                    * need to split entry into two.
> > > > >                    */
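
To make the scenario in #1-#3 concrete, a sketch using the public API (the
1GB size and 4KB offset are just the example values above; error handling is
trimmed): the map in #2 creates a single entry, and with the patch discussed
here the partial unmap in #3 is refused for type1 with rte_errno set, while
unmapping the exact original chunk still succeeds.

    #include <stdint.h>
    #include <stdio.h>
    #include <rte_errno.h>
    #include <rte_vfio.h>

    static int
    scenario(void *mem) /* assumed: 1GB, IOVA-contiguous, IOVA == VA */
    {
            const uint64_t gb = 1ULL << 30;
            uint64_t va = (uintptr_t)mem;

            /* #1 + #2: one VFIO DMA entry covering the whole region */
            if (rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
                                           va, va, gb) != 0)
                    return -1;

            /* #3: partial unmap of a 4KB piece; refused for type1 */
            if (rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
                                             va + 4096, va + 4096, 4096) != 0)
                    printf("partial unmap refused: rte_errno=%d\n", rte_errno);

            /* unmapping the exact chunk that was mapped still works */
            return rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
                                                va, va, gb);
    }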
> > > 
> > > Hi,
> > > 
> > > Apologies, I was on vacation.
> > > 
> > > Yes, I can see the problem now. Does VFIO even support non-system page
> > > sizes? Like, if I allocated a 1GB page, would I be able to map *this page*
> > > for DMA, as opposed to the first 4K of this page? I suspect that the mapping
> > > doesn't support page sizes other than the system page size.
> > 
> > It does support mapping any multiple of the system page size.
> > See vfio_pin_map_dma() in vfio/vfio_iommu_type1.c. Also,
> > ./driver-api/vfio.rst doesn't mention any such restriction, even in its
> > example.
> > 
> > Also, my test case passes, which confirms the behavior.
> 
> Can we perhaps make it so that the API mandates mapping/unmapping the same
> chunks? That would be the easiest solution here.

Ack, I was already doing that for the type1 IOMMU with my patch above.
I didn't change the behavior for sPAPR or no-IOMMU mode.
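
For illustration, the shape of the per-IOMMU-type capability flag this relies
on (the struct and table below are a sketch of the idea behind the
vfio_iommu_type->partial_unmap check in the hunk above, not code copied from
the patch): type1 advertises that it cannot partially unmap an entry, while
sPAPR and no-IOMMU keep their existing behavior.

    #include <stdbool.h>
    #include <linux/vfio.h>

    struct iommu_type_caps {
            int type_id;
            const char *name;
            bool partial_unmap; /* can the kernel unmap a sub-range of an entry? */
    };

    static const struct iommu_type_caps caps[] = {
            { VFIO_TYPE1_IOMMU,        "Type 1",   false },
            { VFIO_SPAPR_TCE_v2_IOMMU, "sPAPR",    true  },
            { VFIO_NOIOMMU_IOMMU,      "No-IOMMU", true  },
    };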

