[dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
Burakov, Anatoly
anatoly.burakov at intel.com
Tue May 5 16:57:01 CEST 2020
On 01-May-20 5:48 PM, David Christensen wrote:
>>>> I'm not sure of the algorithm for "memory size" here.
>>>>
>>>> Technically, DPDK can reserve memory segments anywhere in the VA
>>>> space allocated by memseg lists. That space may be far bigger than
>>>> system memory (on a typical Intel server board you'd see 128GB of VA
>>>> space preallocated even though the machine itself might only have,
>>>> say, 16GB of RAM installed). The same applies to any other arch
>>>> running on Linux, so the window needs to cover at least
>>>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to
>>>> highest memseglist VA address. That's not even mentioning the fact
>>>> that the user may register external memory for DMA which may cause
>>>> the window to be of insufficient size to cover said external memory.
>>>>
>>>> I also think that, in general, the "system memory" metric is ill-suited
>>>> for measuring VA space, because unlike system memory, the VA space
>>>> is sparse and can therefore span *a lot* of address space even
>>>> though in reality it may actually use very little physical memory.
>>>
>>> I'm open to suggestions here. Perhaps an alternative in /proc/meminfo:
>>>
>>> VmallocTotal: 549755813888 kB
>>>
>>> I tested it with 1GB hugepages and it works; I still need to check with 2M as
>>> well. If there's no alternative for sizing the window based on
>>> available system parameters then I have another option which creates
>>> a new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to
>>> X where X is configured on the EAL command-line (--iova-base,
>>> --iova-len). I use these command-line values to create a static
>>> window.
>>>
>>
>> A whole new IOVA mode, while being a cleaner solution, would require a
>> lot of testing, and it doesn't really solve the external memory
>> problem, because we're still reliant on the user to provide IOVA
>> addresses. Perhaps something akin to VA/IOVA address reservation would
>> solve the problem, but again, lots of changes and testing, all for a
>> comparatively narrow use case.
>>
>> The vmalloc area seems big enough (512 terabytes on your machine, 32
>> terabytes on mine), so it'll probably be OK. I'd settle for:
>>
>> 1) start at base_virtaddr OR lowest memseg list address, whichever is
>> lowest
>
> The IOMMU only supports two starting addresses, 0 or 1<<59, so the
> implementation will need to start at 0. (I've been bitten by this before;
> my understanding is that the processor only supports 54 bits of the
> address and that the PCI host bridge uses bit 59 of the IOVA as a signal
> to do the address translation for the second DMA window.)
Fair enough, 0 it is then.
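For reference, a minimal sketch of what creating the window with a fixed
start of 0 could look like (struct/ioctl names per linux/vfio.h; the
power-of-two rounding of the window size is my assumption based on the
spapr TCE driver's requirements, and spapr_create_window() is just an
illustrative name):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <rte_common.h> /* rte_align64pow2() */

/* Create a DMA window covering [0, max_va). The spapr TCE driver only
 * allows windows to start at 0 (first window) or 1<<59 (second window),
 * so only the size is under our control here. */
static int
spapr_create_window(int container_fd, uint64_t max_va, uint32_t page_shift)
{
	struct vfio_iommu_spapr_tce_create create = {
		.argsz = sizeof(create),
		.page_shift = page_shift,
		/* window size is expected to be a power of two */
		.window_size = rte_align64pow2(max_va),
		.levels = 1,
	};

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create) != 0)
		return -1;

	/* the kernel reports where the window actually starts */
	return create.start_addr == 0 ? 0 : -1;
}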
>
>> 2) end at lowest addr + VmallocTotal OR highest memseglist addr,
>> whichever is higher
>
> So, instead of rte_memseg_walk() execute rte_memseg_list_walk() to find
> the lowest/highest msl addresses?
Yep. rte_memseg_walk() will only cover allocated pages, while
rte_memseg_list_walk() will also cover memseg lists that have no pages
allocated yet.
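A rough sketch of that walk (find_max_va is just an illustrative name;
since the window has to start at 0 anyway, only the highest covered
address matters):

#include <stdint.h>
#include <rte_memory.h>

/* rte_memseg_list_walk() callback: track the highest VA covered by any
 * memseg list, including lists with no pages allocated yet. */
static int
find_max_va(const struct rte_memseg_list *msl, void *arg)
{
	uint64_t *max_va = arg;
	uint64_t end = (uint64_t)(uintptr_t)msl->base_va + msl->len;

	if (end > *max_va)
		*max_va = end;
	return 0;
}

	/* usage: */
	uint64_t max_va = 0;

	rte_memseg_list_walk(find_max_va, &max_va);
	/* the DMA window would then span [0, max_va) */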
>
>> 3) a check in user DMA map function that would warn/throw an error
>> whenever there is an attempt to map an address for DMA that doesn't
>> fit into the DMA window
>
> Isn't this mostly prevented by the use of rte_mem_set_dma_mask() and
> rte_mem_check_dma_mask()? I'd expect an error would be thrown by the
> kernel IOMMU API for an out-of-range mapping that I would simply return
> to the caller (drivers/vfio/vfio_iommu_spapr_tce.c includes the comment
> /* iova is checked by the IOMMU API */). Why do you think double
> checking this would help?
I don't think we call rte_mem_check_dma_mask() anywhere in the call
path of the external memory code. Also, I just checked, and you're right:
rte_vfio_container_dma_map() will fail if the kernel fails to map the
memory. However, nothing will fail for external memory, because the IOVA
addresses aren't checked against the DMA mask.
See malloc_heap.c:1097 onwards: we simply add user-specified IOVA
addresses into the page table without checking whether they fit into the
DMA mask. The DMA mapping then happens through a mem event callback, but
we don't check the return value of that callback either, so even if the
DMA mapping fails, we'll only get a log message.
So, perhaps the real solution here is to add a DMA mask check into
rte_malloc_heap_memory_add(), so that we check the IOVA addresses before
we ever try to do anything with them. I'll submit a patch for this.
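As a rough illustration of that check (not the actual patch), with
parameters mirroring rte_malloc_heap_memory_add(), and glossing over how
the currently configured mask bits are retrieved:

#include <stdint.h>
#include <rte_memory.h>

/* Reject user-supplied IOVA addresses that don't fit within the DMA
 * mask (maskbits = number of usable IOVA bits). */
static int
check_iova_addrs(const rte_iova_t iova_addrs[], unsigned int n_pages,
		size_t page_sz, uint8_t maskbits)
{
	const uint64_t limit =
		maskbits >= 64 ? UINT64_MAX : (1ULL << maskbits);
	unsigned int i;

	for (i = 0; i < n_pages; i++) {
		/* the whole page must be addressable, not just its start */
		if (page_sz > limit || iova_addrs[i] > limit - page_sz)
			return -1;
	}
	return 0;
}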
>
>>
>> I think that would be best approach. Thoughts?
>
> Dave
--
Thanks,
Anatoly