[dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing

Burakov, Anatoly anatoly.burakov at intel.com
Tue May 5 16:57:01 CEST 2020


On 01-May-20 5:48 PM, David Christensen wrote:
>>>> I'm not sure of the algorithm for "memory size" here.
>>>>
>>>> Technically, DPDK can reserve memory segments anywhere in the VA 
>>>> space allocated by memseg lists. That space may be far bigger than 
>>>> system memory (on a typical Intel server board you'd see 128GB of VA 
>>>> space preallocated even though the machine itself might only have, 
>>>> say, 16GB of RAM installed). The same applies to any other arch 
>>>> running on Linux, so the window needs to cover at least 
>>>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to 
>>>> highest memseglist VA address. That's not even mentioning the fact 
>>>> that the user may register external memory for DMA which may cause 
>>>> the window to be of insufficient size to cover said external memory.
>>>>
>>>> I also think that, in general, a "system memory" metric is ill-suited 
>>>> for measuring VA space, because unlike system memory, the VA space 
>>>> is sparse and can therefore span *a lot* of address space even 
>>>> though in reality it may actually use very little physical memory.
>>>
>>> I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
>>>
>>> VmallocTotal:   549755813888 kB
>>>
>>> I tested it with 1GB hugepages and it works; I still need to check 
>>> with 2MB pages as well.  If there's no alternative for sizing the 
>>> window based on available system parameters, then I have another 
>>> option which creates a new RTE_IOVA_TA mode that forces IOVA 
>>> addresses into the range 0 to X, where X is configured on the EAL 
>>> command line (--iova-base, --iova-len).  I use these command-line 
>>> values to create a static window.
>>>
>>
>> A whole new IOVA mode, while being a cleaner solution, would require a 
>> lot of testing, and it doesn't really solve the external memory 
>> problem, because we're still reliant on the user to provide IOVA 
>> addresses. Perhaps something akin to VA/IOVA address reservation would 
>> solve the problem, but again, lots of changes and testing, all for a 
>> comparatively narrow use case.
>>
>> The vmalloc area seems big enough (512 terabytes on your machine, 32 
>> terabytes on mine), so it'll probably be OK. I'd settle for:
>>
>> 1) start at base_virtaddr OR lowest memseg list address, whichever is 
>> lowest
> 
> The IOMMU only supports two starting addresses, 0 or 1<<59, so the 
> implementation will need to start at 0.  (I've been bitten by this 
> before; my understanding is that the processor only supports 54 bits of 
> the address and that the PCI host bridge uses bit 59 of the IOVA as a 
> signal to do the address translation for the second DMA window.)

Fair enough, 0 it is then.
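
For reference, this is roughly the shape I'd expect the window creation 
to take (just a sketch: the struct fields are from linux/vfio.h and the 
DPDK helpers are real, but the page shift here is a placeholder and the 
real code would derive it and the window size from the hugepage size and 
the memseg lists):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>
    #include <rte_common.h>
    #include <rte_memory.h>

    /* sketch: create a DMA window starting at 0, covering create_len bytes */
    static int
    spapr_create_window(int vfio_container_fd, uint64_t create_len)
    {
    	struct vfio_iommu_spapr_tce_create create = {
    		.argsz = sizeof(create),
    		/* window size must be a power of two */
    		.window_size = rte_align64pow2(create_len),
    		.page_shift = rte_log2_u64(RTE_PGSIZE_2M), /* placeholder */
    		.levels = 1,
    	};

    	if (ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create) < 0)
    		return -1;

    	/* the kernel reports where the window starts; per the above,
    	 * only 0 or 1<<59 are possible, and we expect 0 here */
    	return create.start_addr == 0 ? 0 : -1;
    }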

> 
>> 2) end at lowest addr + VmallocTotal OR highest memseglist addr, 
>> whichever is higher
> 
> So, instead of rte_memseg_walk() execute rte_memseg_list_walk() to find 
> the lowest/highest msl addresses?

Yep. rte_memseg_walk() will only cover allocated pages, while 
rte_memseg_list_walk() will also cover memseg lists that are still empty, 
i.e. VA space that has been reserved but not yet backed by any pages.
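
Something like this is what I have in mind (a minimal sketch, assuming 
IOVA-as-VA; the callback and function names are made up, but 
rte_memseg_list_walk() and the msl->base_va/len fields are real):

    #include <stdint.h>
    #include <rte_memory.h>

    /* hypothetical callback: track the highest VA covered by any memseg
     * list, whether or not it has pages allocated */
    static int
    find_max_end_va(const struct rte_memseg_list *msl, void *arg)
    {
    	uint64_t *max_va = arg;
    	uint64_t end = (uint64_t)(uintptr_t)msl->base_va + msl->len;

    	if (end > *max_va)
    		*max_va = end;
    	return 0;
    }

    /* since the window has to start at 0, its size is simply the highest
     * end address of any memseg list */
    static uint64_t
    compute_window_size(void)
    {
    	uint64_t max_va = 0;

    	rte_memseg_list_walk(find_max_end_va, &max_va);
    	return max_va;
    }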

> 
>> 3) a check in user DMA map function that would warn/throw an error 
>> whenever there is an attempt to map an address for DMA that doesn't 
>> fit into the DMA window
> 
> Isn't this mostly prevented by the use of  rte_mem_set_dma_mask() and 
> rte_mem_check_dma_mask()?  I'd expect an error would be thrown by the 
> kernel IOMMU API for an out-of-range mapping that I would simply return 
> to the caller (drivers/vfio/vfio_iommu_spapr_tce.c includes the comment 
> /* iova is checked by the IOMMU API */).  Why do you think double 
> checking this would help?

I don't think we check rte_mem_check_dma_mask() anywhere in the call 
path of the external memory code. Also, I just checked, and you're right: 
rte_vfio_container_dma_map() will fail if the kernel fails to map the 
memory. However, nothing will fail for external memory, because the IOVA 
addresses aren't checked against the DMA mask.

See malloc_heap.c:1097 onwards: we simply add user-specified IOVA 
addresses into the page table without checking whether they fit into the 
DMA mask. The DMA mapping will then happen through a mem event callback, 
but we don't check the return value of that callback either, so even if 
DMA mapping fails, we'll only get a log message.
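
For context, this is roughly how those user-specified IOVA addresses get 
in (an illustrative snippet only; the wrapper and heap name are made up, 
the two API calls are real):

    #include <rte_malloc.h>
    #include <rte_memory.h>

    /* hypothetical wrapper: register user memory with user-supplied IOVAs.
     * Nothing on this path validates iova[] against the configured DMA mask. */
    static int
    register_ext_mem(void *va, size_t len, rte_iova_t *iova,
    		unsigned int n_pages, size_t page_sz)
    {
    	if (rte_malloc_heap_create("ext_heap") != 0)
    		return -1;

    	/* iova[] is taken at face value (see malloc_heap.c) */
    	return rte_malloc_heap_memory_add("ext_heap", va, len,
    			iova, n_pages, page_sz);
    }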

So, perhaps the real solution here is to add a DMA mask check into 
rte_malloc_heap_memory_add(), so that we check the IOVA addresses before 
we ever try to do anything with them. I'll submit a patch for this.
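
Something along these lines (a sketch only; the helper name is made up, 
and the real patch would take the mask width from the memory config 
rather than a parameter):

    #include <stdint.h>
    #include <rte_memory.h>

    /* hypothetical check: reject user-supplied IOVAs with bits set above
     * the configured DMA mask before they ever reach the heap */
    static int
    check_iova_mask(const rte_iova_t *iova, unsigned int n_pages,
    		uint8_t dma_maskbits)
    {
    	uint64_t mask;
    	unsigned int i;

    	if (dma_maskbits >= 64)
    		return 0; /* no restriction */

    	mask = ~((1ULL << dma_maskbits) - 1);
    	for (i = 0; i < n_pages; i++) {
    		/* checks the page start only; the end of each page could
    		 * be checked as well */
    		if (iova[i] & mask)
    			return -1;
    	}
    	return 0;
    }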

> 
>>
>> I think that would be best approach. Thoughts?
> 
> Dave


-- 
Thanks,
Anatoly

