[dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing

David Christensen drc at linux.vnet.ibm.com
Thu Apr 30 19:36:17 CEST 2020



On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
> On 30-Apr-20 12:29 AM, David Christensen wrote:
>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>> size in response to every new memory allocation. This is potentially
>> dangerous because all existing mappings need to be unmapped/remapped in
>> order to resize the DMA window, leaving hardware holding IOVA addresses
>> that are not properly prepared for DMA.  The new SPAPR code statically
>> assigns the DMA window size on first use, using the largest physical
>> memory address when IOVA=PA and the base_virtaddr + physical memory size
>> when IOVA=VA.  As a result, memory will only be unmapped when
>> specifically requested.
>>
>> Signed-off-by: David Christensen <drc at linux.vnet.ibm.com>
>> ---
> 
> Hi David,
> 
> I haven't yet looked at the code in detail (will do so later), but some 
> general comments and questions below.
> 
>> +        /*
>> +         * Read "System RAM" in /proc/iomem:
>> +         * 00000000-1fffffffff : System RAM
>> +         * 200000000000-201fffffffff : System RAM
>> +         */
>> +        FILE *fd = fopen(proc_iomem, "r");
>> +        if (fd == NULL) {
>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>> +            return -1;
>> +        }
> 
> A quick check on my machines shows that when cat'ing /proc/iomem as 
> non-root, you get zeroes everywhere, which leads me to believe that you 
> have to be root to get anything useful out of /proc/iomem. Since one of 
> the major selling points of VFIO is the ability to run as non-root, 
> depending on iomem kind of defeats the purpose a bit.

I observed the same thing on my system during development.  I didn't see 
anything that precluded support for RTE_IOVA_PA in the VFIO code.  Are 
you suggesting that I should explicitly not support that configuration? 
If you're attempting to use RTE_IOVA_PA then you're already required to 
run as root, so there shouldn't be an issue accessing this file.

>> +        return 0;
>> +
>> +    } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>> +        /* Set the DMA window to base_virtaddr + system memory size */
>> +        const char proc_meminfo[] = "/proc/meminfo";
>> +        const char str_memtotal[] = "MemTotal:";
>> +        int memtotal_len = sizeof(str_memtotal) - 1;
>> +        char buffer[256];
>> +        uint64_t size = 0;
>> +
>> +        FILE *fd = fopen(proc_meminfo, "r");
>> +        if (fd == NULL) {
>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>> +            return -1;
>> +        }
>> +        while (fgets(buffer, sizeof(buffer), fd)) {
>> +            if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>> +                size = rte_str_to_size(&buffer[memtotal_len]);
>> +                break;
>> +            }
>> +        }
>> +        fclose(fd);
>> +
>> +        if (size == 0) {
>> +            RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>> +                "in file %s\n", proc_meminfo);
>> +            return -1;
>> +        }
>> +
>> +        RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>> +        /* if no base virtual address is configured use 4GB */
>> +        spapr_dma_win_len = rte_align64pow2(size +
>> +            (internal_config.base_virtaddr > 0 ?
>> +            (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>> +        rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
> 
> I'm not sure of the algorithm for "memory size" here.
> 
> Technically, DPDK can reserve memory segments anywhere in the VA space 
> allocated by memseg lists. That space may be far bigger than system 
> memory (on a typical Intel server board you'd see 128GB of VA space 
> preallocated even though the machine itself might only have, say, 16GB 
> of RAM installed). The same applies to any other arch running on Linux, 
> so the window needs to cover at least RTE_MIN(base_virtaddr, lowest 
> memseglist VA address) and up to highest memseglist VA address. That's 
> not even mentioning the fact that the user may register external memory 
> for DMA which may cause the window to be of insufficient size to cover 
> said external memory.
> 
> I also think that in general, "system memory" metric is ill suited for 
> measuring VA space, because unlike system memory, the VA space is sparse 
> and can therefore span *a lot* of address space even though in reality 
> it may actually use very little physical memory.

I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:

VmallocTotal:   549755813888 kB

I tested it with 1GB hugepages and it works; I still need to check with 
2M pages as well.  If there's no alternative for sizing the window based 
on available system parameters, then I have another option which creates 
a new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X, 
where X is configured on the EAL command-line (--iova-base, --iova-len). 
I use these command-line values to create a static window.
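For what it's worth, the IOVA=VA sizing rule quoted above can be expressed
compactly.  This is only a sketch: align64pow2() reimplements the standard
round-up-to-power-of-two bit trick so the snippet is self-contained (DPDK
provides it as rte_align64pow2()), and spapr_window_len() is a hypothetical
helper name, not a function from the patch.

```c
#include <stdint.h>

/* Self-contained equivalent of DPDK's rte_align64pow2(): round v up to
 * the next power of two (a value that is already a power of two is
 * returned unchanged). */
static uint64_t align64pow2(uint64_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v + 1;
}

/* Sizing rule from the patch for IOVA=VA: the window must cover
 * base_virtaddr (or a default of 4GB when none is configured) plus
 * total system memory, rounded up to a power of two as the sPAPR
 * IOMMU requires. */
static uint64_t spapr_window_len(uint64_t memtotal, uint64_t base_virtaddr)
{
	uint64_t base = base_virtaddr > 0 ? base_virtaddr : 1ULL << 32;

	return align64pow2(memtotal + base);
}
```

For example, 16GB of RAM with no base_virtaddr configured gives 16GB + 4GB =
20GB, rounded up to a 32GB window.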

Dave



