[dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
Burakov, Anatoly
anatoly.burakov at intel.com
Fri May 1 11:06:28 CEST 2020
On 30-Apr-20 6:36 PM, David Christensen wrote:
>
>
> On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
>> On 30-Apr-20 12:29 AM, David Christensen wrote:
>>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>>> size in response to every new memory allocation. This is potentially
>>> dangerous because all existing mappings need to be unmapped/remapped in
>>> order to resize the DMA window, leaving hardware holding IOVA addresses
>>> that are not properly prepared for DMA. The new SPAPR code statically
>>> assigns the DMA window size on first use, using the largest physical
>>> memory address when IOVA=PA and the base_virtaddr + physical memory size
>>> when IOVA=VA. As a result, memory will only be unmapped when
>>> specifically requested.
>>>
>>> Signed-off-by: David Christensen <drc at linux.vnet.ibm.com>
>>> ---
>>
>> Hi David,
>>
>> I haven't yet looked at the code in detail (will do so later), but
>> some general comments and questions below.
>>
>>> + /*
>>> + * Read "System RAM" in /proc/iomem:
>>> + * 00000000-1fffffffff : System RAM
>>> + * 200000000000-201fffffffff : System RAM
>>> + */
>>> + FILE *fd = fopen(proc_iomem, "r");
>>> + if (fd == NULL) {
>>> + RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>>> + return -1;
>>> + }
>>
>> A quick check on my machines shows that when cat'ing /proc/iomem as
>> non-root, you get zeroes everywhere, which leads me to believe that
>> you have to be root to get anything useful out of /proc/iomem. Since
>> one of the major selling points of VFIO is the ability to run as
>> non-root, depending on iomem kind of defeats the purpose a bit.
>
> I observed the same thing on my system during development. I didn't see
> anything that precluded support for RTE_IOVA_PA in the VFIO code. Are
> you suggesting that I should explicitly not support that configuration?
> If you're attempting to use RTE_IOVA_PA then you're already required to
> run as root, so there shouldn't be an issue accessing this file.
Oh, right, forgot about that. That's OK then.
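(For reference, a minimal sketch of the kind of /proc/iomem parsing being
discussed here for the IOVA=PA case; the helper name and exact parsing are
illustrative only, not the patch's actual code:)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

/* Scan /proc/iomem for "System RAM" ranges and return the highest
 * physical end address, or 0 on failure. Requires root, since the
 * kernel reports zeroed addresses to unprivileged readers.
 */
static uint64_t
iomem_max_system_ram_addr(void)
{
	char line[256];
	uint64_t start, end, max_addr = 0;
	FILE *fd = fopen("/proc/iomem", "r");

	if (fd == NULL)
		return 0;

	while (fgets(line, sizeof(line), fd) != NULL) {
		/* lines look like "200000000000-201fffffffff : System RAM" */
		if (strstr(line, "System RAM") == NULL)
			continue;
		if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) == 2 &&
				end > max_addr)
			max_addr = end;
	}
	fclose(fd);
	return max_addr;
}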
>
>>> + return 0;
>>> +
>>> + } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>>> + /* Set the DMA window to base_virtaddr + system memory size */
>>> + const char proc_meminfo[] = "/proc/meminfo";
>>> + const char str_memtotal[] = "MemTotal:";
>>> + int memtotal_len = sizeof(str_memtotal) - 1;
>>> + char buffer[256];
>>> + uint64_t size = 0;
>>> +
>>> + FILE *fd = fopen(proc_meminfo, "r");
>>> + if (fd == NULL) {
>>> + RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>>> + return -1;
>>> + }
>>> + while (fgets(buffer, sizeof(buffer), fd)) {
>>> + if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>>> + size = rte_str_to_size(&buffer[memtotal_len]);
>>> + break;
>>> + }
>>> + }
>>> + fclose(fd);
>>> +
>>> + if (size == 0) {
>>> + RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>>> + "in file %s\n", proc_meminfo);
>>> + return -1;
>>> + }
>>> +
>>> + RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>>> + /* if no base virtual address is configured use 4GB */
>>> + spapr_dma_win_len = rte_align64pow2(size +
>>> + (internal_config.base_virtaddr > 0 ?
>>> + (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>>> + rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
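(Worked example of the sizing arithmetic above, with assumed values: if
MemTotal is 16 GiB and no base_virtaddr is configured, the window becomes
rte_align64pow2(16 GiB + 4 GiB) = 32 GiB = 2^35, and __builtin_ctzll() on
that value yields a 35-bit DMA mask.)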
>>
>> I'm not sure of the algorithm for "memory size" here.
>>
>> Technically, DPDK can reserve memory segments anywhere in the VA space
>> allocated by memseg lists. That space may be far bigger than system
>> memory (on a typical Intel server board you'd see 128GB of VA space
>> preallocated even though the machine itself might only have, say, 16GB
>> of RAM installed). The same applies to any other arch running on
>> Linux, so the window needs to cover at least RTE_MIN(base_virtaddr,
>> lowest memseglist VA address) and up to highest memseglist VA address.
>> That's not even mentioning the fact that the user may register
>> external memory for DMA which may cause the window to be of
>> insufficient size to cover said external memory.
>>
>> I also think that in general, "system memory" metric is ill suited for
>> measuring VA space, because unlike system memory, the VA space is
>> sparse and can therefore span *a lot* of address space even though in
>> reality it may actually use very little physical memory.
>
> I'm open to suggestions here. Perhaps an alternative in /proc/meminfo:
>
> VmallocTotal: 549755813888 kB
>
> I tested it with 1GB hugepages and it works; I still need to check with
> 2M as well. If there's no alternative for sizing the window based on
> available system parameters then I have another option which creates a
> new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X
> where X is configured on the EAL command-line (--iova-base, --iova-len).
> I use these command-line values to create a static window.
>
A whole new IOVA mode, while being a cleaner solution, would require a
lot of testing, and it doesn't really solve the external memory problem,
because we're still reliant on the user to provide IOVA addresses.
Perhaps something akin to VA/IOVA address reservation would solve the
problem, but again, lots of changes and testing, all for a comparatively
narrow use case.
The vmalloc area seems big enough (512 terabytes on your machine, 32
terabytes on mine), so it'll probably be OK. I'd settle for:
1) start at base_virtaddr OR lowest memseg list address, whichever is lower
2) end at lowest addr + VmallocTotal OR highest memseglist addr,
whichever is higher
3) a check in the user DMA map function that would warn/throw an error
whenever there is an attempt to map an address for DMA that doesn't fit
into the DMA window
I think that would be the best approach. Thoughts?
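(Rough sketch of how (1)-(3) above might look, assuming the existing
rte_memseg_list_walk() EAL API and its msl->base_va/msl->len fields; the
helpers and the window-length derivation are illustrative only, and
external memory is left aside:)

#include <stdint.h>
#include <inttypes.h>
#include <rte_common.h>
#include <rte_memory.h>
#include <rte_log.h>

struct win_bounds {
	uint64_t lo;	/* lowest address the window must cover */
	uint64_t hi;	/* highest address the window must cover */
};

/* callback: extend the bounds to cover one memseg list */
static int
memseg_list_bounds(const struct rte_memseg_list *msl, void *arg)
{
	struct win_bounds *b = arg;
	uint64_t start = (uint64_t)(uintptr_t)msl->base_va;

	b->lo = RTE_MIN(b->lo, start);
	b->hi = RTE_MAX(b->hi, start + msl->len);
	return 0;
}

static struct win_bounds
spapr_window_bounds(uint64_t base_virtaddr, uint64_t vmalloc_total)
{
	struct win_bounds b = { .lo = UINT64_MAX, .hi = 0 };

	rte_memseg_list_walk(memseg_list_bounds, &b);
	/* 1) start at base_virtaddr or lowest memseg list VA */
	b.lo = RTE_MIN(b.lo, base_virtaddr);
	/* 2) end at start + VmallocTotal or highest memseg list VA */
	b.hi = RTE_MAX(b.hi, b.lo + vmalloc_total);
	return b;
}

/* 3) sanity check in the user DMA map path: assuming the sPAPR window
 * is based at IOVA 0 and spans win_len bytes, anything past it cannot
 * be mapped.
 */
static int
check_user_dma_addr(uint64_t iova, uint64_t len, uint64_t win_len)
{
	if (iova + len > win_len) {
		RTE_LOG(ERR, EAL, "attempt to map 0x%" PRIx64
			" outside DMA window\n", iova);
		return -1;
	}
	return 0;
}

Since the sPAPR window must be a power of two in size, one option would
then be to size it as rte_align64pow2(b.hi).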
> Dave
>
--
Thanks,
Anatoly