[dpdk-dev] [PATCH] pci_vfio: Support 64KB kernel page_size with vfio-pci driver

Burakov, Anatoly anatoly.burakov at intel.com
Wed Nov 7 11:12:53 CET 2018


On 07-Nov-18 4:55 AM, Tone Zhang (Arm Technology China) wrote:
> Hi Burakov,
> 
> Please find my test case below. Thanks!
> 
> Br,
> Tone
> 
> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov at intel.com>
> Sent: Tuesday, November 6, 2018 7:03 PM
> To: Tone Zhang (Arm Technology China) <Tone.Zhang at arm.com>; dev at dpdk.org
> Cc: nd <nd at arm.com>
> Subject: Re: [dpdk-dev] [PATCH] pci_vfio: Support 64KB kernel page_size with vfio-pci driver
> 
> On 03-Nov-18 5:46 AM, Tone Zhang (Arm Technology China) wrote:
>> Hi Burakov,
>>
>> Thanks!
>> Please check my feedback below.
>>
>> Br,
>> Tone
>>
>> -----Original Message-----
>> From: dev <dev-bounces at dpdk.org> On Behalf Of Burakov, Anatoly
>> Sent: Thursday, November 1, 2018 6:01 PM
>> To: Tone Zhang (Arm Technology China) <Tone.Zhang at arm.com>;
>> dev at dpdk.org
>> Cc: nd <nd at arm.com>
>> Subject: Re: [dpdk-dev] [PATCH] pci_vfio: Support 64KB kernel
>> page_size with vfio-pci driver
>>
>> On 01-Nov-18 2:33 AM, Tone Zhang (Arm Technology China) wrote:
>>> Hi Burakov,
>>>
>>> I'm sorry for the late response.
>>>
>>> Thanks a lot for your comments. Please find my response below (marked
>>> with "Tone:"). 😊
>>>
>>> Br,
>>> Tone
>>>
>>> -----Original Message-----
>>> From: Burakov, Anatoly <anatoly.burakov at intel.com>
>>> Sent: Wednesday, October 24, 2018 5:09 PM
>>> To: Tone Zhang (Arm Technology China) <Tone.Zhang at arm.com>;
>>> dev at dpdk.org
>>> Cc: nd <nd at arm.com>
>>> Subject: Re: [dpdk-dev] [PATCH] pci_vfio: Support 64KB kernel
>>> page_size with vfio-pci driver
>>>
>>> On 24-Oct-18 3:20 AM, tone.zhang wrote:
>>>> With a larger PAGE_SIZE it is possible for the MSI table to very
>>>> close to the end of the BAR s.t. when we align the MSI table to the
>>>> PAGE_SIZE, the end offset of the MSI table is out the PCI BAR
>>>> boundary.
>>>>
>>>> This patch addresses the issue by comparing both the start and the
>>>> end offset of the MSI table with the BAR size.
>>>>
>>>> The patch fixes the debug log as below:
>>>> EAL: Skipping BAR0
>>>>
>>>> Signed-off-by: tone.zhang <tone.zhang at arm.com>
>>>> Reviewed-by: Gavin Hu <Gavin.Hu at arm.com>
>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com>
>>>> Reviewed-by: Steve Capper <Steve.Capper at arm.com>
>>>> ---
>>>>      drivers/bus/pci/linux/pci_vfio.c | 25 +++++++++++++++++++++----
>>>>      1 file changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/bus/pci/linux/pci_vfio.c
>>>> b/drivers/bus/pci/linux/pci_vfio.c
>>>> index b1f0683..1373345 100644
>>>> --- a/drivers/bus/pci/linux/pci_vfio.c
>>>> +++ b/drivers/bus/pci/linux/pci_vfio.c
>>>> @@ -445,9 +445,11 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
>>>>      	struct pci_msix_table *msix_table = &vfio_res->msix_table;
>>>>      	struct pci_map *bar = &vfio_res->maps[bar_index];
>>>>      
>>>> -	if (bar->size == 0)
>>>> +	if (bar->size == 0) {
>>>>      		/* Skip this BAR */
>>>> +		RTE_LOG(INFO, EAL, "Skipping this BAR%d\n", bar_index);
>>>>      		return 0;
>>>
>>> I feel like "this" is unnecessary here - just "Skipping BAR%d" should
>>> be enough :)
>>>
>>> Tone: Will update code and remove "this" in next version.
>>>
>>>> +	}
>>>>      
>>>>      	if (msix_table->bar_index == bar_index) {
>>>>      		/*
>>>> @@ -457,7 +459,12 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
>>>>      		uint32_t table_start = msix_table->offset;
>>>>      		uint32_t table_end = table_start + msix_table->size;
>>>>      		table_end = (table_end + ~PAGE_MASK) & PAGE_MASK;
>>>> -		table_start &= PAGE_MASK;
>>>> +		table_start = (table_start + ~PAGE_MASK) & PAGE_MASK;
>>>
>>> IMO these two additions should be replaced by RTE_ALIGN by page size.
>>> Makes the purpose of the code much clearer.
>>>
>>> Tone: Sure, it is better! Will update code in next version. Thanks!
>>>
>>>> +		/* after rounding to PAGE_SIZE, it is over the bar->size,
>>>> +		 * fall back to the MSI-X table offset in the bar.
>>>> +		*/
>>>> +		if (table_start >= bar->size)
>>>> +			table_start = msix_table->offset;
>>>
>>> If i understand things correctly, msix_table->offset value here may be unaligned, so falling back to this value may cause mapping failure, because we later use this value as a size of mapping (which needs to be page aligned). Shouldn't this be aligned using RTE_ALIGN_FLOOR by page size?
>>>
>>> Tone: It is a little tricky. Align msix_table->offset with RTE_ALIGN_FLOOR maybe get 0 if the offset is less than page size in the PCI bar. It will trigger mmap() error. IIRC the input parameter "size" in mmap() is not required to be aligned with page size, system will do it. But it is better if we can do it. If I was wrong, please correct me. Thanks a lot.
>>
>> Apologies, you're correct - length can be misaligned (just tested it).
>>
>> However, i think it's still worth aligning (and putting in an additional check), because we want to make sure we *don't* attempt to map the MSI-X BAR, and kernel might do that by adjusting length automatically and return mmap failure that way.
>>
>> Tone: Thanks a lot! I agree with you. It worth aligning the size. I will update code (RTE_ALIGN_FLOOR by page size) in next version.
>> I'd like to discuss one case with you. In the case, base->size is 16384, msix_table->offset is 8192, page_size is 65536. After align "msix_table->offset" with page_size (RTE_ALIGN_FLOOR), the value of "table_start" is 0, mmap() will report error, and the memory mapping is failed.
>> For the case (table_start is 0 after the aglinment), may I continue falling back the "table_start" to " msix_table->offset" (not aligned with page size), and left system adjust the length automatically? Thanks!
> 
> Please correct me if i'm wrong, but this is a code path for when we're trying to mmap around the MSI-X BAR. Kernel will not allow us to do that, period, so whatever start/end addresses you get, they *must not* include a single byte of MSI-X BAR. So, in case like you described, i think we should just straight up refuse the map the entire BAR.
> 
> However, as i do not have a system with such properties to test on, so please correct me if i'm wrong here :)
> 
> 
> Tone: I understand and agree with you.  😊
> 
> Please have a look at my test case. In my case, I tried to bind NVMe device with VFIO driver and the kernel page size is 64KB. Without the change, the test is failed.
> 
>  From the debug information, I observed that "bar->size" is 16384, "msix_table->offset" is 8192 and "msix_table->size" is 512. Regarding the page size is much bigger than the "bar->size", in the change, the code maps the first 8192 bytes ahead of MSI-X table. After align with the page size boundary, the "start" offset after the MSI-X table is over "bar->size", mmap() reports error. In this case, I can only map the memory before the MSI-X table. After fall back "table_start" to " msix_table->offset " (i.e. 8192 bytes), and NOT mapping the memory behind MSI-X table, the NVMe device can be bound to VFIO driver, and the test is passed. The kernel version in my test environment is 4.16.
> 
> So in the change, I do not map any byte of MSI-X table, unfortunately I cannot align the memory "size" in mmap() to page size boundary. From the test result, the change fixes the error. The case looks a little tricky. If we refuse the memory map here, it means we cannot bind VFIO driver with some PCI devices with 64KB kernel page size. I hope we can support such case in DPDK. 😊

Hi Tone,

If it works and doesn't impact any other cases, i'm happy to include the 
above change (i.e. fall back to unaligned offset if aligning it results 
in zero offset). I'm curious what would happen if there's something else 
after the MSI-X table as well, and what alignments would be required for 
that... But i would rather wait for someone to come to us with an actual 
test case for that as well :)

> 
>>
>> --
>> Thanks,
>> Anatoly
>>
> 
> 
> --
> Thanks,
> Anatoly
> 


-- 
Thanks,
Anatoly


More information about the dev mailing list