[dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode before mapping

Maxime Coquelin maxime.coquelin at redhat.com
Thu Jul 6 12:59:04 CEST 2017



On 07/06/2017 11:49 AM, Jerin Jacob wrote:
> -----Original Message-----
>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>> From: Maxime Coquelin <maxime.coquelin at redhat.com>
>> To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>> CC: Santosh Shukla <santosh.shukla at caviumnetworks.com>,
>>   thomas at monjalon.net, bruce.richardson at intel.com, dev at dpdk.org,
>>   hemant.agrawal at nxp.com, shreyansh.jain at nxp.com, gaetan.rivet at 6wind.com
>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>   before mapping
>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.1.0
>>
>>
>>
>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>> -----Original Message-----
>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>> From: Maxime Coquelin <maxime.coquelin at redhat.com>
>>>> To: Santosh Shukla <santosh.shukla at caviumnetworks.com>,
>>>>    thomas at monjalon.net, bruce.richardson at intel.com, dev at dpdk.org
>>>> CC: jerin.jacob at caviumnetworks.com, hemant.agrawal at nxp.com,
>>>>    shreyansh.jain at nxp.com, gaetan.rivet at 6wind.com
>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>    before mapping
>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>    Thunderbird/52.1.0
>>>>
>>>>
>>>>
>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>
>>>>> Signed-off-by: Santosh Shukla<santosh.shukla at caviumnetworks.com>
>>>>> Signed-off-by: Jerin Jacob<jerin.jacob at caviumnetworks.com>
>>>>> ---
>>>>>     lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>     1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>> index 04914406f..348b7a7f4 100644
>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>     		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>     		dma_map.vaddr = ms[i].addr_64;
>>>>>     		dma_map.size = ms[i].len;
>>>>> -		dma_map.iova = ms[i].phys_addr;
>>>>> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>> +			dma_map.iova = dma_map.vaddr;
>>>>> +		else
>>>>> +			dma_map.iova = ms[i].phys_addr;
>>>>>     		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>
>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>
>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>
>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>> used in separate processes. Each process could try two different
>>>> physical addresses at the same virtual address, and so the second map
>>>> would fail.
>>>
>>> IMO, Doesn't look like a problem. Here is the data flow
>>>
>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>> on primary process
>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>
>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>> that, the Secondary process has the _same_ virtual address as primary or
>>> exit from on attach.
>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>
>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>> in the page table in OS. On SMMU entry miss(When device
>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>> "context" with page tables from MMU.
>>
>> Ok thanks for the detailed info, but what about the case where the same
>> iommu group is used by two primary processes?
> 
> Does that case exist with DPDK? We always need to blacklist same BDF in
> the secondary process to make things work with existing DPDK setup. Which
> make sense as well. Only primary process configures the HW blocks.

I meant the case when two BDF are in the same IOMMU group (if ACS is not
supported at some point in the hierarchy). And I meant two primary
processes running, like for example two containers running each a DPDK
application.

Maybe this is not a valid use-case (it is not secure, as it would break
isolation between the two containers), but it seems that it is something
DPDK allows today, if I'm not mistaken.

>>
>> I don't know how frequent it is, but if ACS is not supported by either the
>> endpoint or the the root port, then you would have to share the same IOMMU
>> group for all the ports of your card. Right?
> 
> ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF
> comes as a IOMMU group.
> 
> If it in bypass mode anyway you use in vfio-noiommu mode as
> there is no protection anyway.
Yes.

>>
>>> Let me add the background for why this feature is required in DPDK to
>>> enable NPU style co-processors.
>>>
>>> The traditional NICs the Rx path code look like this:
>>> 1) On control path, Fill the mempool with buffers
>>> 2) on rx_burst(), alloc the mbuf from mempool
>>> 3) SW has the mbuf in hand(which is a virtual address) and program the
>>> HW with mbuf->buf_physaddr)
>>> 4) Return the last pushed mbuf(will be updated by HW by now)
>>>
>>>
>>> On NPU style co-processors, situation is different as the buffer recycling
>>> has been done in HW unlike SW model. Here is the data flow:
>>> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
>>> address, which is PA in existing model)
>>> 2) on rx_burst, HW gives you IOVA address(as address as step 1)
>>> 3) As application expects VA to operate on it, rx_burst() needs to
>>> convert to VA from PAA. Which is very costly.
>>> Instead with this IOVA as VA scheme, We can avoid the cost of converting
>>> with help of IOMMU/SMMU.
>>>
>>> This patch set auto detects the mode based available of type devices in
>>> bus and provides an option to override mode based on eal argument, so we
>>> don't foresee any issue with this approach and welcome any alternative
>>> approaches.
>>
>> I don't question the need of the feature for these kind of
>> co-processors, using VA as IOVA in your case seems very valid.
>>
>> What concerns me is that we change the default behavior for all other
>> devices. Having an option to override is fine to me, but the default
>> mode should remain the same IMHO.
> 
> Doesn't seems to be a technical point. But I agree with your concern.
> we will address it.
> I think, we have two ways to address it.
> 
> option 1:
> - In existing patch,
> a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA)
>    http://dpdk.org/dev/patchwork/patch/25192
> b) only when with eal argument sets to RTE_IOVA_VA and then bus probed
> value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA
> http://dpdk.org/dev/patchwork/patch/25193/
> check the code after rte_bus_scan()
> 
> option 2:
> On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/
> we can check the rte_pci_device.id.vendor_id == CAVIUM to select the
> mode so other type of devices safe.
> 
> I think, option 2 makes sense, as it gives foolproof auto detection scheme and
> without effecting any other devices that not interested in this scheme
> 
> Does that address your concern about the patchset?

Yes it does, or maybe create a new flag in struct rte_pci_driver's
drv_flags to provide a hint it prefers to use VA as IOVA?

It, of course, would just be a hint, and should be set only when other
conditions are met.

>> Wouldn't it be possible to default to VA as IOVA only when an HW mempool
>> is in use?
> 
> It will be too late as in the normal scheme of things, application
> creates the pool.

OK, makes sense.

Thanks,
Maxime

>>
>>> Similar problem exists in another part of the code in DPDK,
>>> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
>>> Its a conditional compilation based approach with duplicating the vfio
>>> code and we are trying to fix the problem in a generic way so that
>>> everyone can get benefited out of it.
>>>
>>> Comments are welcome.
>>
>> Thanks,
>> Maxime
>>
>>> /Jerin
>>>
>>>>
>>>> By using physical addresses, you are safe against this problem.
>>>>
>>>> Any thoughts?
>>>>
>>>> Cheers,
>>>> Maxime


More information about the dev mailing list