[dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode before mapping

santosh santosh.shukla at caviumnetworks.com
Thu Jul 6 13:19:06 CEST 2017


On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote:

>
> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
>> -----Original Message-----
>>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>>> From: Maxime Coquelin <maxime.coquelin at redhat.com>
>>> To: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>>> CC: Santosh Shukla <santosh.shukla at caviumnetworks.com>,
>>>   thomas at monjalon.net, bruce.richardson at intel.com, dev at dpdk.org,
>>>   hemant.agrawal at nxp.com, shreyansh.jain at nxp.com, gaetan.rivet at 6wind.com
>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>   before mapping
>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>   Thunderbird/52.1.0
>>>
>>>
>>>
>>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>>> -----Original Message-----
>>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>>> From: Maxime Coquelin <maxime.coquelin at redhat.com>
>>>>> To: Santosh Shukla <santosh.shukla at caviumnetworks.com>,
>>>>>    thomas at monjalon.net, bruce.richardson at intel.com, dev at dpdk.org
>>>>> CC: jerin.jacob at caviumnetworks.com, hemant.agrawal at nxp.com,
>>>>>    shreyansh.jain at nxp.com, gaetan.rivet at 6wind.com
>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>    before mapping
>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>    Thunderbird/52.1.0
>>>>>
>>>>>
>>>>>
>>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>>
>>>>>> Signed-off-by: Santosh Shukla<santosh.shukla at caviumnetworks.com>
>>>>>> Signed-off-by: Jerin Jacob<jerin.jacob at caviumnetworks.com>
>>>>>> ---
>>>>>>     lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>>     1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> index 04914406f..348b7a7f4 100644
>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>>             dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>             dma_map.vaddr = ms[i].addr_64;
>>>>>>             dma_map.size = ms[i].len;
>>>>>> -        dma_map.iova = ms[i].phys_addr;
>>>>>> +        if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>>> +            dma_map.iova = dma_map.vaddr;
>>>>>> +        else
>>>>>> +            dma_map.iova = ms[i].phys_addr;
>>>>>>             dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>>
>>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>>
>>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>>
>>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>>> used in separate processes. Each process could try two different
>>>>> physical addresses at the same virtual address, and so the second map
>>>>> would fail.
>>>>
>>>> IMO, Doesn't look like a problem. Here is the data flow
>>>>
>>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>>> on primary process
>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>>
>>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>>> that, the Secondary process has the _same_ virtual address as primary or
>>>> exit from on attach.
>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>>
>>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>>> in the page table in OS. On SMMU entry miss(When device
>>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>>> "context" with page tables from MMU.
>>>
>>> Ok thanks for the detailed info, but what about the case where the same
>>> iommu group is used by two primary processes?
>>
>> Does that case exist with DPDK? We always need to blacklist same BDF in
>> the secondary process to make things work with existing DPDK setup. Which
>> make sense as well. Only primary process configures the HW blocks.
>
> I meant the case when two BDF are in the same IOMMU group (if ACS is not
> supported at some point in the hierarchy). And I meant two primary
> processes running, like for example two containers running each a DPDK
> application.
>
> Maybe this is not a valid use-case (it is not secure, as it would break
> isolation between the two containers), but it seems that it is something
> DPDK allows today, if I'm not mistaken.
>
I'm not sure how two primary process could run, as because latter primary process
would try accessing /var/run/.rte_config and would fail at this [1] point.

It's not valid use-case for dpdk (imo).
[1] http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204

>>>
>>> I don't know how frequent it is, but if ACS is not supported by either the
>>> endpoint or the the root port, then you would have to share the same IOMMU
>>> group for all the ports of your card. Right?
>>
>> ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF
>> comes as a IOMMU group.
>>
>> If it in bypass mode anyway you use in vfio-noiommu mode as
>> there is no protection anyway.
> Yes.
>
>>>
>>>> Let me add the background for why this feature is required in DPDK to
>>>> enable NPU style co-processors.
>>>>
>>>> The traditional NICs the Rx path code look like this:
>>>> 1) On control path, Fill the mempool with buffers
>>>> 2) on rx_burst(), alloc the mbuf from mempool
>>>> 3) SW has the mbuf in hand(which is a virtual address) and program the
>>>> HW with mbuf->buf_physaddr)
>>>> 4) Return the last pushed mbuf(will be updated by HW by now)
>>>>
>>>>
>>>> On NPU style co-processors, situation is different as the buffer recycling
>>>> has been done in HW unlike SW model. Here is the data flow:
>>>> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
>>>> address, which is PA in existing model)
>>>> 2) on rx_burst, HW gives you IOVA address(as address as step 1)
>>>> 3) As application expects VA to operate on it, rx_burst() needs to
>>>> convert to VA from PAA. Which is very costly.
>>>> Instead with this IOVA as VA scheme, We can avoid the cost of converting
>>>> with help of IOMMU/SMMU.
>>>>
>>>> This patch set auto detects the mode based available of type devices in
>>>> bus and provides an option to override mode based on eal argument, so we
>>>> don't foresee any issue with this approach and welcome any alternative
>>>> approaches.
>>>
>>> I don't question the need of the feature for these kind of
>>> co-processors, using VA as IOVA in your case seems very valid.
>>>
>>> What concerns me is that we change the default behavior for all other
>>> devices. Having an option to override is fine to me, but the default
>>> mode should remain the same IMHO.
>>
>> Doesn't seems to be a technical point. But I agree with your concern.
>> we will address it.
>> I think, we have two ways to address it.
>>
>> option 1:
>> - In existing patch,
>> a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA)
>>    http://dpdk.org/dev/patchwork/patch/25192
>> b) only when with eal argument sets to RTE_IOVA_VA and then bus probed
>> value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA
>> http://dpdk.org/dev/patchwork/patch/25193/
>> check the code after rte_bus_scan()
>>
>> option 2:
>> On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/
>> we can check the rte_pci_device.id.vendor_id == CAVIUM to select the
>> mode so other type of devices safe.
>>
>> I think, option 2 makes sense, as it gives foolproof auto detection scheme and
>> without effecting any other devices that not interested in this scheme
>>
>> Does that address your concern about the patchset?
>
> Yes it does, or maybe create a new flag in struct rte_pci_driver's
> drv_flags to provide a hint it prefers to use VA as IOVA?
>
> It, of course, would just be a hint, and should be set only when other
> conditions are met.
>
>>> Wouldn't it be possible to default to VA as IOVA only when an HW mempool
>>> is in use?
>>
>> It will be too late as in the normal scheme of things, application
>> creates the pool.
>
> OK, makes sense.
>
> Thanks,
> Maxime
>
>>>
>>>> Similar problem exists in another part of the code in DPDK,
>>>> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
>>>> Its a conditional compilation based approach with duplicating the vfio
>>>> code and we are trying to fix the problem in a generic way so that
>>>> everyone can get benefited out of it.
>>>>
>>>> Comments are welcome.
>>>
>>> Thanks,
>>> Maxime
>>>
>>>> /Jerin
>>>>
>>>>>
>>>>> By using physical addresses, you are safe against this problem.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Cheers,
>>>>> Maxime



More information about the dev mailing list