[dpdk-dev] [PATCH 0/6] introduce DMA memory mapping for external memory

Alejandro Lucero alejandro.lucero at netronome.com
Thu Feb 14 13:27:38 CET 2019


On Thu, Feb 14, 2019 at 12:22 PM Alejandro Lucero <
alejandro.lucero at netronome.com> wrote:

>
>
> On Wed, Feb 13, 2019 at 7:24 PM Shahaf Shuler <shahafs at mellanox.com>
> wrote:
>
>> Wednesday, February 13, 2019 1:43 PM, Alejandro Lucero:
>> > Subject: Re: [dpdk-dev] [PATCH 0/6] introduce DMA memory mapping for
>> > external memory
>> >
>> > On Wed, Feb 13, 2019 at 9:11 AM Shahaf Shuler <shahafs at mellanox.com>
>> > wrote:
>> >
>> > > This series is a continuation of RFC [1].
>> > >
>> > > The DPDK APIs expose 3 different modes to work with memory used for
>> > DMA:
>> > >
>> > > 1. Use the DPDK-owned memory (backed by the DPDK-provided hugepages).
>> > > This memory is allocated by the DPDK libraries, included in the DPDK
>> > > memory system (memseg lists) and automatically DMA mapped by the DPDK
>> > > layers.
>> > >
>> > > 2. Use memory allocated by the user and registered to the DPDK memory
>> > > system. This is also referred to as external memory. Upon registration
>> > > of the external memory, the DPDK layers will DMA map it to all needed
>> > > devices.
>> > >
>> > > 3. Use memory allocated by the user and not registered to the DPDK
>> > > memory system. This is for users who want to have tight control over
>> > > this memory. The user will need to explicitly call the DMA map function
>> > > in order to register such memory to the different devices.
>> > >
>> > > The scope of this patch series focuses on #3 above.
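(For contrast with #3: a minimal sketch of the #2 path, assuming the
rte_extmem_register() signature from recent DPDK releases and a single
IOVA-contiguous 2 MB chunk; error handling trimmed.)

#include <rte_memory.h>   /* rte_extmem_register(), rte_iova_t */
#include <rte_errno.h>

#define EXT_LEN      (2UL * 1024 * 1024)  /* one 2 MB page, for illustration */
#define EXT_PAGE_SZ  (2UL * 1024 * 1024)

/* #2: hand a user-allocated, page-aligned chunk to the DPDK memory
 * subsystem; per the description above, DPDK then takes care of DMA
 * mapping it for the devices that need it. */
static int
register_external_chunk(void *va, rte_iova_t iova)
{
	rte_iova_t iovas[1] = { iova };

	if (rte_extmem_register(va, EXT_LEN, iovas, 1, EXT_PAGE_SZ) != 0)
		return -rte_errno;
	return 0;
}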
>> > >
>> > >
>> > Why can't we have case 2 covering case 3?
>>
>> Because it is not our choice, but rather the DPDK application's.
>> We could disallow it and force applications to register their
>> external memory to the DPDK memory management system. However, IMO that
>> would be wrong.
>> The use case exists - some applications want to manage their memory by
>> themselves, without the extra overhead of rte_malloc, without creating a
>> special socket to populate the memory and without redundant API calls to
>> rte_extmem_*.
>>
>> Simply allocate a chunk of memory, DMA map it to the device, and that’s it.
>>
>>
> Usability is a strong point, but only up to some extent. DPDK is all about
> performance, and adding options the user can choose from will add pressure
> and complexity for keeping that performance. Your proposal makes sense from
> a user point of view, but will it avoid having to modify things in the DPDK
> core to support this case broadly in the future? Multiprocess will be hard
> to get, if not impossible, without adding more complexity, and although you
> likely do not expect that use case to require multiprocess support, once we
> have DPDK apps using this model, sooner or later those companies with
> products based on such an option will demand broader support. I foresee
> that more than just multiprocess support will require changes in the future.
>
> This reminds me of the problem of obtaining real time: the more complexity,
> the less determinism can be obtained. It is not impossible, it is simply far
> more complex. Pure real-time operating systems can add new functionality,
> but it is hard to do it properly without jeopardising the main goal.
> General-purpose operating systems can try to improve determinism, but only
> up to some extent and with important complexity costs. DPDK is the real-time
> operating system in this comparison.
>
>
>> >
>> >
>> > > Currently the only way to map external memory is through VFIO
>> > > (rte_vfio_dma_map). While VFIO is common, there are other vendors
>> > > which use different ways to map memory (e.g. Mellanox and NXP).
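(For reference, the VFIO-only path available today looks roughly like this;
a sketch assuming IOVA-as-VA so the virtual address doubles as the IOVA,
with error handling trimmed.)

#include <stdint.h>
#include <rte_vfio.h>   /* rte_vfio_dma_map() */

/* Map a user-allocated buffer for DMA through the default VFIO container.
 * This only works when the device is bound to vfio-pci; there is no
 * equivalent generic hook for other mapping mechanisms. */
static int
map_for_dma_via_vfio(void *buf, uint64_t len)
{
	uint64_t vaddr = (uint64_t)(uintptr_t)buf;

	return rte_vfio_dma_map(vaddr, vaddr /* iova */, len);
}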
>> > >
>> > >
>> > As you say, VFIO is common, and when DMAs are programmed from user
>> > space, it is the right thing to do.
>>
>> It is common indeed. Why is it the right thing to do?
>>
>>
> Compared with UIO, for sure. VFIO does have the right view of the system
> in terms of which devices can properly be isolated. Can you confirm that a
> specific implementation by a vendor can ensure the same behaviour? If so,
> do you then have duplicated code? If the answer is that you are using VFIO
> data, why not use VFIO as the interface and add the required connection
> between VFIO and the drivers?
>
> What about mapping validation? Is the driver doing that part, or relying on
> kernel code? Or is it just assuming the mapping is safe?
>
>
>> > I'm assuming there is IOMMU hardware and this is what Mellanox and NXP
>> > rely on in some way or another.
>>
>> For Mellanox, the device works with virtual memory, not physical. If you
>> think about it, it is more secure for a user space application. The
>> Mellanox device has an internal memory translation unit between virtual
>> memory and physical memory.
>> An IOMMU can be added on top of it, in case the host doesn't trust the
>> device or the device is given to an untrusted entity like a VM.
>>
>>
> Any current NIC or device will work with virtual addresses if an IOMMU is
> in place, no matter if the device is IOMMU-aware or not. Any vendor, with
> that capability in their devices, should follow generic paths and a common
> interface, with the vendor drivers being the executors. The drivers know
> how to tell the device, but they should be told what to tell, and not by
> the user but by the kernel.
>
> I think reading your comment "in case the host doesn't trust the device"
> makes it easier to understand what you are trying to obtain, and at the
> same time makes my concerns not a problem at all. This is a case for DPDK
> being used in certain scenarios where the full system is trusted, which I
> think is a completely rightful option. My only concern then is the
> complexity it could imply sooner or later, although I have to admit it is
> not a strong one :-)
>
>

I forgot to mention the problem of leaving that option open on not fully
trusted systems. I do not know how it could be avoided, maybe with some
checks in EAL initialization, but maybe this is not possible at all. Anyway,
I think this is worth discussing further.


> >
>> > Having each driver doing things in its own way will end up in a system
>> > that is harder to validate.
>>
>> Different vendors will have different HW implementations. We cannot force
>> everybody to align on the IOMMU.
>> What we can do is ease the user's life and provide vendor-agnostic APIs
>> which just provide the needed functionality, in our case DMA map and unmap.
>> The user should not care if it is IOMMU, Mellanox memory registration
>> through verbs or NXP special mapping.
>>
>> The sys admin should set/unset the IOMMU as a general means of protection.
>> And this of course will also work w/ Mellanox devices.
>>
>> > If there is IOMMU hardware, the same mechanism should always be used,
>> > leaving the IOMMU hw-specific implementation to deal with the details.
>> > If a NIC is IOMMU-able, that should not be supported by specific vendor
>> > drivers but through a generic solution like VFIO, which will validate a
>> > device with such capability and perform the required actions for that
>> > case. VFIO and IOMMU should be modified as needed to support this
>> > requirement instead of leaving vendor drivers to implement their own
>> > solution.
>>
>> Again - I am against forcing every PCI device to use VFIO, and I don't
>> think the IOMMU as a HW device should control other PCI devices.
>> I see nothing wrong with a device which also has extra capabilities for
>> memory translation and adds another level of security to the user
>> application.
>>
>>
> In a system with untrusted components using the device, a generic way of
> properly configuring the system with the right protections should be used
> instead of relying on a specific vendor implementation.
>
>
>> >
>> > In any case, I think this support should be in a different patchset than
>> > the private user space mappings.
>> >
>> >
>
> >
>> > > The work in this patch moves the DMA mapping to vendor-agnostic APIs.
>> > > New map and unmap ops were added to the rte_bus structure.
>> > > Those are currently implemented only for the PCI bus. The
>> > > implementation takes the driver map and unmap implementation as a
>> > > bypass to the VFIO mapping.
>> > > That is, in case there is no specific map/unmap from the PCI driver,
>> > > VFIO mapping, if possible, will be used.
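(To make that bypass concrete, a rough sketch of what the PCI bus map op
amounts to; structure and field names are approximated from the description
above rather than taken literally from the patch.)

#include <stdint.h>
#include <errno.h>
#include <rte_bus_pci.h>   /* struct rte_pci_device */
#include <rte_vfio.h>      /* rte_vfio_dma_map() */

/* Sketch only: prefer a driver-provided dma_map (e.g. mlx5 memory
 * registration through verbs), otherwise fall back to generic VFIO
 * mapping when the device is bound to vfio-pci. */
static int
pci_bus_dma_map_sketch(struct rte_pci_device *pdev, void *addr,
		       uint64_t iova, size_t len)
{
	if (pdev->driver != NULL && pdev->driver->dma_map != NULL)
		return pdev->driver->dma_map(pdev, addr, iova, len);

	if (pdev->kdrv == RTE_KDRV_VFIO)
		return rte_vfio_dma_map((uintptr_t)addr, iova, len);

	return -ENOTSUP;   /* no way to map for this device */
}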
>> > >
>> > > Application use with those APIs is quite simple:
>> > > * allocate memory
>> > > * take a device, and query its rte_device.
>> > > * call the bus map function for this device.
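(A minimal sketch of that flow for an ethdev port, assuming the per-device
wrapper this series eventually became, rte_dev_dma_map() in later releases;
the exact entry point in this version of the series may differ. IOVA-as-VA
is assumed for brevity.)

#include <stdint.h>
#include <rte_ethdev.h>   /* rte_eth_dev_info_get() */
#include <rte_dev.h>      /* rte_dev_dma_map() in later DPDK releases */

static int
map_buf_to_port(uint16_t port_id, void *buf, size_t len)
{
	struct rte_eth_dev_info dev_info;

	/* Take a device and query its rte_device handle. */
	rte_eth_dev_info_get(port_id, &dev_info);

	/* Call the bus-backed map function for this device only. */
	return rte_dev_dma_map(dev_info.device, buf,
			       (uint64_t)(uintptr_t)buf, len);
}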
>> > >
>> > > Future work will deprecate the rte_vfio_dma_map and rte_vfio_dma_unmap
>> > > APIs, leaving the PCI device APIs as the preferred option for the user.
>> > >
>> > > [1] https://patches.dpdk.org/patch/47796/
>> > >
>> > > Shahaf Shuler (6):
>> > >   vfio: allow DMA map of memory for the default vfio fd
>> > >   vfio: don't fail to DMA map if memory is already mapped
>> > >   bus: introduce DMA memory mapping for external memory
>> > >   net/mlx5: refactor external memory registration
>> > >   net/mlx5: support PCI device DMA map and unmap
>> > >   doc: deprecate VFIO DMA map APIs
>> > >
>> > >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>> > >  doc/guides/rel_notes/deprecation.rst            |   4 +
>> > >  drivers/bus/pci/pci_common.c                    |  78 +++++++
>> > >  drivers/bus/pci/rte_bus_pci.h                   |  14 ++
>> > >  drivers/net/mlx5/mlx5.c                         |   2 +
>> > >  drivers/net/mlx5/mlx5_mr.c                      | 232 ++++++++++++++++---
>> > >  drivers/net/mlx5/mlx5_rxtx.h                    |   5 +
>> > >  lib/librte_eal/common/eal_common_bus.c          |  22 ++
>> > >  lib/librte_eal/common/include/rte_bus.h         |  57 +++++
>> > >  lib/librte_eal/common/include/rte_vfio.h        |  12 +-
>> > >  lib/librte_eal/linuxapp/eal/eal_vfio.c          |  26 ++-
>> > >  lib/librte_eal/rte_eal_version.map              |   2 +
>> > >  12 files changed, 418 insertions(+), 38 deletions(-)
>> > >
>> > > --
>> > > 2.12.0
>> > >
>> > >
>>
>

