[dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory

Burakov, Anatoly anatoly.burakov at intel.com
Thu Nov 22 11:41:57 CET 2018


On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
> Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
>> external memory
>>
>> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
>>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>>>> memory
>>>>
>>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>>
>>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>> external memory
>>>>>>>>
>>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>>>> Hi Anatoly,
>>>>>>>>>
>>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>>>> external memory
>>>>>>>>>>
>>>
>>> [...]
> 
> [...]
> 
>>> I think it was broken in some way when you did the first work on the new
>> memory management.
>>> Because after this work, PMDs were adjusted to look for the memory only
>> in the fbarrays, and such external memory doesn't exists there.
>>> otherwise, if working, why you added the support for the external
>> memory? Which other use case you tried to cover?
>>>
>>> It is fixed in some way with the new work to support external memory, but
>> requires changes in the application (the "overhead" I referred before).
>>
>> I feel we have a difference of terminology here.
>>
>> What i refer to as "external memory support" is the feature that was added
>> in 18.11, that has to do with creating heaps etc. - this memory is now
>> (hopefully) properly supported across entire DPDK.
>>
>> What you're probably referring to as "external memory support" is not that -
>> it's the feature of mempools, rings etc. that allow you to use memory that's
>> not registered with and not managed by DPDK. This is the way "external
>> memory" was done prior to 18.11.
> 
> I thought the 18.11 feature should serve both cases.
> To use memory not managed by DPDK user should create heap and populate it w/ the memory, so that it can use it later for mempool/rings.

The feature introduced in 18.11 expects you to use DPDK allocation 
methods with added memory. It is not a supported use case to create an 
external memory area, register it with heap *and* use manual memory 
management on that area (e.g. populate mempool anonymously).

Perhaps there should be a new API, something like "register_memory", 
where we only register memory, but do not expect the user to use DPDK 
allocator with said memory. I wish i would've thought of that before, 
where have you been all this time! :D I think this should feature 
definitely go into 19.02. Are we past the v1 deadline?

> 
>>
>> The above scenario (working with anonymous memory in mempools) is still
>> "broken" in this sense - and i see no general way to fix it. It could work with
>> mempools in that we could automatically create new heaps behind the
>> scenes and populate them with new memory, but for other data structures
>> (e.g. rte_ring_init() over anonymous memory), it's just not possible to do
>> that.
>>
>> I personally would prefer these API's to just go away and people started
>> using the "official" way to manage external memory. Removing the API's may
>> even be feasible for everything bar mempool, and mempool support we can
>> manage more-or-less automatically in the above way (by automatically
>> creating an external heap behind the scenes). But then by the time we're
>> creating/adding external memory heaps inside mempool API's, you don't
>> really need the external memory DMA mapping API calls because the
>> mempool would do that automatically by way of using the external heaps
>> API.
> 
> Yes, one way to go is to map the memory internally inside the mempool creation. But it will not work for all cases.
> For example, if application wants to work w/ mbuf with external buffer (EXT_ATTACHED_MBUF), the buffer will be attached on a different call (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
> Maybe we will find more cases in the future. i would prefer to solve it for all cases.  See more below.
> 
>>
>> In other words, the problem here isn't so much usability as it is the fact that,
>> for some stuff, a lot of internal code now relies on DPDK's internal page
>> tables, and we have to either make allowances for that on a case-by-case
>> basis (by the way, no one came to me with these issues!),
> 
> One option is case by case, other can be to allow the application to explicit map this memory to the different devices.
> 
> or admit that using
>> unregistered memory is a bad idea and let's not do that in the future.
> 
> I think it is un-avoidable (to use unregistered memory). The use case exists, there are applications which prefer to have their own memory management.
> For those apps, we need to have a solution.
> 
>>
>>>
>>>>
>>>> On the other hand, if you were to use it in a different way - for
>>>> example, allocating hash tables or other DPDK data structures - then
>>>> such a feature is essential.
>>>
>>> Hash tables that the NIC needs to read/write to/from?
>>> I didn't get the point here.
>>
>> The point wasn't hash tables, but support for use cases we might not have
>> thought of. That includes doing strange things like allocating DPDK data
>> structures in non-standard memory locations.
>>
>>>
>>> The entire point was to allow using external memory with
>>>> semantics identical to how you use the rest of DPDK.
>>>
>>> I understand this design choice, and it has the benefit of consistency v.s.
>> more API call on the application side.
>>>
>>>>
>>>> Also, whether it's "intuitive" depends on perspective - you say "i
>>>> expect to allocate memory and map it for DMA myself", i say "why do i
>>>> have to care about DMA mapping, DPDK should do this for me" :)
>>>
>>> Right 😊. We should be driven by customers and not vendors, too bad there
>> is no much feedback from them on the community.
>>
>> That implies current design *wasn't* driven by customers, which is not true.
>> They may not be the same as ones you are aware of - but then this isn't really
>> my fault :)
>>
>>> I know from VPP side they prefer not to use the rte_socket rather to
>> manage the memory by them self, and they are perfectly fine with mapping
>> it explicitly. This is true for other company that has vswitch solution (which I
>> cannot mention by name).
>>
>> Well, it's not like i can do anything if no one from either communities comes
>> here and engages in the discussion. I wish i could read people's minds, but
>> alas :)
>>
>>>
>>>
>>> If you are using your
>>>> own memory management and doing everything manually - you get to
>> map
>>>> everything for DMA manually. If you are using DPDK facilities
>>>> - it is intuitive that for any DPDK-managed memory, internal or
>>>> external, same rules apply across the board - DMA mapping included.
>>>
>>> So if understand you point here, your claim "we need both". One for users
>> which prefer the DPDK to do that for them and the other for the users who
>> wants more control.
>>> However, when keeping the API of explicit mapping, we can no longer say
>> that external memory is behaves exactly the same as internal.
>>
>> Well, more or less that, yes. It would be good to have full support for
>> "external" memory (in the "unregistered with DPDK" sense, as outlined
>> above).
>>
>>>
>>> So going back to my RFC 😊, if we decide to keep both approach, what I
>> suggest is to change the VFIO mapping one to be more generic and per
>> device (can be rte_device).
>>> I think the biggest questions we need to answer are 1. are we OK with
>>> generic, vendor agnostic API?
>>
>> Yes, generic and vendor-agnostic FTW!
>>
>>> 2.  are we OK that the explicit mapping is done per device?
>>
>> This question is slightly misleading, because that's not *all* there is to it. In
>> general, yes, i'm OK with the idea. In practice, VFIO may get in the way.
>>
>> VFIO works by mapping memory for a VFIO container. The container can
>> have one or more devices in it, and how many devices it can have depends
>> on the IOMMU setup (i.e. it may not be possible to have one device per
>> container on some setups, which is why currently everything is done with
>> one container).
> 
> 
>   So, "per-device" means we either keep each device in
>> separate container (which will not work in some cases),
> or we somehow
>> work around the fact that mappings can be performed only once (i'm not
>> sure if VFIO will return appropriate error codes that would allow us to
>> distinguish failed mappings due to duplicate, or failed mappings due to other
>> reasons - i'll have to research this), and keep all devices in the same
>> container.
> 
> Yes I see here 2 options:
> 1. the VFIO has good error reports that enables the user to understand mapping failed because it is already exists.
> 2. to hold database which will say which IOVA were already mapped so the DPDK vfio mapping will know to avoid it.

Option 2 is difficult to support with secondary processes. Secondary 
processes also use the same container and thus should have the same view 
of what is mapped and what isn't. Given that VFIO mapping may happen 
during memory allocation, we won't be able to allocate more memory to 
hold extra mappings, so shared memory won't work here. We could use 
fbarray for this, but that would make the amount of mappings fixed.

> 
>>
>> In other words, it may be possible to solve this issue for VFIO, but i would not
>> be comfortable to proceed without knowing that this can be done in a sane
>> way.
> 
> +1.
> 
> I think the simplest is #1 above. How can we get an answer on it?

A few experiments should do it :) I'll get back to you on that.

> 
>>
>> There's also a question of automatic vs. manual mapping. The way i see it,
>> the choices are:
>>
>> 1) drop automatic registration for DMA altogether, for both internal and
>> external memory
>> 2) drop automatic registration for DMA for external memory only, but keep
>> automatic DMA registration for internal memory
>> 3) do automatic DMA registration for both
>>
>> I don't like the second choice for API consistency reasons, so i would rather
>> go either for 1) (and cause massive breakage of all existing applications along
>> the way...), or 3) (the way it's working right now).
>> The pragmatic choice *may* be 2), which is what i feel you are advocating
>> for, but it has its own problems, mainly the mechanics of it.
>> If we can get them worked out, i may just bite the bullet and ack the patches
>> 😃
> 
> I don't think we need to do such extensive changes.
> 
> I think it is ok to have 2 ways for application to work (the heaps and the explicit DMA call). After all, DPDK is a development kit and we should make application life easier for different use cases.
> If the application wants the DPDK to help, it can create the heap and everything will just happen. Otherwise application will do the memory management by itself, and use the DMA call explicitly.

In other words, what you are proposing is option 3), with an addition of 
support for manual memory/DMA mapping management by way of the new API 
proposed above (register_memory() call). Am i understanding you correctly?

> 
>>
>> Specifically, let's suppose we want automatic memory registration for DMA
>> on internal but not external memory. How does it happen? Currently, it
>> happens via mem event callbacks.
>>
>> If we were to drop the callback approach for external memory, we would
>> lose out on additional functionality provided by said callbacks, and i would
>> really like to avoid that.
>>
>> If we were to switch to manual mapping, that means EAL will depend on
>> ethdev (you have suggested rte_device as an alternative - would that fix the
>> issue?)
> 
> No it won't.
> It will help only for the cases were we have multiple devices on the same PCI address (like representors).

So, should it be an rte_device call then?

> 
> , and we will have to rewrite current DMA mapping code to do
>> mappings explicitly, rather than relying on callbacks mechanism. So, instead
>> of e.g. VFIO mapping being performed via callback, we would instead iterate
>> over each rte_device and call its respective DMA mapping function explicitly -
>> for internal memory, but not for external. This
>> *may* work, but i need to think it through a bit more :)
>>
>> This also still leaves a question of external unregistered memory - this is a
>> wholly separate issue.
>>
>> TL;DR i'm tentatively OK with the proposal, but we need to work certain
>> things out before we go too far along the path and find out that we've
>> reached a dead end.
> 
> Fully agree, this is why the RFC is for.
> I think our biggest question is whether or not VFIO will behave OK with double mapping or we need some tracking mechanism in DPDK.

Yes, i will check this for VFIO and get back to you.

> 
>>
>> --
>> Thanks,
>> Anatoly


-- 
Thanks,
Anatoly


More information about the dev mailing list