[dpdk-dev] [dpdk-dev, 2/3] eth_dev: add support for device dma mask

Alejandro Lucero alejandro.lucero at netronome.com
Fri May 13 10:38:58 CEST 2016


On Thu, May 12, 2016 at 4:41 PM, Jan Viktorin <viktorin at rehivetech.com>
wrote:

> Hi,
>
> Just a note, please, when replying inline, do not prepend ">" before your
> new text. I could not find your replies.
>
>
My Gmail interface does not show that prepended character... It seems I have
to leave a blank line before my replies.



> See below...
>
> On Thu, 12 May 2016 16:03:14 +0100
> Alejandro Lucero <alejandro.lucero at netronome.com> wrote:
>
> > Hi Jan
> >
> > On Thu, May 12, 2016 at 3:52 PM, Jan Viktorin <viktorin at rehivetech.com>
> > wrote:
> >
> > > Hello Alejandro,
> > >
> > > On Thu, 12 May 2016 15:33:59 +0100
> > > "Alejandro.Lucero" <alejandro.lucero at netronome.com> wrote:
> > >
> > > > - New dma_mask field in rte_eth_dev_data.
> > > >  - If PMD sets device dma_mask, call to check hugepages within
> > >
> > > I think that one of the purposes of DPDK is to support DMA transfers
> > > in userspace. In other words, I can see no reason to support dma_mask
> > > at the ethdev level only.
> > >
> > The limitation is a device limitation so I cannot see a better place for
> > adding the device dma mask.
>
> That's what I meant. It is a _device_ limitation. The ethdev is a wrapper
> around the rte_pci_device. The ethdev just extends the semantics of the
> generic device.
> However, all DPDK devices are expected to be DMA-capable.
>
> If you get a pointer to the ethdev, you get a pointer to the rte_pci_device
> as well (one more level of dereference, but we are not on the fast path here,
> so it's unimportant).
>
> Consider the cryptodev. If a cryptodev has some DMA mask requirements, we can
> support it in the generic place, i.e. rte_pci_device and not rte_ethdev,
> because a cryptodev is not an ethdev.
>
>
Ok. I was wrongly assuming we only had ethdevs, with ethdev being the generic
type and rte_pci_device being a kind of ethdev.

I can add the dma mask to rte_pci_device. The extra level of dereference is
not a problem as long as we do not use that dma mask for a more complex
allocation API (more about this later).

If I understand it right, work is in progress on adding a generic rte_device.
I cannot see a problem with adding the dma mask to the rte_device struct
either.
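
A minimal sketch of what this could look like, with the mask on the generic
PCI device instead of rte_eth_dev_data; the dma_mask field and the helper
below are illustrative only, not code that exists in the tree:

    #include <stdint.h>

    /*
     * Sketch only: dma_mask is the proposed addition. Putting it on the
     * generic PCI device means non-ethdev devices (e.g. cryptodevs) get it
     * too. The existing rte_pci_device fields are elided here.
     */
    struct rte_pci_device {
        /* ... existing fields: addr, id, mem_resource[], driver, ... */
        uint64_t dma_mask; /**< Addressable range of the device, 0 = no limit. */
    };

    /* A PMD with a 40-bit DMA engine would advertise its limit at init time: */
    static void
    example_pmd_set_dma_mask(struct rte_pci_device *pci_dev)
    {
        pci_dev->dma_mask = (1ULL << 40) - 1; /* 40 bits -> 1TB addressable */
    }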



> >
> >
> > > We should consider adding this to the generic struct rte_device
> > > (currently rte_pci_device). Thomas?
> > >
> > I guess it could be a non-pci device with such a limitation. I thought
> > rte_ethdev was more generic.
>
> When it is added to the rte_pci_device (or rte_device after the planned
> changes), the non-PCI devices get this for free... Otherwise I don't
> understand the point here.
>
> >
> >
> > > >    supported range.
> > >
> > > I think, the '-' is unnecessary at the beginning of line. As for me
> > > I prefer a fluent text describing the purpose. The '-' is useful for
> > > a real list of notes.
> > >
> > > >
> > > > Signed-off-by: Alejandro Lucero <alejandro.lucero at netronome.com>
> > > >
> > > > ---
> > > > lib/librte_ether/rte_ethdev.c | 7 +++++++
> > > >  lib/librte_ether/rte_ethdev.h | 1 +
> > > >  2 files changed, 8 insertions(+)
> > > >
> > > > diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
> > > > index a31018e..c0de88a 100644
> > > > --- a/lib/librte_ether/rte_ethdev.c
> > > > +++ b/lib/librte_ether/rte_ethdev.c
> > > > @@ -280,9 +280,16 @@ rte_eth_dev_init(struct rte_pci_driver *pci_drv,
> > > >
> > > >       /* Invoke PMD device initialization function */
> > > >       diag = (*eth_drv->eth_dev_init)(eth_dev);
> > > > +     if (diag)
> > > > +             goto err;
> > > > +
> > > > +     if (eth_dev->data->dma_mask)
> > > > +             diag = rte_eal_hugepage_check_address_mask(eth_dev->data->dma_mask);
> > > > +
> > >
> > > I don't understand what happens if the memory is out of the DMA mask.
> > > What can the driver do? Does it just fail?
> > >
> > > I think that this should be considered during a malloc instead. (Well,
> > > there is probably no suitable API for this at the moment.)
> > >
> > Hugepage memory allocation is done before device initialization. I find it
> > easier to leave the normal hugepage code as it is now and add a later call
> > if a device requires it.
>
> True. I didn't mean to change hugepage allocation but to change the
> allocation of memory for a device. If we reserve a set of hugepages, a subset
> of them can match the DMA mask for a particular device. When allocating
> memory for a device, we can consider the DMA mask and choose only the
> hugepages we can handle.
>
> Quite frankly, I am not very aware of the memory allocation internals of
> DPDK, so I could be wrong...
>
>
Of course, we could use a device dma mask more smartly and, for ethdevs,
allocate device TX and RX rings and mbufs only from mempools whose hugepages
fall within the supported range. Any memory allocation by the app not related
to packet handling could use another mempool with no dma mask restriction at
all. This implies some awareness by the app, which I guess would be preferable
to avoid.
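
As an illustration only (this helper is not an existing DPDK API), an app
dedicating a mempool to packet memory could sanity-check it against the
device mask before handing it to the PMD, for instance by looking at the
physical address of a test mbuf:

    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /*
     * Hypothetical check: draw one mbuf and verify that its bus address fits
     * under the device DMA mask. Checking a single element is of course not
     * exhaustive; this only shows the idea.
     */
    static int
    mempool_fits_dma_mask(struct rte_mempool *mp, uint64_t dma_mask)
    {
        struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
        int ok;

        if (m == NULL)
            return -1;

        ok = (m->buf_physaddr & ~dma_mask) == 0;
        rte_pktmbuf_free(m);
        return ok ? 0 : -1;
    }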

In our case, NFP PMD support, the restriction is 40 bits, which implies
potential problems on machines with more than 1TB of memory. Such machines
are not likely to be used with our card, and even on those machines a DPDK
app could work smoothly as long as the allocated hugepages are below 1TB.
Moreover, DPDK in VMs inside those machines will not have problems because
the IOMMU (assuming SR-IOV) will map guest physical addresses, which will be
(not 100% sure about this for every hypervisor) below 1TB (assuming VMs do
not get more than 1TB each, of course). So, a solution where this is handled
"roughly", with an error if an allocated hugepage is out of range, is
acceptable for us. We first thought of just checking the amount of memory in
the system inside our PMD and returning an error if there is more than 1TB,
but I prefer to raise an error only if we really need to, based on the
allocated hugepages.
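
Roughly, the check over the allocated hugepages I have in mind would walk the
reserved memory segments and fail if any of them reaches beyond what the mask
can address (the actual rte_eal_hugepage_check_address_mask() added elsewhere
in this series may differ; this is just a sketch):

    #include <rte_memory.h>

    static int
    hugepages_within_mask(uint64_t dma_mask)
    {
        const struct rte_memseg *ms = rte_eal_get_physmem_layout();
        int i;

        for (i = 0; i < RTE_MAX_MEMSEG; i++) {
            if (ms[i].addr == NULL)
                break;
            /* fail if the segment's last byte is above the mask */
            if ((ms[i].phys_addr + ms[i].len - 1) & ~dma_mask)
                return -1;
        }
        return 0;
    }

For the NFP the mask would be (1ULL << 40) - 1, so only hugepages above 1TB
would trigger the error.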

I understand this may not be the case for other devices that are more likely
to hit problems, and then a memory allocation API using the device dma mask
makes sense. We could consider this a second step in supporting device address
limitations and implement it if really needed.
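
Purely as a hypothetical prototype of that second step (nothing like this
exists today), such an API could mirror rte_memzone_reserve() but take the
device mask into account when picking memory segments:

    #include <stdint.h>
    #include <rte_memzone.h>

    /*
     * Hypothetical: reserve a zone only from memory addressable under
     * dma_mask. A PMD could then use it for its TX/RX rings, e.g.
     * rte_memzone_reserve_dma("nfp_tx_ring", size, socket_id, 0,
     *                         dev->dma_mask);
     */
    const struct rte_memzone *
    rte_memzone_reserve_dma(const char *name, size_t len, int socket_id,
                            unsigned flags, uint64_t dma_mask);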



> >
> > The only reasonable thing to do is to fail, as the amount of required
> > memory cannot be (safely) allocated.
>
> This is the easiest way and it is not bad. However, what would be a
> workaround? If a user gets some allocated memory that is out of the DMA mask,
> how can she fix it? I don't know... Is it deterministic? Would it always
> allocate memory out of the mask or only sometimes?
>
> Jan
>
> >
> >
> > > Regards
> > > Jan
> > >
> > > >       if (diag == 0)
> > > >               return 0;
> > > >
> > > > +err:
> > > >       RTE_PMD_DEBUG_TRACE("driver %s: eth_dev_init(vendor_id=0x%u device_id=0x%x) failed\n",
> > > >                       pci_drv->name,
> > > >                       (unsigned) pci_dev->id.vendor_id,
> > > > diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> > > > index 2757510..34daa92 100644
> > > > --- a/lib/librte_ether/rte_ethdev.h
> > > > +++ b/lib/librte_ether/rte_ethdev.h
> > > > @@ -1675,6 +1675,7 @@ struct rte_eth_dev_data {
> > > >       enum rte_kernel_driver kdrv;    /**< Kernel driver passthrough */
> > > >       int numa_node;  /**< NUMA node connection */
> > > >       const char *drv_name;   /**< Driver name */
> > > > +     uint64_t dma_mask; /** device supported address space range */
> > > >  };
> > > >
> > > >  /** Device supports hotplug detach */
> > >
> > >
> > >
> > > --
> > >    Jan Viktorin                  E-mail: Viktorin at RehiveTech.com
> > >    System Architect              Web:    www.RehiveTech.com
> > >    RehiveTech
> > >    Brno, Czech Republic
> > >
>
>
>
> --
>    Jan Viktorin                  E-mail: Viktorin at RehiveTech.com
>    System Architect              Web:    www.RehiveTech.com
>    RehiveTech
>    Brno, Czech Republic
>

