[dpdk-dev] dmadev discussion summary

Jerin Jacob jerinjacobk at gmail.com
Thu Jul 1 17:01:00 CEST 2021


On Sat, Jun 26, 2021 at 9:29 AM fengchengwen <fengchengwen at huawei.com> wrote:
>
> Hi, all
>   I analyzed the current DPDK DMA drivers and drew up this summary in
> conjunction with the previous discussion; it will serve as a basis for the V2
> implementation.
>   Feedback is welcome, thanks

Thanks for the write-up.

>
> dpaa2_qdma:
>   [probe]: mainly obtains the number of hardware queues.
>   [dev_configure]: has following parameters:
>       max_hw_queues_per_core:
>       max_vqs: max number of virt-queue
>       fle_queue_pool_cnt: the size of FLE pool
>   [queue_setup]: sets up one virt-queue, has the following parameters:
>       lcore_id:
>       flags: some control params, e.g. sg-list, long-format desc, exclusive HW
>              queue...
>       rbp: some misc fields which impact the descriptor
>       Note: this API returns the index of the virt-queue which was successfully
>             set up.
>   [enqueue_bufs]: data-plane API, the key fields:
>       vq_id: the index of the virt-queue
>       job: the pointer to the job array
>       nb_jobs:
>       Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields;
>             the flag field indicates whether src/dst are PHY addrs.
>   [dequeue_bufs]: gets the completed jobs' pointers
>
>   [key point]:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                       core/rawdev
>       1) In the probe stage, the driver reports how many HW-queues can be
>          used.
>       2) User could specify the maximum number of HW-queues managed by a single
>          core in the dev_configure stage.
>       3) User could create one virt-queue through the queue_setup API; the
>          virt-queue has two types: a) exclusive HW-queue, b) shared HW-queue
>          (as described above), selected by the corresponding bit of the flags
>          field.
>       4) In this mode, queue management is simplified. Users do not need to
>          specify which HW-queue to apply for and create a virt-queue on that
>          HW-queue. All they need to do is say on which core they want to
>          create a virt-queue.
>       5) Virt-queues could have different capabilities, e.g. virt-queue-0
>          supports the scatter-gather format while virt-queue-1 does not; this
>          is controlled by the flags and rbp fields at the queue_setup stage.
>       6) The data-plane API uses definitions similar to rte_mbuf and
>          rte_eth_rx/tx_burst().
>       PS: I still don't understand how sg-list enqueue/dequeue works, and how
>           users should use RTE_QDMA_VQ_NO_RESPONSE.
>
>       Overall, I think it's a flexible and scalable design. The queue resource
>       pool architecture in particular simplifies user invocations, although
>       the 'core' parameter is introduced a bit abruptly.
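
As a reading aid, here is a minimal sketch of the enqueue/dequeue path described
above. The job layout and the two prototypes are reconstructed from the field
lists in this summary, not copied from the dpaa2_qdma headers, so treat all
names as assumptions:

    #include <stdint.h>

    /* Sketch of a job as described above: src/dest/len/flag/cnxt/status/
     * vq_id/use_elem; the real rte_qdma_job layout may differ. */
    struct qdma_job_sketch {
        uint64_t src;       /* source address; flag says whether PHY or virtual */
        uint64_t dest;      /* destination address */
        uint32_t len;       /* copy length in bytes */
        uint32_t flag;      /* control bits, e.g. PHY-addr indication */
        uint64_t cnxt;      /* user context returned on completion */
        uint16_t status;    /* written back on completion */
        uint16_t vq_id;     /* virt-queue this job belongs to */
        uint8_t  use_elem;  /* sg-list element usage indicator */
    };

    /* Hypothetical prototypes matching the key fields described above:
     * enqueue a burst of jobs on one virt-queue, later fetch the pointers
     * of the completed jobs. */
    int enqueue_bufs(int dev_id, uint16_t vq_id,
                     struct qdma_job_sketch **jobs, uint16_t nb_jobs);
    int dequeue_bufs(int dev_id, uint16_t vq_id,
                     struct qdma_job_sketch **jobs, uint16_t nb_jobs);
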
>
>
> octeontx2_dma:
>   [dev_configure]: has one parameter:
>       chunk_pool: it's strange that it's not managed internally by the driver
>                   but passed in through the API.
>   [enqueue_bufs]: has three important parameters:
>       context: this is what Jerin referred to as a 'channel'; it could hold the
>                completion ring of the jobs.
>       buffers: holds the pointer array of dpi_dma_buf_ptr_s
>       count: how many dpi_dma_buf_ptr_s
>       Note: one dpi_dma_buf_ptr_s may have many src and dst pairs (it's a
>             scatter-gather list), and has one completed_ptr (when the HW
>             completes, it will write a value through this ptr); the current
>             completed_ptr struct:
>                 struct dpi_dma_req_compl_s {
>                     uint64_t cdata;  -- driver inits this and HW updates the result to it.
>                     void (*compl_cb)(void *dev, void *arg);
>                     void *cb_data;
>                 };
>   [dequeue_bufs]: has two important parameters:
>       context: the driver will scan its completion ring to get completion info.
>       buffers: holds the pointer array of completed_ptr.
>
>   [key point]:
>       -----------    -----------
>       | channel |    | channel |
>       -----------    -----------
>              \           /
>               \         /
>                \       /
>              ------------
>              | HW-queue |
>              ------------
>                    |
>                 --------
>                 |rawdev|
>                 --------
>       1) Users could create one channel by initializing a context
>          (dpi_dma_queue_ctx_s); this interface is not standardized and needs to
>          be implemented by users.
>       2) Different channels can support different transfer types, e.g. one for
>          inner m2m and another for inbound copy.
>
>       Overall, I think the 'channel' is similar to the 'virt-queue' of
>       dpaa2_qdma. The difference is that dpaa2_qdma supports multiple hardware
>       queues. The 'channel' has the following properties:

If dpaa2_qdma supports more than one HW queue, I think it is good to have the
queue notion in DPDK just like other DPDK device classes. It would be good to
have confirmation from the dpaa2 folks, @Hemant Agrawal, on whether there
really is more than 1 HW queue in the dpaa2 device.


IMO, channel is a better name than virtual queue. The reason is that virtual
queue is a more implementation-specific notion; there is no need to have it in
the API specification.


>       1) A channel is an operable unit at the user level. Users can create a
>          channel for each transfer type, for example, a local-to-local channel
>          and a local-to-host channel. Users could also get the completion
>          status of one channel.
>       2) Multiple channels can run on the same HW-queue. In terms of API
>          design, this reduces the number of data-plane API parameters. The
>          channel could hold context info which is referred to when the
>          data-plane APIs execute.
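
To make the completion model above concrete, a minimal polling sketch against
the dpi_dma_req_compl_s layout quoted earlier; the 'pending' sentinel value and
the plain read are assumptions for the sketch (a real driver initializes cdata
itself and would use a volatile/atomic read):

    #include <stdint.h>
    #include <stddef.h>

    struct dpi_dma_req_compl_s {
        uint64_t cdata;                         /* driver inits, HW writes the result */
        void (*compl_cb)(void *dev, void *arg); /* optional per-request callback */
        void *cb_data;
    };

    #define COMPL_PENDING 0xFFUL    /* assumed "still in flight" marker */

    /* Returns 1 once the HW has written a result for this request. */
    static int poll_compl(struct dpi_dma_req_compl_s *c)
    {
        if (c->cdata == COMPL_PENDING)
            return 0;                          /* HW has not completed it yet */
        if (c->compl_cb != NULL)
            c->compl_cb(NULL, c->cb_data);     /* hand the result to the user */
        return 1;
    }
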
>
>
> ioat:
>   [probe]: creates multiple rawdevs if it's a DSA device with multiple
>            HW-queues.
>   [dev_configure]: has three parameters:
>       ring_size: the size of the HW descriptor ring.
>       hdls_disable: whether to ignore user-supplied handle params
>       no_prefetch_completions:
>   [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl parameters.
>   [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
>                             src_hdls/dst_hdls parameters.
>
>   Overall, it's one rawdev per HW-queue, and it doesn't have multiple
>   'channels' like octeontx2_dma does.
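
For comparison, a rough usage sketch of the ioat rawdev calls listed above. The
parameter order follows the lists in this summary and rte_ioat_perform_ops()
(the doorbell call) is added from memory; check rte_ioat_rawdev.h for the exact
signatures:

    #include <rte_ioat_rawdev.h>
    #include <rte_mbuf.h>

    /* sketch: one copy submission followed by a completion poll */
    static void
    ioat_copy_example(int dev_id, rte_iova_t src, rte_iova_t dst,
                      unsigned int length, struct rte_mbuf *m_src,
                      struct rte_mbuf *m_dst)
    {
        uint32_t status[32];
        uint8_t num_unsuccessful = 0;
        uintptr_t src_hdls[32], dst_hdls[32];

        /* enqueue one copy, carrying the mbuf pointers as completion handles */
        if (rte_ioat_enqueue_copy(dev_id, src, dst, length,
                                  (uintptr_t)m_src, (uintptr_t)m_dst) == 0)
            return;                    /* ring full, nothing enqueued */
        rte_ioat_perform_ops(dev_id);  /* ring the doorbell */

        /* later: reap up to 32 completions and recover the handles */
        rte_ioat_completed_ops(dev_id, 32, status, &num_unsuccessful,
                               src_hdls, dst_hdls);
    }
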
>
>
> Kunpeng_dma:
>   1) The hardware supports multiple modes (e.g. local-to-local/local-to-pciehost/
>      pciehost-to-local/immediate-to-local copy).
>      Note: Currently, we only implement local-to-local copy.
>   2) The hardware supports multiple HW-queues.
>
>
> Summary:
>   1) dpaa2/octeontx2/Kunpeng are all ARM SoCs which may act as endpoints of an
>      x86 host (e.g. a smart NIC); multiple memory transfer requirements may
>      exist, e.g. local-to-local/local-to-host, etc. From the point of view of
>      API design, I think we should adopt a similar 'channel' or 'virt-queue'
>      concept.

+1 for analysis.

>   2) Whether to create a separate dmadev for each HW-queue? We previously
>      discussed this, and because HW-queues can be managed independently (like
>      Kunpeng_dma and Intel DSA), we preferred creating a separate dmadev for
>      each HW-queue. But I'm not sure if that's the case with dpaa. I think
>      that can be left to the specific driver; no restriction is imposed at the
>      framework API layer.

+1

>   3) I think we could setup following abstraction at dmadev device:
>       ------------    ------------
>       |virt-queue|    |virt-queue|
>       ------------    ------------
>              \           /
>               \         /
>                \       /
>              ------------     ------------
>              | HW-queue |     | HW-queue |
>              ------------     ------------
>                     \            /
>                      \          /
>                       \        /
>                         dmadev

Other than the name (virt-queue vs channel), +1.

>   4) The driver's ops design (here we only list key points):
>      [dev_info_get]: mainly returns the number of HW-queues

The max number of channels/virt-queues too.

>      [dev_configure]: nothing important

It needs to say how many channels/virt-queues are to be configured for this
device.
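
Something like the sketch below, purely to make the point concrete; all
structure and field names here are hypothetical, not an agreed proposal:

    /* filled in by the driver via dev_info_get() */
    struct rte_dmadev_info {
        uint16_t max_hw_queues;  /* number of HW-queues behind this dmadev */
        uint16_t max_vqs;        /* max channels/virt-queues it can host */
        uint64_t dev_capa;       /* capability flags */
    };

    /* passed by the application to dev_configure() */
    struct rte_dmadev_conf {
        uint16_t nb_vqs;         /* channels/virt-queues to set up on this device */
    };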

>      [queue_setup]: creates one virt-queue, has the following main parameters:
>          HW-queue-index: the HW-queue index used
>          nb_desc: the number of HW descriptors
>          opaque: driver-specific info
>          Note1: this API returns the virt-queue index which will be used in
>                 later APIs. If users want to create multiple virt-queues on the
>                 same HW-queue, they can achieve this by calling queue_setup
>                 with the same HW-queue-index.
>          Note2: I think it's hard to define the queue_setup config parameters,
>                 and since this is a control API, I think it's OK to use an
>                 opaque pointer to implement it.
>       [dma_copy/memset/sg]: all have a vq_id input parameter.
>          Note: I notice dpaa can't support single and sg in one virt-queue, and
>                I think it's maybe a software implementation policy rather than
>                a HW restriction, because virt-queues can share the same
>                HW-queue.
>       Here we use vq_id to tackle different scenarios, like local-to-local/
>       local-to-host, etc.

IMO, the index representation has an additional overhead, as one needs to
translate it to a memory pointer. I prefer to avoid this by having an object
handle and using a _lookup() API to make it work in multi-process cases,
avoiding the additional indirection. Like the mempool object.
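
A sketch of what the handle-based alternative could look like; every name below
is hypothetical and only mirrors the rte_mempool create/lookup pattern:

    /* primary process: create the channel and give it a name */
    struct rte_dmadev_chan *ch =
        rte_dmadev_chan_create("dma_chan_l2h", dev_id, &chan_conf);

    /* secondary process: resolve the same object by name */
    struct rte_dmadev_chan *ch2 = rte_dmadev_chan_lookup("dma_chan_l2h");

    /* fast path works on the pointer directly, so there is no per-call
     * index-to-pointer translation */
    cookie = rte_dmadev_memcpy_chan(ch, src, dst, len, flags);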

>   5) And the dmadev public data-plane API (just prototype):
>      dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
>        -- flags: used as an extended parameter; it could be a uint32_t
>      dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)
>      dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
>        -- sg: struct dma_scatterlist array
>      uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
>                                    uint16_t nb_cpls, bool *has_error)
>        -- nb_cpls: indicates the max number of operations to process
>        -- has_error: indicates if there is an error
>        -- return value: the number of successfully completed operations.
>        -- examples:
>           1) If there are already 32 completed ops and the 4th is an error, and
>              nb_cpls is 32, then the return will be 3 (because the 1st/2nd/3rd
>              are OK), and has_error will be true.
>           2) If there are already 32 completed ops and all completed
>              successfully, then the return will be min(32, nb_cpls), and
>              has_error will be false.
>           3) If there are already 32 completed ops and all of them failed,
>              then the return will be 0, and has_error will be true.

+1. IMO, it is better to call it ring_idx instead of a cookie, to enforce that
it is the ring index.
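
For illustration, a small drain loop built on the prototype above;
release_buffers() and handle_failed_op() are application placeholders, not part
of the proposal:

    bool has_error = false;
    dma_cookie_t cookie;
    uint16_t n;

    do {
        /* returns how many ops completed successfully; it stops at the
         * first failed op and sets has_error (cases 1-3 above) */
        n = rte_dmadev_completed(dev, vq_id, &cookie, 32, &has_error);
        release_buffers(n);             /* app-specific bookkeeping */
    } while (n == 32 && !has_error);

    if (has_error)
        handle_failed_op(dev, vq_id);   /* e.g. query rte_dmadev_completed_status() */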

>      uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
>                                           uint16_t nb_status, uint32_t *status)
>        -- return value: the number of failed completed operations.

See above. Here we are assuming it is an index; otherwise we would need to pass
an array of cookies.

>      And here I agree with Morten: we should design an API which adapts to DPDK
>      service scenarios. So we don't support some sound-card DMA, or 2D memory
>      copy which is mainly used in video scenarios.
>   6) The dma_cookie_t is a signed int type; when <0 it means error. It is
>      monotonically increasing based on the HW-queue (rather than the
>      virt-queue). The driver needs to ensure this because the dmadev framework
>      doesn't manage the dma_cookie's creation.

+1 and see above.

>   7) Because the data-plane APIs are not thread-safe, and users can determine
>      the virt-queue to HW-queue mapping (at the queue-setup stage), it is the
>      user's duty to ensure thread safety.

+1. But I am not sure how easy it is for the fast-path application to have this
logic. Instead, I think it is better for the driver to report the capability
per queue, and in the channel configuration the application can request what it
requires (whether multiple producers enqueue to the same HW queue or not).
Based on the request, the implementation can pick the correct function pointer
for enqueue (lock vs lockless version if the HW does not support lockless).
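
A sketch of what that selection could look like inside the library; the flag
and field names are invented for illustration only:

    /* hypothetical capability/request bits */
    #define DMADEV_CAPA_MT_VQ  (1u << 0)  /* HW queue tolerates multiple producers */
    #define DMADEV_REQ_MT_VQ   (1u << 0)  /* app wants multi-producer enqueue */

    if (chan_conf->flags & DMADEV_REQ_MT_VQ) {
        if (info->dev_capa & DMADEV_CAPA_MT_VQ)
            dev->enqueue = enqueue_lockless;  /* HW handles the concurrency */
        else
            dev->enqueue = enqueue_locked;    /* wrap enqueue in a lock */
    } else {
        dev->enqueue = enqueue_single;        /* single producer, no lock needed */
    }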


>   8) One example:
>      vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
>      if (vq_id < 0) {
>         // create virt-queue failed
>         return;
>      }
>      // submit memcpy task
>      cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
>      if (cookie < 0) {
>         // submit failed
>         return;
>      }
>      // get completed task
>      ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
>      if (!has_error && ret == 1) {
>         // the memcpy completed successfully
>      }

+1

>   9) As octeontx2_dma supports an sg-list which has many valid buffers in
>      dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.

+1
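
For example, something along these lines; the dma_scatterlist layout is not
defined in this thread, so the fields here are only a guess:

    /* guessed layout: one src/dst/length triple per segment */
    struct dma_scatterlist {
        void    *src;
        void    *dst;
        uint32_t length;
    };

    struct dma_scatterlist sg[4];
    /* ... fill the 4 segments from the dpi_dma_buf_ptr_s pointers ... */
    cookie = rte_dmadev_memcpy_sg(dev, vq_id, sg, 4, flags);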

>   10) As for ioat, it could declare support for one HW-queue at the
>       dev_configure stage and only support creating one virt-queue.
>   11) As for dpaa2_qdma, I think it could migrate to the new framework, but we
>       are still waiting for feedback from the dpaa2_qdma guys.
>   12) About the prototype src/dst parameters of the rte_dmadev_memcpy API, we
>       have two candidates, iova and void *; how about introducing a dma_addr_t
>       type which could be a va or an iova?

I think the conversion looks ugly; it is better to have void * and share the
constraints of void * as a limitation/capability using a flag, so that the
driver can update it.
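
For instance (capability names invented for the example): the driver advertises
which address space it accepts and the application passes a plain void *
accordingly, converting only when the device needs an IOVA:

    /* hypothetical capability bits reported in dev_info */
    #define DMADEV_CAPA_ADDR_VA    (1u << 1)   /* device accepts virtual addresses */
    #define DMADEV_CAPA_ADDR_IOVA  (1u << 2)   /* device needs IOVA/physical */

    void *src_addr, *dst_addr;
    if (info.dev_capa & DMADEV_CAPA_ADDR_VA) {
        src_addr = src_buf;                              /* pass the VA directly */
        dst_addr = dst_buf;
    } else {
        src_addr = (void *)rte_mem_virt2iova(src_buf);   /* convert up front */
        dst_addr = (void *)rte_mem_virt2iova(dst_buf);
    }
    cookie = rte_dmadev_memcpy(dev, vq_id, src_addr, dst_addr, len, flags);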


>

