[dpdk-dev] [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD

Hu, Jiayu jiayu.hu at intel.com
Thu Mar 19 12:47:15 CET 2020


Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin at redhat.com>
> Sent: Thursday, March 19, 2020 5:10 PM
> To: Hu, Jiayu <jiayu.hu at intel.com>; dev at dpdk.org
> Cc: Ye, Xiaolong <xiaolong.ye at intel.com>; Wang, Zhihong
> <zhihong.wang at intel.com>
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> vhost-user PMD
> 
> Hi Jiayu,
> 
> On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> > Hi Maxime,
> >
> > Thanks for your comments. Replies are inline.
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin at redhat.com>
> >> Sent: Tuesday, March 17, 2020 5:54 PM
> >> To: Hu, Jiayu <jiayu.hu at intel.com>; dev at dpdk.org
> >> Cc: Ye, Xiaolong <xiaolong.ye at intel.com>; Wang, Zhihong
> >> <zhihong.wang at intel.com>
> >> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> >> vhost-user PMD
> >>
> >> Hi Jiayu,
> >>
> >> On 3/17/20 10:21 AM, Jiayu Hu wrote:
> >>> In vhost-user PMD's Tx operations, where data movement is heavily
> >>> involved, performing large memory copies usually takes up a major
> >>> part of CPU cycles and becomes the hot spot. To offload expensive
> >>> memory operations from the CPU, this patch set proposes to leverage
> >>> DMA engines, e.g., I/OAT, a DMA engine in Intel processors, to
> >>> accelerate large copies for vhost-user.
> >>>
> >>> Large copies are offloaded from the CPU to the DMA in an
> >>> asynchronous manner: the CPU just submits copy jobs to the DMA
> >>> without waiting for their completion. Thus, there is no CPU
> >>> intervention during data transfer; we can save precious CPU cycles
> >>> and improve the overall throughput of vhost-user PMD based
> >>> applications, like OVS. During packet transmission, large copies are
> >>> offloaded to the DMA while small copies are performed by the CPU,
> >>> due to the startup overhead associated with the DMA.
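
For illustration, the large/small copy split could look roughly like the
sketch below; the threshold value and dma_submit_copy() are hypothetical,
the latter only standing in for an asynchronous I/OAT copy submission:

/*
 * A minimal sketch of the large/small copy split described above; the
 * threshold value and dma_submit_copy() are hypothetical.
 */
#include <stddef.h>
#include <rte_memcpy.h>

#define DMA_COPY_THRESHOLD 256	/* hypothetical cut-off in bytes */

/* Hypothetical stand-in for an asynchronous I/OAT copy submission. */
static inline void
dma_submit_copy(int dma_dev_id, void *dst, const void *src, size_t len)
{
	(void)dma_dev_id;
	rte_memcpy(dst, src, len);	/* placeholder for the async enqueue */
}

/* Copy one region, choosing between the CPU and the DMA engine by size. */
static inline void
copy_region(int dma_dev_id, void *dst, const void *src, size_t len)
{
	if (len < DMA_COPY_THRESHOLD)
		/* Small copy: DMA start-up overhead outweighs the benefit. */
		rte_memcpy(dst, src, len);
	else
		/* Large copy: submit to the DMA and return without waiting. */
		dma_submit_copy(dma_dev_id, dst, src, len);
}
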
> >>>
> >>> The vhost-user PMD is designed to support various DMA engines, but
> >>> it currently only supports I/OAT devices. In addition, I/OAT
> >>> acceleration is only enabled for Tx operations of split rings. Users
> >>> can explicitly assign an I/OAT device to a queue via the 'dmas'
> >>> parameter. However, one I/OAT device can only be used by one queue,
> >>> and a queue can use one I/OAT device at a time.
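
As an illustration of the 'dmas' parameter, attaching such a vdev
programmatically could look roughly like this (the exact devargs format is
defined in the doc patch; the DMA device address below is made up):

/*
 * Example only: attaches a vhost vdev with one I/OAT device assigned to
 * txq0. The DMA device address (00:04.0) is made up; see the doc patch
 * for the exact 'dmas' devargs format.
 */
#include <rte_dev.h>

static inline int
attach_vhost_with_dma(void)
{
	return rte_dev_probe("net_vhost0,iface=/tmp/vhost.sock,queues=1,"
			"dmas=[txq0@00:04.0]");
}
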
> >>>
> >>> We measured the performance in testpmd. With 1024-byte packets,
> >>> compared with the original SW data path, the DMA-enabled vhost-user
> >>> PMD improves throughput by around 20%~30% in the VM2VM and PVP
> >>> cases. Furthermore, with larger packets, the throughput improvement
> >>> is even higher.
> >>
> >>
> >> I'm not sure it should be done like that for several reasons.
> >>
> >> First, it seems really complex for the user to get the command line
> >> right. There is no mention in the doc patch on how to bind the DMAs to
> >> the DPDK application. Are all the DMAs on the system capable of doing
> >> it?
> >
> > DMA engines in Intel CPUs are able to move data within system memory.
> > Currently we have I/OAT, and we will have DSA in the future.
> 
> OK, can you give me an example of how many I/OAT instances there are on
> a given CPU?

One Xeon Platinum 8180 CPU has 8 I/OAT instances.

> 
> >> I think it should be made transparent to the user, who should not have
> >> to specify the DMA device address in command line. The user should just
> >> pass a devarg specifying he wants to use DMAs, if available.
> >
> > What do you think of replacing the DMA address with specific DMA
> > capabilities, like "dmas=[txq0@DMACOPY]"? As I/OAT only supports data
> > movement, we can just provide a basic DMA copy capability now. But
> > when there are more DMA devices, we can add more capabilities to the
> > devargs later.
> "dmas=[txq0 at DMACOPY]" is still too complex IMHO. We should just have a
> flag to enable or not DMA (tx_dma=1 / tx_dma=0), and this would be used
> for all queues as we do for zero-copy.
> 
> >>
> >> Second, it looks too much vendor-specific. IMHO, we should have a DMA
> >> framework, so that the driver can request DMA channels based on
> >> capabilities.
> >
> > We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> > a rawdev. IMO, it would be very hard to provide a generic DMA
> > abstraction currently. In addition, the I/OAT-specific API is only
> > called inside the vhost-user PMD, so we can replace these function
> > calls when we have a DMA framework in the future; users would be
> > unaware of the change. Does that make sense to you?
> 
> Having an abstraction might be hard, but it does not seem impossible.
> Such DMA abstraction has been done in the Kernel for IOAT. For example:
> https://lore.kernel.org/patchwork/cover/56714/
> 
> >>
> >> Also, I don't think implementing ring processing in the Vhost PMD is
> >> welcome; the Vhost PMD should just be a wrapper for the Vhost
> >> library. Doing that in the Vhost PMD causes code duplication, and
> >> will be a maintenance burden in the long run.
> >>
> >> As IOAT is a kind of acceleration, why not implement it through the
> >> vDPA framework? The vDPA framework should be extended to support
> >> this kind of acceleration, which requires some CPU processing, as
> >> opposed to the full offload of ring processing that it only supports
> >> today.
> >
> > The main reason for implementing the data path in the vhost PMD is to
> > avoid impacting the SW data path in the vhost library. Even if we
> > implement it as a vDPA instance, we still have to implement the data
> > path in a new vdev PMD, as the DMA just accelerates memory copies and
> > all ring operations have to be done by the CPU. So there is still the
> > code duplication issue.
> 
> Ok, so what about:
> 
> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
> dequeue.
> 
> lib/librte_vhost/ioat.c, which would implement the dma_enqueue and
> dma_dequeue callbacks for IOAT. As it will live in the vhost lib
> directory, it will be easy to refactor the code to share as much as
> possible and so avoid code duplication.
> 
> In rte_vhost_enqueue/dequeue_burst, if the DMA callback is set, then
> call it instead of the SW datapath. It adds a few cycles, but this is
> much more sane IMHO.
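
If I understand correctly, a rough sketch of that proposal would be
something like the following (the callback type, struct fields and helpers
are hypothetical, not the actual librte_vhost definitions):

/*
 * Rough sketch only: the callback type, struct fields and helpers below
 * are hypothetical, not the actual librte_vhost definitions.
 */
#include <stdint.h>
#include <rte_mbuf.h>

typedef uint16_t (*vhost_dma_xmit_t)(int vid, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t count);

struct virtio_net {
	/* ... existing device state ... */
	vhost_dma_xmit_t dma_enqueue;	/* set by lib/librte_vhost/ioat.c */
	vhost_dma_xmit_t dma_dequeue;
};

/* Stand-ins for the existing device lookup and SW enqueue path. */
struct virtio_net *vhost_get_device(int vid);
uint16_t vhost_sw_enqueue(struct virtio_net *dev, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t count);

static inline uint16_t
vhost_enqueue_burst_sketch(int vid, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t count)
{
	struct virtio_net *dev = vhost_get_device(vid);

	/* If a DMA callback is registered, use it instead of the SW path. */
	if (dev->dma_enqueue != NULL)
		return dev->dma_enqueue(vid, queue_id, pkts, count);

	return vhost_sw_enqueue(dev, queue_id, pkts, count);
}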

The problem is that the current semantics of the rte_vhost_enqueue/dequeue
APIs conflict with the I/OAT-accelerated data path. To improve performance,
the I/OAT works in an asynchronous manner: the CPU just submits copy jobs
to the I/OAT without waiting for their completion. For
rte_vhost_enqueue_burst, users cannot reuse the enqueued pktmbufs when it
returns, as the I/OAT may still be using them. For rte_vhost_dequeue_burst,
users will not get incoming packets for which the I/OAT is still performing
copies. As you can see, when I/OAT acceleration is enabled, the semantics
of the two APIs change. If we keep the same API names but change their
semantics, this may confuse users, IMHO.
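
For example, with asynchronous copies the application would need a
completion-polling step before it can safely free the transmitted mbufs;
the helper below is hypothetical and only illustrates that semantic change:

/*
 * Hypothetical usage sketch: vhost_dma_poll_completed() does not exist;
 * it only illustrates why the enqueue semantics cannot stay the same.
 */
#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_vhost.h>

/* Hypothetical helper returning mbufs whose DMA copies have finished. */
uint16_t vhost_dma_poll_completed(int vid, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t count);

static void
xmit_burst(int vid, struct rte_mbuf **pkts, uint16_t count)
{
	struct rte_mbuf *done[32];
	uint16_t n, i;

	/* Submits the copies; large ones may still be in flight on return. */
	rte_vhost_enqueue_burst(vid, 0, pkts, count);

	/*
	 * With the SW data path the mbufs could be freed right here; with
	 * DMA acceleration they must be kept until the copies complete.
	 */
	n = vhost_dma_poll_completed(vid, 0, done, 32);
	for (i = 0; i < n; i++)
		rte_pktmbuf_free(done[i]);
}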

Thanks,
Jiayu

> 
> What do you think?
> 
> Thanks,
> Maxime
> > Thanks,
> > Jiayu
> >
> >>
> >> Kind regards,
> >> Maxime
> >>
> >>> Jiayu Hu (4):
> >>>   vhost: populate guest memory for DMA-accelerated vhost-user
> >>>   net/vhost: setup vrings for DMA-accelerated datapath
> >>>   net/vhost: leverage DMA engines to accelerate Tx operations
> >>>   doc: add I/OAT acceleration support for vhost-user PMD
> >>>
> >>>  doc/guides/nics/vhost.rst         |  14 +
> >>>  drivers/Makefile                  |   2 +-
> >>>  drivers/net/vhost/Makefile        |   6 +-
> >>>  drivers/net/vhost/internal.h      | 160 +++++++
> >>>  drivers/net/vhost/meson.build     |   5 +-
> >>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
> >>>  drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
> >>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
> >>>  lib/librte_vhost/rte_vhost.h      |   1 +
> >>>  lib/librte_vhost/socket.c         |  20 +
> >>>  lib/librte_vhost/vhost.h          |   2 +
> >>>  lib/librte_vhost/vhost_user.c     |   3 +-
> >>>  12 files changed, 1597 insertions(+), 73 deletions(-)
> >>>  create mode 100644 drivers/net/vhost/internal.h
> >>>  create mode 100644 drivers/net/vhost/virtio_net.c
> >>>  create mode 100644 drivers/net/vhost/virtio_net.h
> >>>
> >


