[dpdk-dev] [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD

Maxime Coquelin maxime.coquelin at redhat.com
Thu Mar 19 10:10:28 CET 2020


Hi Jiayu,

On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> Hi Maxime,
> 
> Thanks for your comments. Replies are inline.
> 
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin at redhat.com>
>> Sent: Tuesday, March 17, 2020 5:54 PM
>> To: Hu, Jiayu <jiayu.hu at intel.com>; dev at dpdk.org
>> Cc: Ye, Xiaolong <xiaolong.ye at intel.com>; Wang, Zhihong
>> <zhihong.wang at intel.com>
>> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD
>>
>> Hi Jiayu,
>>
>> On 3/17/20 10:21 AM, Jiayu Hu wrote:
>>> In vhost-user PMD's Tx operations, where data movement is heavily
>>> involved, performing large memory copies usually takes up a major part
>>> of CPU cycles and becomes the hot spot. To offload expensive memory
>>> operations from the CPU, this patch set proposes to leverage DMA
>>> engines, e.g. I/OAT, a DMA engine in Intel processors, to accelerate
>>> large copies for vhost-user.
>>>
>>> Large copies are offloaded from the CPU to the DMA engine in an
>>> asynchronous manner. The CPU just submits copy jobs to the DMA engine
>>> without waiting for copy completion. Thus, there is no CPU intervention
>>> during data transfer; we can save precious CPU cycles and improve the
>>> overall throughput for vhost-user PMD based applications, like OVS.
>>> During packet transmission, large copies are offloaded to the DMA
>>> engine while small copies are performed by the CPU, due to the startup
>>> overhead associated with the DMA engine.
>>>
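(As an illustration of the large/small split described above: it boils down
to a per-copy dispatch like the sketch below. The threshold value and the
ioat_submit_copy() helper are made-up names for illustration, not code from
this patch set.)

#include <string.h>
#include <stddef.h>

/* Hypothetical threshold: copies at or above this size are assumed to be
 * worth the DMA submission/startup overhead; smaller ones stay on the CPU. */
#define DMA_COPY_THRESHOLD 256

/* Placeholder for an asynchronous I/OAT submission built on the rawdev-based
 * ioat driver: it only enqueues the copy job and returns immediately,
 * completions are polled later. Returns 0 on success, negative on failure
 * (e.g. the DMA ring is full). */
int ioat_submit_copy(int dma_dev_id, void *dst, const void *src, size_t len);

static inline void
vhost_dma_copy(int dma_dev_id, void *dst, const void *src, size_t len)
{
	if (len >= DMA_COPY_THRESHOLD &&
	    ioat_submit_copy(dma_dev_id, dst, src, len) == 0)
		return; /* large copy offloaded, the CPU does not wait */

	/* small copy, or DMA submission failed: plain CPU memcpy */
	memcpy(dst, src, len);
}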
>>> The vhost-user PMD is designed to support various DMA engines, but it
>>> only supports I/OAT devices currently. In addition, I/OAT acceleration
>>> is only enabled for Tx operations of split rings. Users can explicitly
>>> assign an I/OAT device to a queue via the parameter 'dmas'. However,
>>> one I/OAT device can only be used by one queue, and a queue can use
>>> only one I/OAT device at a time.
>>>
>>> We measured the performance in testpmd. With 1024-byte packets,
>>> compared with the original SW data path, the DMA-enabled vhost-user
>>> PMD improves throughput by around 20%~30% in the VM2VM and PVP cases.
>>> Furthermore, with larger packets, the throughput improvement is even
>>> higher.
>>
>>
>> I'm not sure it should be done like that for several reasons.
>>
>> First, it seems really complex for the user to get the command line
>> right. There is no mention in the doc patch of how to bind the DMAs to
>> the DPDK application. Are all the DMAs on the system capable of doing
>> this?
> 
> DMA engines in Intel CPUs are able to move data within system memory.
> Currently we have I/OAT, and we will have DSA in the future.

OK, can you give me an example of how many I/OAT instances there are on
a given CPU?

>> I think it should be made transparent to the user, who should not have
>> to specify the DMA device address on the command line. The user should
>> just pass a devarg specifying that they want to use DMAs, if available.
> 
> What do you think of replacing the DMA address with specific DMA
> capabilities, like "dmas=[txq0 at DMACOPY]"? As I/OAT only supports data
> movement, we can just provide a basic DMA copy capability now. But when
> there are more DMA devices, we can add capabilities in devargs later.
"dmas=[txq0 at DMACOPY]" is still too complex IMHO. We should just have a
flag to enable or disable DMA (tx_dma=1 / tx_dma=0), and it would apply
to all queues, as we do for zero-copy.
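
For illustration, a flag like that could be parsed with the existing
rte_kvargs helpers in the vhost PMD probe path; the "tx_dma" key and the
helper below are only a sketch of the suggestion, not existing code (the
rte_kvargs_* calls themselves are the real API):

#include <stdlib.h>
#include <rte_kvargs.h>

#define ETH_VHOST_TX_DMA_ARG "tx_dma" /* hypothetical devarg key */

static int
open_int(const char *key, const char *value, void *extra_args)
{
	int *n = extra_args;

	(void)key;
	*n = atoi(value);
	return 0;
}

/* Parse vdev arguments such as "iface=/tmp/sock0,tx_dma=1"
 * (i.e. --vdev 'net_vhost0,iface=/tmp/sock0,tx_dma=1') and report whether
 * DMA acceleration was requested for the Tx queues. */
static int
vhost_parse_tx_dma(const char *devargs, int *tx_dma)
{
	static const char *const valid_args[] = {
		"iface", "queues", ETH_VHOST_TX_DMA_ARG, NULL
	};
	struct rte_kvargs *kvlist;
	int ret = 0;

	*tx_dma = 0;
	kvlist = rte_kvargs_parse(devargs, valid_args);
	if (kvlist == NULL)
		return -1;

	if (rte_kvargs_count(kvlist, ETH_VHOST_TX_DMA_ARG) == 1)
		ret = rte_kvargs_process(kvlist, ETH_VHOST_TX_DMA_ARG,
					 open_int, tx_dma);

	rte_kvargs_free(kvlist);
	return ret;
}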

>>
>> Second, it looks too vendor-specific. IMHO, we should have a DMA
>> framework, so that the driver can request DMA channels based on
>> capabilities.
> 
> We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> a rawdev. IMO, it will be very hard to provide a generic DMA abstraction
> currently. In addition, the I/OAT-specific API is called inside the
> vhost-user PMD; we can replace these function calls when we have a DMA
> framework in the future. Users will be unaware of the changes. Does that
> make sense to you?

Having an abstraction might be hard, but it does not seem impossible.
Such a DMA abstraction has been done in the kernel for IOAT. For example:
https://lore.kernel.org/patchwork/cover/56714/
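
To sketch what such an abstraction could look like on the DPDK side (none of
this exists today, all names below are made up): a minimal copy-oriented DMA
framework, loosely modelled on the kernel's dmaengine, could expose
capability-based channel requests and asynchronous copy submission, with the
I/OAT rawdev as its first backend:

#include <stddef.h>
#include <stdint.h>

#define DMA_CAP_MEMCPY (1u << 0) /* plain memory-to-memory copy */

/* Per-channel operations provided by a backend driver (e.g. I/OAT). */
struct dma_chan_ops {
	/* Enqueue an asynchronous copy; returns 0 or a negative errno. */
	int (*copy)(void *chan_priv, void *dst, const void *src, size_t len);
	/* Poll for completions; returns the number of finished copies. */
	int (*completed)(void *chan_priv, int max);
};

struct dma_chan {
	const struct dma_chan_ops *ops;
	void *priv;     /* backend state, e.g. an ioat rawdev id */
	uint32_t caps;  /* DMA_CAP_* bits advertised by the driver */
};

/* Consumer side: request a channel by capability (NULL if none available),
 * submit copies, poll completions, release when done. */
struct dma_chan *dma_request_channel(uint32_t caps);
void dma_release_channel(struct dma_chan *chan);

static inline int
dma_copy(struct dma_chan *chan, void *dst, const void *src, size_t len)
{
	return chan->ops->copy(chan->priv, dst, src, len);
}

static inline int
dma_completed(struct dma_chan *chan, int max)
{
	return chan->ops->completed(chan->priv, max);
}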

>>
>> Also, I don't think implementing ring processing in the Vhost PMD is
>> welcome; the Vhost PMD should just be a wrapper for the Vhost library.
>> Doing that in the Vhost PMD causes code duplication and will be a
>> maintenance burden in the long run.
>>
>> As IOAT is a kind of acceleration, why not implement it through the vDPA
>> framework? The vDPA framework should be extended to support this kind of
>> acceleration, which requires some CPU processing, as opposed to the full
>> offload of ring processing that it only supports today.
> 
> The main reason for implementing the data path in the vhost PMD is to avoid
> impacting the SW data path in the vhost library. Even if we implement it as
> an instance of vDPA, we still have to implement the data path in a new vdev
> PMD, as DMA just accelerates memory copies and all ring operations have to
> be done by the CPU. There is still the code duplication issue.

Ok, so what about:

Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
dequeue.

Adding lib/librte_vhost/ioat.c, which would implement the dma_enqueue and
dma_dequeue callbacks for IOAT. As it would live in the vhost lib
directory, it would be easy to refactor the code to share as much as
possible and thus avoid code duplication.

In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
call it instead of the SW datapath. It adds a few cycles, but this is
much more sane IMHO. A rough sketch of the enqueue side is below.
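
Roughly like this; the dma_enqueue field and vhost_dma_enqueue_t are only
illustrative names for the callback I have in mind, while
rte_vhost_enqueue_burst() is the existing entry point it would hook into:

/* In lib/librte_vhost/vhost.h (sketch): a per-device DMA enqueue hook,
 * registered by the proposed lib/librte_vhost/ioat.c when an I/OAT
 * channel is bound to the device, and left NULL otherwise. */
typedef uint16_t (*vhost_dma_enqueue_t)(struct virtio_net *dev,
		uint16_t queue_id, struct rte_mbuf **pkts, uint16_t count);

struct virtio_net {
	/* ... existing fields ... */
	vhost_dma_enqueue_t dma_enqueue; /* hypothetical DMA callback */
};

/* In lib/librte_vhost/virtio_net.c (sketch): dispatch in the existing
 * API entry point, falling back to the current SW datapath when no DMA
 * callback is registered. */
uint16_t
rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t count)
{
	struct virtio_net *dev = get_device(vid);

	if (dev == NULL)
		return 0;

	if (dev->dma_enqueue != NULL)
		return dev->dma_enqueue(dev, queue_id, pkts, count);

	/* existing SW enqueue path */
	return virtio_dev_rx(dev, queue_id, pkts, count);
}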

What do you think?

Thanks,
Maxime
> Thanks,
> Jiayu
> 
>>
>> Kind regards,
>> Maxime
>>
>>> Jiayu Hu (4):
>>>   vhost: populate guest memory for DMA-accelerated vhost-user
>>>   net/vhost: setup vrings for DMA-accelerated datapath
>>>   net/vhost: leverage DMA engines to accelerate Tx operations
>>>   doc: add I/OAT acceleration support for vhost-user PMD
>>>
>>>  doc/guides/nics/vhost.rst         |  14 +
>>>  drivers/Makefile                  |   2 +-
>>>  drivers/net/vhost/Makefile        |   6 +-
>>>  drivers/net/vhost/internal.h      | 160 +++++++
>>>  drivers/net/vhost/meson.build     |   5 +-
>>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
>>>  drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
>>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
>>>  lib/librte_vhost/rte_vhost.h      |   1 +
>>>  lib/librte_vhost/socket.c         |  20 +
>>>  lib/librte_vhost/vhost.h          |   2 +
>>>  lib/librte_vhost/vhost_user.c     |   3 +-
>>>  12 files changed, 1597 insertions(+), 73 deletions(-)
>>>  create mode 100644 drivers/net/vhost/internal.h
>>>  create mode 100644 drivers/net/vhost/virtio_net.c
>>>  create mode 100644 drivers/net/vhost/virtio_net.h
>>>
> 


