[dpdk-dev] [RFC] Accelerating Data Movement for DPDK vHost with DMA Engines

Fu, Patrick patrick.fu at intel.com
Fri Apr 17 09:26:34 CEST 2020


Background
====================================
The DPDK vhost library implements a user-space VirtIO net backend that allows host applications to communicate directly with VirtIO front-ends in VMs and containers. However, every vhost enqueue/dequeue operation has to copy packet buffers between guest and host memory, and the overhead of copying large amounts of data makes the vhost backend the I/O bottleneck. DMA engines, including un-core DMA accelerators such as Crystal Beach DMA (CBDMA) and Data Streaming Accelerator (DSA), as well as general-purpose DMA engines on discrete cards, are extremely efficient at moving data within system memory. We therefore propose a set of asynchronous DMA data movement APIs in the vhost library for DMA acceleration. Offloading packet copies in the vhost data-path from the CPU to the DMA engine not only accelerates data transfers, but also saves precious CPU core resources.

New API Overview
====================================
The proposed APIs in the vhost library allow various DMA engines to accelerate data transfers in the data-path. For higher performance, DMA engines work asynchronously, so that DMA data transfers and CPU computations execute in parallel. The proposed API consists of a control path API and a data path API. The control path API includes the registration API and the DMA operation callbacks, and the data path API includes the async API. To remove the dependency on vendor-specific DMA engines, the DMA operation callbacks provide generic DMA data transfer abstractions. To support asynchronous DMA data movement, the new async API provides asynchronous ring operation semantics in the data-path. To enable or disable DMA acceleration for virtqueues, users call the registration API to register/unregister DMA callback implementations with the vhost library and to bind DMA channels to virtqueues. The DMA channels used by virtqueues are provided by DPDK applications and are backed by virtual or physical DMA devices.
The proposed APIs consist of 3 sub-sets:
1. DMA Registration APIs
2. DMA Operation Callbacks
3. Async Data APIs

DMA Registration APIs
==================================== 
DMA acceleration is enabled on a per-queue basis. DPDK applications need to explicitly decide whether a virtqueue needs DMA acceleration and which DMA channel to use. In addition, a DMA channel is dedicated to a single virtqueue and cannot be bound to multiple virtqueues at the same time. To enable DMA acceleration for a virtqueue, DPDK applications first implement the DMA operation callbacks for a specific DMA type (e.g. CBDMA), then register the callbacks with the vhost library and bind a DMA channel to the virtqueue, and finally use the new async API to perform data-path operations on the virtqueue.
The definitions of the registration API are shown below:
int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
					struct rte_vhost_async_channel_ops *ops);

int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);

"rte_vhost_async_channel_register" registers the implemented DMA operation callbacks with the vhost library and binds a DMA channel to a virtqueue. DPDK applications must implement the corresponding DMA operation callbacks for each DMA engine they use. To enable DMA acceleration for a virtqueue, DPDK applications need to explicitly call "rte_vhost_async_channel_register" for that virtqueue. The "ops" argument points to the implementation of the callbacks.
"rte_vhost_async_channel_unregister" unregisters the DMA operation callbacks and unbinds the DMA channel from the virtqueue. If a virtqueue is not bound to a DMA channel, it uses the SW data-path without DMA acceleration.
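
To make the binding rules concrete, here is a minimal toy model of the registration semantics. It does not use the vhost library: every "toy_" name, the table sizes, and the choice to reject a second registration on an already bound queue are assumptions made for this sketch only. Each callback table instance is treated as standing for one DMA channel, so the same instance cannot be bound to two virtqueues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_VIDS   4	/* hypothetical limits, for the sketch only */
#define TOY_MAX_QUEUES 8

/* Hypothetical stand-in for the callback table; one instance = one DMA channel. */
struct toy_async_channel_ops {
	int channel_id;
};

/* One slot per (vid, queue_id); NULL means "no DMA channel bound". */
static struct toy_async_channel_ops *toy_bound[TOY_MAX_VIDS][TOY_MAX_QUEUES];

static int toy_valid(int vid, uint16_t queue_id)
{
	return vid >= 0 && vid < TOY_MAX_VIDS && queue_id < TOY_MAX_QUEUES;
}

/* Bind a channel to one virtqueue. Returns 0 on success, -1 on error. */
static int toy_channel_register(int vid, uint16_t queue_id,
				struct toy_async_channel_ops *ops)
{
	int v, q;

	if (!toy_valid(vid, queue_id) || ops == NULL)
		return -1;
	/* Assumption: a queue that is already bound must be unregistered first. */
	if (toy_bound[vid][queue_id] != NULL)
		return -1;
	/* A DMA channel is dedicated to a single virtqueue. */
	for (v = 0; v < TOY_MAX_VIDS; v++)
		for (q = 0; q < TOY_MAX_QUEUES; q++)
			if (toy_bound[v][q] == ops)
				return -1;
	toy_bound[vid][queue_id] = ops;
	return 0;
}

/* Unbind the channel from a virtqueue. Returns 0 on success, -1 on error. */
static int toy_channel_unregister(int vid, uint16_t queue_id)
{
	if (!toy_valid(vid, queue_id) || toy_bound[vid][queue_id] == NULL)
		return -1;
	toy_bound[vid][queue_id] = NULL;
	return 0;
}

/* Data-path decision: 1 = DMA channel bound, 0 = SW (CPU copy) data-path. */
static int toy_queue_uses_dma(int vid, uint16_t queue_id)
{
	return toy_valid(vid, queue_id) && toy_bound[vid][queue_id] != NULL;
}
```

In this model a queue that was never registered, or that has been unregistered, reports 0 from toy_queue_uses_dma() and would take the SW data-path.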

DMA Operation Callbacks
==================================== 
The definitions of the DMA operation callbacks are shown below:
struct iovec {	/** this is kernel uapi structure */
	void *iov_base;	/** buffer address */
	size_t iov_len;	/** buffer length */
};

struct iov_iter {	
	size_t iov_offset;
	size_t count;		/** total bytes of a packet */
	struct iovec *iov;	/** array of data buffers */
	unsigned long nr_segs;	/** number of iovec structures */
	uintptr_t usr_data;	/** app specific memory handler*/
};

struct dma_trans_desc {
	struct iov_iter *src; /** source memory iov_iter*/
	struct iov_iter *dst; /** destination memory iov_iter*/
};

struct dma_trans_status {
	uintptr_t src_usr_data; /** trans completed memory handler*/
	uintptr_t dst_usr_data; /** trans completed memory handler*/
};

struct rte_vhost_async_channel_ops {
	/** Instruct a DMA channel to perform copies for a batch of packets */
	int (*transfer_data)(struct dma_trans_desc *descs,
				uint16_t count);

	/** check copy-completed packets from a DMA channel */
	int (*check_completed_copies)(struct dma_trans_status *usr_data,
					uint16_t max_packets);
};

The first callback, "transfer_data", submits a batch of packet copies to a DMA channel. As a packet's source or destination buffer can be either a vector of buffers or a single data stream, we use "struct dma_trans_desc" to describe the source and destination buffers of a packet. Copying a packet means moving data from the source iov_iter structure to the destination iov_iter structure. "count" is the number of packets to copy.
The second callback, "check_completed_copies", queries the completion status of the DMA channel. A "usr_data" member is embedded in the "iov_iter" structure and serves as a unique identifier of the memory region described by that "iov_iter". As the source/destination buffers can be scatter-gather lists, the DMA channel may complete its copies out of order. Once all copies of an iov_iter have been completed by the DMA channel, "check_completed_copies" should return the associated "usr_data" in a "dma_trans_status" structure.
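
As an illustration of the callback contract, the sketch below implements both callbacks in software, with memcpy standing in for a DMA channel. It is not a CBDMA implementation. The structures mirror the definitions above; the "sw_" names, the completion ring and its size are hypothetical. It also assumes that "iov_offset" is a byte offset into the first segment, that "count" on the source gives the bytes to copy, and that "transfer_data" returns the number of packets accepted, none of which is spelled out by the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Mirrors of the proposed structures. */
struct iov_iter {
	size_t iov_offset;
	size_t count;
	struct iovec *iov;
	unsigned long nr_segs;
	uintptr_t usr_data;
};
struct dma_trans_desc { struct iov_iter *src; struct iov_iter *dst; };
struct dma_trans_status { uintptr_t src_usr_data; uintptr_t dst_usr_data; };
struct rte_vhost_async_channel_ops {
	int (*transfer_data)(struct dma_trans_desc *descs, uint16_t count);
	int (*check_completed_copies)(struct dma_trans_status *usr_data,
					uint16_t max_packets);
};

#define SW_RING_SIZE 64	/* hypothetical; must divide 2^32 */

static struct dma_trans_status sw_done[SW_RING_SIZE];
static unsigned int sw_head, sw_tail;	/* free-running producer/consumer */

/* Copy src->count bytes, walking both scatter-gather lists in step. */
static int sw_copy_one(const struct iov_iter *src, const struct iov_iter *dst)
{
	unsigned long si = 0, di = 0;
	size_t so = src->iov_offset, dof = dst->iov_offset;
	size_t left = src->count;

	if (dst->count < src->count)
		return -1;
	while (left > 0) {
		size_t savail, davail, n;

		if (si >= src->nr_segs || di >= dst->nr_segs)
			return -1;
		if (so > src->iov[si].iov_len || dof > dst->iov[di].iov_len)
			return -1;
		savail = src->iov[si].iov_len - so;
		davail = dst->iov[di].iov_len - dof;
		if (savail == 0) { si++; so = 0; continue; }
		if (davail == 0) { di++; dof = 0; continue; }
		n = left < savail ? left : savail;
		if (davail < n)
			n = davail;
		memcpy((char *)dst->iov[di].iov_base + dof,
		       (const char *)src->iov[si].iov_base + so, n);
		so += n; dof += n; left -= n;
	}
	return 0;
}

/* Copies synchronously, but reports completion only via the ring. */
static int sw_transfer_data(struct dma_trans_desc *descs, uint16_t count)
{
	uint16_t i;

	for (i = 0; i < count; i++) {
		if (sw_head - sw_tail == SW_RING_SIZE)
			break;	/* completion ring full */
		if (sw_copy_one(descs[i].src, descs[i].dst) != 0)
			break;
		sw_done[sw_head % SW_RING_SIZE] = (struct dma_trans_status){
			descs[i].src->usr_data, descs[i].dst->usr_data };
		sw_head++;
	}
	return i;	/* packets accepted (assumed return convention) */
}

/* Hand back the usr_data pairs of packets whose copies are complete. */
static int sw_check_completed_copies(struct dma_trans_status *usr_data,
				     uint16_t max_packets)
{
	uint16_t n = 0;

	while (n < max_packets && sw_tail != sw_head)
		usr_data[n++] = sw_done[sw_tail++ % SW_RING_SIZE];
	return n;
}

static const struct rte_vhost_async_channel_ops sw_ops = {
	.transfer_data = sw_transfer_data,
	.check_completed_copies = sw_check_completed_copies,
};
```

A real DMA-backed implementation would only queue the copies in transfer_data and report them from check_completed_copies once the hardware signals completion, possibly in a different order than they were submitted.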

Async Data APIs
==================================== 
The definitions of the new enqueue APIs are shown below:
uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id, struct rte_mbuf **pkts, uint16_t count);

uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id, struct rte_mbuf **pkts, uint16_t count);

"rte_vhost_submit_enqueue_burst" enqueues a batch of packets to a virtqueue and hands ownership of the enqueued packets to the vhost library. DPDK applications cannot reuse the enqueued packets until they get the ownership back. For a virtqueue with DMA acceleration enabled via "rte_vhost_async_channel_register", "rte_vhost_submit_enqueue_burst" uses the bound DMA channel to perform packet copies; moreover, the function is non-blocking: it only submits packet copies to the DMA channel and does not wait for completion. For a virtqueue without DMA acceleration, "rte_vhost_submit_enqueue_burst" uses the SW data-path, where the CPU performs packet copies. It is worth noting that DPDK applications cannot directly reuse packet buffers enqueued by "rte_vhost_submit_enqueue_burst", even when the SW data-path is used.

"rte_vhost_poll_enqueue_completed" returns ownership of the packets whose copies have all completed, either by the DMA channel or by the CPU. It is a non-blocking function and does not wait for outstanding DMA copies to complete. After getting back the ownership of packets enqueued by "rte_vhost_submit_enqueue_burst", DPDK applications can further process the packet buffers, e.g. free the pktmbufs.
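
The ownership hand-off can be illustrated with a small self-contained model. Nothing here calls the vhost library: "struct toy_pkt" stands in for an mbuf, toy_dma_progress() stands in for the DMA engine completing copies in the background, and all "toy_" names and sizes are hypothetical. The sketch also assumes that packets not accepted by the submit call stay owned by the application.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INFLIGHT_MAX 32	/* hypothetical in-flight limit; divides 2^32 */

struct toy_pkt { int id; int freed; };	/* stand-in for struct rte_mbuf */

static struct toy_pkt *toy_inflight[TOY_INFLIGHT_MAX];
/* Free-running counters: submitted >= copied >= returned. */
static unsigned int toy_submitted, toy_copied, toy_returned;

/* Stand-in for rte_vhost_submit_enqueue_burst(): takes ownership of the
 * accepted packets and returns without waiting for the copies. */
static uint16_t toy_submit_enqueue_burst(struct toy_pkt **pkts, uint16_t count)
{
	uint16_t n = 0;

	while (n < count && toy_submitted - toy_returned < TOY_INFLIGHT_MAX)
		toy_inflight[toy_submitted++ % TOY_INFLIGHT_MAX] = pkts[n++];
	return n;
}

/* Stand-in for the DMA channel finishing up to "copies" packet copies. */
static void toy_dma_progress(unsigned int copies)
{
	while (copies-- > 0 && toy_copied != toy_submitted)
		toy_copied++;
}

/* Stand-in for rte_vhost_poll_enqueue_completed(): returns ownership of
 * packets whose copies are done; never waits for outstanding copies. */
static uint16_t toy_poll_enqueue_completed(struct toy_pkt **pkts, uint16_t count)
{
	uint16_t n = 0;

	while (n < count && toy_returned != toy_copied)
		pkts[n++] = toy_inflight[toy_returned++ % TOY_INFLIGHT_MAX];
	return n;
}

/* Application side: submit a burst, then free packets only after the
 * poll call has handed them back. Returns the number of packets freed. */
static unsigned int toy_send_and_reclaim(struct toy_pkt **pkts, uint16_t count)
{
	struct toy_pkt *done[8];
	unsigned int freed = 0;
	uint16_t sent = toy_submit_enqueue_burst(pkts, count);

	while (freed < sent) {
		uint16_t n, i;

		toy_dma_progress(3);	/* the engine works in the background */
		n = toy_poll_enqueue_completed(done, 8);
		for (i = 0; i < n; i++) {
			done[i]->freed = 1;	/* rte_pktmbuf_free() in a real app */
			freed++;
		}
	}
	return freed;
}
```

A real application would not spin in the reclaim loop; it would interleave the poll call with other work, for example receiving and submitting the next burst, which is where the CPU savings come from.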

Sample Work Flow
==================================== 
Some DMA engines, like CBDMA, need to use physical addresses and do not support I/O page faults. In addition, some guests may want to prevent their memory from being swapped out. For these cases, guest memory can be pinned by setting a new flag "RTE_VHOST_USER_DMA_COPY" in rte_vhost_driver_register(). Here is an example of how to use CBDMA to accelerate the vhost enqueue operation:
Step1: implement DMA operation callbacks for CBDMA via the IOAT PMD
Step2: call rte_vhost_driver_register with the flag "RTE_VHOST_USER_DMA_COPY" (pin guest memory)
Step3: call rte_vhost_async_channel_register to register the DMA channel
Step4: call rte_vhost_submit_enqueue_burst to enqueue packets
Step5: call rte_vhost_poll_enqueue_completed to get back the ownership of the packets whose copies are completed
Step6: call rte_pktmbuf_free to free the packet mbufs

Signed-off-by: Patrick Fu <patrick.fu at intel.com>
Signed-off-by: Jiayu Hu <jiayu.hu at intel.com> 


