[dpdk-dev] [PATCH 2/9] vhost: provide helpers for virtio ring relay

Wang, Xiao W xiao.w.wang at intel.com
Wed Dec 12 07:51:28 CET 2018


Hi,

> -----Original Message-----
> From: Bie, Tiwei
> Sent: Monday, December 3, 2018 10:23 PM
> To: Wang, Xiao W <xiao.w.wang at intel.com>
> Cc: maxime.coquelin at redhat.com; dev at dpdk.org; Wang, Zhihong
> <zhihong.wang at intel.com>; Ye, Xiaolong <xiaolong.ye at intel.com>
> Subject: Re: [PATCH 2/9] vhost: provide helpers for virtio ring relay
> 
> On Wed, Nov 28, 2018 at 05:46:00PM +0800, Xiao Wang wrote:
> [...]
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Synchronize the available ring from guest to mediate ring, help to
> > + * check desc validity to protect against malicious guest driver.
> > + *
> > + * @param vid
> > + *  vhost device id
> > + * @param qid
> > + *  vhost queue id
> > + * @param m_vring
> > + *  mediate virtio ring pointer
> > + * @return
> > + *  number of synced available entries on success, -1 on failure
> > + */
> > +int __rte_experimental
> > +rte_vdpa_relay_avail_ring(int vid, int qid, struct vring *m_vring);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Synchronize the used ring from mediate ring to guest, log dirty
> > + * page for each Rx buffer used.
> > + *
> > + * @param vid
> > + *  vhost device id
> > + * @param qid
> > + *  vhost queue id
> > + * @param m_vring
> > + *  mediate virtio ring pointer
> > + * @return
> > + *  number of synced used entries on success, -1 on failure
> > + */
> > +int __rte_experimental
> > +rte_vdpa_relay_used_ring(int vid, int qid, struct vring *m_vring);
> 
> Above APIs are split ring specific. We also need to take
> packed ring into consideration.

After studying the current packed ring description, I have several thoughts:
1. These APIs are helpers for setting up a mediate relay layer that does dirty page logging on behalf of the
 device, so we may not need this kind of ring relay for packed ring at all. SW-assisted VDPA needs a way to
 intercept the frontend-backend communication: as you can see in this patch set, SW captures the device
 interrupt, then parses the vring and logs the dirty pages. We set up the mediate vring so that the relay SW
 controls its interrupt suppression structure, which guarantees that the relay SW can intercept the device
 interrupt.

2. One new point about the packed ring is that it separates the event suppression structure from the
 descriptor ring. So in this case, we can just set up a mediate event suppression structure to intercept
 event notifications.

BTW, one troublesome point about the packed ring is that it's hard for a mediate SW to handle the
 "buffer id" quickly. The guest virtio driver understands this id well because it keeps internal info about
 each id (e.g. chain list length), but the relay SW has to parse the packed ring again, which is not efficient.

3. In the split vring, the relay SW reuses the guest desc ring, and descs are not written by DMA, so no
 logging is needed for them. In the packed vring, descs are written by DMA, so logging the desc ring is a
 new requirement.

Packed ring is quite different, so it may need a very different mechanism rather than following a vring
 relay API. Also, from a testing point of view, even if we come up with a new efficient implementation for
 packed ring VDPA, it is hard to test it with HW: testing needs HW that supports packed ring DMA and the
 get_vring_base/set_vring_base interface.
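
To make idea 2 more concrete, here is a rough sketch of what a mediate event suppression structure
could look like. It is only an illustration under assumptions, not code from this patch set: the
struct and function names (pring_event_suppress, relay_event_ctx, relay_event_init,
relay_should_notify_guest) are hypothetical, the two-field layout and flag values follow the packed
ring description, and the fields are little-endian in the real layout. The relay hands the device its
own structure with notifications always enabled, and later consults the guest's structure to decide
whether to forward the interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Event flag values from the packed ring description. */
#define EVT_FLAGS_ENABLE  0x0 /* always notify */
#define EVT_FLAGS_DISABLE 0x1 /* never notify */
#define EVT_FLAGS_DESC    0x2 /* notify when a specific descriptor is reached */

/* Packed ring event suppression structure (hypothetical name). */
struct pring_event_suppress {
	uint16_t off_wrap; /* bits 0-14: descriptor offset, bit 15: wrap counter */
	uint16_t flags;    /* one of EVT_FLAGS_* */
};

/* Hypothetical per-queue relay state. */
struct relay_event_ctx {
	/* Given to the device in place of the guest's structure. */
	struct pring_event_suppress mediate_evt;
	/* The structure the guest driver actually writes. */
	const struct pring_event_suppress *guest_evt;
};

/* Always ask the device for interrupts so the relay can intercept them. */
static void
relay_event_init(struct relay_event_ctx *ctx,
		const struct pring_event_suppress *guest_evt)
{
	assert(ctx != NULL && guest_evt != NULL);
	ctx->mediate_evt.off_wrap = 0;
	ctx->mediate_evt.flags = EVT_FLAGS_ENABLE;
	ctx->guest_evt = guest_evt;
}

/*
 * Called after the relay has logged dirty pages for an interrupt.
 * Simplification: EVT_FLAGS_DESC is treated as "notify"; a real relay
 * would compare off_wrap against the used position.
 */
static bool
relay_should_notify_guest(const struct relay_event_ctx *ctx)
{
	return ctx->guest_evt->flags != EVT_FLAGS_DISABLE;
}
```

With this shape, the descriptor ring itself does not need to be relayed; only the notification path
goes through the relay SW. Logging of the DMA-written desc ring (idea 3) is not covered here.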

> 
> >  #endif /* _RTE_VDPA_H_ */
> [...]
> > diff --git a/lib/librte_vhost/vdpa.c b/lib/librte_vhost/vdpa.c
> > index e7d849ee0..e41117776 100644
> > --- a/lib/librte_vhost/vdpa.c
> > +++ b/lib/librte_vhost/vdpa.c
> > @@ -122,3 +122,176 @@ rte_vdpa_get_device_num(void)
> >  {
> >  	return vdpa_device_num;
> >  }
> > +
> > +static int
> > +invalid_desc_check(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +		uint64_t desc_iova, uint64_t desc_len, uint8_t perm)
> > +{
> > +	uint64_t desc_addr, desc_chunck_len;
> > +
> > +	while (desc_len) {
> > +		desc_chunck_len = desc_len;
> > +		desc_addr = vhost_iova_to_vva(dev, vq,
> > +				desc_iova,
> > +				&desc_chunck_len,
> > +				perm);
> > +
> > +		if (!desc_addr)
> > +			return -1;
> > +
> > +		desc_len -= desc_chunck_len;
> > +		desc_iova += desc_chunck_len;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_vdpa_relay_avail_ring(int vid, int qid, struct vring *m_vring)
> > +{
> > +	struct virtio_net *dev = get_device(vid);
> > +	uint16_t idx, idx_m, desc_id;
> > +	struct vring_desc desc;
> > +	struct vhost_virtqueue *vq;
> > +	struct vring_desc *desc_ring;
> > +	struct vring_desc *idesc = NULL;
> > +	uint64_t dlen;
> > +	int ret;
> > +
> > +	if (!dev)
> > +		return -1;
> > +
> > +	vq = dev->virtqueue[qid];
> 
> Better to also validate qid.
> 
> > +	idx = vq->avail->idx;
> > +	idx_m = m_vring->avail->idx;
> > +	ret = idx - idx_m;
> 
> Need to cast (idx - idx_m) to uint16_t.
> 
> > +
> > +	while (idx_m != idx) {
> > +		/* avail entry copy */
> > +		desc_id = vq->avail->ring[idx_m % vq->size];
> 
> idx_m & (vq->size - 1) should be faster.
> 
> > +		m_vring->avail->ring[idx_m % vq->size] = desc_id;
> > +		desc_ring = vq->desc;
> > +
> > +		if (vq->desc[desc_id].flags & VRING_DESC_F_INDIRECT) {
> > +			dlen = vq->desc[desc_id].len;
> > +			desc_ring = (struct vring_desc *)(uintptr_t)
> > +			vhost_iova_to_vva(dev, vq, vq->desc[desc_id].addr,
> 
> The indent needs to be fixed.
> 
> > +						&dlen,
> > +						VHOST_ACCESS_RO);
> > +			if (unlikely(!desc_ring))
> > +				return -1;
> > +
> > +			if (unlikely(dlen < vq->desc[idx].len)) {
> > +				idesc = alloc_copy_ind_table(dev, vq,
> > +					vq->desc[idx].addr, vq->desc[idx].len);
> > +				if (unlikely(!idesc))
> > +					return -1;
> > +
> > +				desc_ring = idesc;
> > +			}
> > +
> > +			desc_id = 0;
> > +		}
> > +
> > +		/* check if the buf addr is within the guest memory */
> > +		do {
> > +			desc = desc_ring[desc_id];
> > +			if (invalid_desc_check(dev, vq, desc.addr, desc.len,
> > +						VHOST_ACCESS_RW))
> 
> Should check with < 0, otherwise should return bool.
> 
> We may just have RO access.

The desc may refer to a transmit buffer as well as a receive buffer. I agree with the other comments above,
nice catches. I will send a new version.
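
For reference, the points raised above could be sketched roughly as below. This is a standalone
illustration, not the code of the next version: the helper names (relay_pending, ring_slot,
desc_access_perm) are hypothetical, and DESC_F_WRITE, ACCESS_RO and ACCESS_WO are local stand-ins
for VRING_DESC_F_WRITE and the vhost access permission macros. It shows the uint16_t cast on the
index difference, the mask-based slot lookup (valid because a split ring size is a power of two),
and one possible way to pick the access permission per descriptor instead of always using RW.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_F_WRITE 2   /* stand-in for VRING_DESC_F_WRITE */
#define ACCESS_RO    0x1 /* stand-in for the vhost read-only permission */
#define ACCESS_WO    0x2 /* stand-in for the vhost write-only permission */

/*
 * Number of avail entries not yet relayed. Both indexes are free-running
 * 16-bit counters, so the difference must be truncated to uint16_t to
 * stay correct across wrap-around.
 */
static uint16_t
relay_pending(uint16_t idx, uint16_t idx_m)
{
	return (uint16_t)(idx - idx_m);
}

/* Ring slot for a free-running index; size must be a power of two. */
static uint16_t
ring_slot(uint16_t idx, uint16_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (uint16_t)(idx & (size - 1));
}

/*
 * A device-writable descriptor is a receive buffer and needs write
 * access; otherwise it is a transmit buffer and read access is enough.
 */
static uint8_t
desc_access_perm(uint16_t flags)
{
	return (flags & DESC_F_WRITE) ? ACCESS_WO : ACCESS_RO;
}
```

For example, with idx = 2 and idx_m = 65534 the guest has published 4 new entries, which a plain
int subtraction would report as a negative number.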

[...]

BRs,
Xiao

