[dpdk-dev] [PATCH v5 1/6] ethdev: introduce Rx buffer split

Ferruh Yigit ferruh.yigit at intel.com
Wed Oct 14 00:34:16 CEST 2020


On 10/13/2020 8:21 PM, Viacheslav Ovsiienko wrote:
> The DPDK datapath in the transmit direction is very flexible.
> An application can build multi-segment packets and manage
> almost all data aspects - the memory pools the segments
> are allocated from, the segment lengths, the memory attributes
> like external buffers registered for DMA, etc.
> 
> In the receive direction, the datapath is much less flexible:
> an application can only specify the memory pool to configure
> the receive queue, and nothing more. To extend the receive
> datapath capabilities, it is proposed to add a way to provide
> extended information on how to split the packets being received.
> 
> The following structure is introduced to specify the Rx packet
> segment:
> 
> struct rte_eth_rxseg {
>      struct rte_mempool *mp; /* memory pool to allocate segment from */
>      uint16_t length; /* segment maximal data length,
> 		       	configures "split point" */
>      uint16_t offset; /* data offset from beginning
> 		       	of mbuf data buffer */
>      uint32_t reserved; /* reserved field */
> };
> 
> The segment descriptions are added to the rte_eth_rxconf structure:
>     rx_seg - pointer to the array of segment descriptions; each element
>              describes the memory pool, maximal data length, and initial
>              data offset from the beginning of the data buffer in the mbuf.
>              This array allows specifying different settings for each
>              segment individually.
>     n_seg - number of elements in the array
> 
> If the extended segment descriptions are provided via these new
> fields, the mp parameter of rte_eth_rx_queue_setup() must be
> specified as NULL to avoid ambiguity.
> 
> The new offload flag RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT in device
> capabilities is introduced to allow a PMD to report to the
> application that it supports splitting Rx packets into configurable
> segments. Before invoking the rte_eth_rx_queue_setup() routine,
> the application should check the RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT flag.
> 
> If the Rx queue is configured with these new settings, the packets
> being received will be split into multiple segments pushed to mbufs
> with the specified attributes. The PMD will split the received
> packets into multiple segments according to the specification in
> the description array:
> 
> - the first network buffer will be allocated from the memory pool
>    specified in the first segment description element, the second
>    network buffer from the pool in the second segment description
>    element, and so on. If there are not enough elements to describe
>    the buffer for an entire packet of maximal length, the pool from
>    the last valid element will be used to allocate the buffers for
>    the rest of the segments
> 
> - the offsets from the segment description elements provide the
>    data offset from the buffer beginning, except for the first mbuf -
>    for this one the offset is added to RTE_PKTMBUF_HEADROOM to get
>    the actual offset from the buffer beginning. If there are not
>    enough elements to describe the buffer for an entire packet of
>    maximal length, the offsets for the rest of the segments are
>    assumed to be zero.
> 
> - the data length received into each segment is limited by the
>    length specified in the segment description element. Receiving
>    starts by filling up the first mbuf data buffer; if the specified
>    maximal segment length is reached and data remain (the packet is
>    longer than the buffer in the first mbuf), the following data
>    will be pushed into the next segment up to its own maximal
>    length. If the first two segments are not enough to store all
>    the remaining packet data, the next (third) segment will be
>    engaged, and so on. If the length in a segment description
>    element is zero, the actual buffer size will be deduced from the
>    corresponding memory pool properties. If there are not enough
>    elements to describe the buffer for an entire packet of maximal
>    length, the buffer size will be deduced from the pool of the last
>    valid element for the remaining segments.
> 
> For example, let's suppose we configured the Rx queue with the
> following segments:
>      seg0 - pool0, len0=14B, off0=2
>      seg1 - pool1, len1=20B, off1=128B
>      seg2 - pool2, len2=20B, off2=0B
>      seg3 - pool3, len3=512B, off3=0B
> 
> A 46-byte packet will look like the following:
>      seg0 - 14B long @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - 20B long @ 128 in mbuf from pool1
>      seg2 - 12B long @ 0 in mbuf from pool2
> 
> A 1500-byte packet will look like the following:
>      seg0 - 14B @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0
>      seg1 - 20B @ 128 in mbuf from pool1
>      seg2 - 20B @ 0 in mbuf from pool2
>      seg3 - 512B @ 0 in mbuf from pool3
>      seg4 - 512B @ 0 in mbuf from pool3
>      seg5 - 422B @ 0 in mbuf from pool3
> 
> The offload RTE_ETH_RX_OFFLOAD_SCATTER must be present and
> configured to support the new buffer split feature (if n_seg
> is greater than one).
> 
> The new approach allows splitting the ingress packets into
> multiple parts pushed to memory with different attributes.
> For example, the packet headers can be pushed to the embedded
> data buffers within mbufs, and the application data into
> external buffers attached to mbufs allocated from different
> memory pools. The memory attributes of the split parts may
> differ as well - for example, the application data may be
> pushed into external memory located on a dedicated physical
> device, say a GPU or NVMe. This improves the flexibility of
> the DPDK receive datapath while preserving compatibility with
> the existing API.
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo at nvidia.com>
> ---
>   doc/guides/nics/features.rst             | 15 +++++
>   doc/guides/rel_notes/release_20_11.rst   |  9 +++
>   lib/librte_ethdev/rte_ethdev.c           | 95 ++++++++++++++++++++++++--------
>   lib/librte_ethdev/rte_ethdev.h           | 58 ++++++++++++++++++-
>   lib/librte_ethdev/rte_ethdev_version.map |  1 +

Can you please also update the deprecation notices, to remove this one?

>   5 files changed, 155 insertions(+), 23 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index dd8c955..a45a9e8 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -185,6 +185,21 @@ Supports receiving segmented mbufs.
>   * **[related]    eth_dev_ops**: ``rx_pkt_burst``.
>   
>   
> +.. _nic_features_buffer_split:
> +
> +Buffer Split on Rx
> +------------------
> +
> +Scatters the packets being received on specified boundaries to segmented mbufs.
> +
> +* **[uses]       rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.

It also uses 'rx_seg' and 'rx_nseg' from 'rte_eth_rxconf'; perhaps that can be another line.

> +* **[implements] datapath**: ``Buffer Split functionality``.
> +* **[implements] rte_eth_dev_data**: ``buffer_split``.

What is implemented here?

> +* **[provides]   rte_eth_dev_info**: ``rx_offload_capa:RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT``.
> +* **[provides]   eth_dev_ops**: ``rxq_info_get:buffer_split``.

Is this correct?

> +* **[related] API**: ``rte_eth_rx_queue_setup()``.
> +
> +
>   .. _nic_features_lro:
>   
>   LRO
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index bcc0fc2..f67ec62 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -60,6 +60,12 @@ New Features
>     Added the FEC API which provides functions for query FEC capabilities and
>     current FEC mode from device. Also, API for configuring FEC mode is also provided.
>   
> +* **Introduced extended buffer description for receiving.**
> +
> +  Added the extended Rx queue setup routine providing the individual

This looks wrong for the latest version of the patch.

> +  descriptions for each Rx segment with maximal size, buffer offset and memory
> +  pool to allocate data buffers from.
> +
>   * **Updated Broadcom bnxt driver.**
>   
>     Updated the Broadcom bnxt driver with new features and improvements, including:
> @@ -253,6 +259,9 @@ API Changes
>     As the data of ``uint8_t`` will be truncated when queue number under
>     a TC is greater than 256.
>   
> +* ethdev: Added fields rx_seg and rx_nseg to rte_eth_rxconf structure
> +  to provide extended description of the receiving buffer.
> +
>   * vhost: Moved vDPA APIs from experimental to stable.
>   
>   * rawdev: Added a structure size parameter to the functions
> diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
> index 892c246..7b64a6e 100644
> --- a/lib/librte_ethdev/rte_ethdev.c
> +++ b/lib/librte_ethdev/rte_ethdev.c
> @@ -105,6 +105,9 @@ struct rte_eth_xstats_name_off {
>   #define RTE_RX_OFFLOAD_BIT2STR(_name)	\
>   	{ DEV_RX_OFFLOAD_##_name, #_name }
>   
> +#define RTE_ETH_RX_OFFLOAD_BIT2STR(_name)	\
> +	{ RTE_ETH_RX_OFFLOAD_##_name, #_name }
> +
>   static const struct {
>   	uint64_t offload;
>   	const char *name;
> @@ -128,9 +131,11 @@ struct rte_eth_xstats_name_off {
>   	RTE_RX_OFFLOAD_BIT2STR(SCTP_CKSUM),
>   	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>   	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> +	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>   };
>   
>   #undef RTE_RX_OFFLOAD_BIT2STR
> +#undef RTE_ETH_RX_OFFLOAD_BIT2STR
>   
>   #define RTE_TX_OFFLOAD_BIT2STR(_name)	\
>   	{ DEV_TX_OFFLOAD_##_name, #_name }
> @@ -1770,10 +1775,14 @@ struct rte_eth_dev *
>   		       struct rte_mempool *mp)
>   {
>   	int ret;
> +	uint16_t seg_idx;
>   	uint32_t mbp_buf_size;
>   	struct rte_eth_dev *dev;
>   	struct rte_eth_dev_info dev_info;
>   	struct rte_eth_rxconf local_conf;
> +	const struct rte_eth_rxseg *rx_seg;
> +	struct rte_eth_rxseg seg_single = { .mp = mp};
> +	uint16_t n_seg;
>   	void **rxq;
>   
>   	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -1784,13 +1793,32 @@ struct rte_eth_dev *
>   		return -EINVAL;
>   	}
>   
> -	if (mp == NULL) {
> -		RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> -		return -EINVAL;
> -	}
> -
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_setup, -ENOTSUP);
>   
> +	rx_seg = rx_conf->rx_seg;
> +	n_seg = rx_conf->rx_nseg;
> +	if (rx_seg == NULL) {
> +		/* Exclude ambiguities about segment descrtiptions. */
> +		if (n_seg) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Non empty array with null pointer\n");
> +			return -EINVAL;
> +		}
> +		rx_seg = &seg_single;
> +		n_seg = 1;

Why set 'rx_seg' & 'n_seg'? Why not leave them NULL and 0 when not used?
This way the PMD can do a NULL/0 check and know they are not used.


I think it is better to do an "if (mp == NULL)" check here; 'rx_seg' and 'mp'
should not both be NULL.

> +	} else {
> +		if (n_seg == 0) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Invalid zero descriptions number\n");
> +			return -EINVAL;
> +		}
> +		if (mp) {
> +			RTE_ETHDEV_LOG(ERR,
> +				       "Memory pool duplicated definition\n");
> +			return -EINVAL;
> +		}
> +	}
> +
>   	/*
>   	 * Check the size of the mbuf data buffer.
>   	 * This value must be provided in the private data of the memory pool.
> @@ -1800,23 +1828,48 @@ struct rte_eth_dev *
>   	if (ret != 0)
>   		return ret;
>   
> -	if (mp->private_data_size < sizeof(struct rte_pktmbuf_pool_private)) {
> -		RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> -			mp->name, (int)mp->private_data_size,
> -			(int)sizeof(struct rte_pktmbuf_pool_private));
> -		return -ENOSPC;
> -	}
> -	mbp_buf_size = rte_pktmbuf_data_room_size(mp);
> +	for (seg_idx = 0; seg_idx < n_seg; seg_idx++) {
> +		struct rte_mempool *mpl = rx_seg[seg_idx].mp;
> +		uint32_t length = rx_seg[seg_idx].length;
> +		uint32_t offset = rx_seg[seg_idx].offset;
> +		uint32_t head_room = seg_idx ? 0 : RTE_PKTMBUF_HEADROOM;
>   
> -	if (mbp_buf_size < dev_info.min_rx_bufsize + RTE_PKTMBUF_HEADROOM) {
> -		RTE_ETHDEV_LOG(ERR,
> -			"%s mbuf_data_room_size %d < %d (RTE_PKTMBUF_HEADROOM=%d + min_rx_bufsize(dev)=%d)\n",
> -			mp->name, (int)mbp_buf_size,
> -			(int)(RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize),
> -			(int)RTE_PKTMBUF_HEADROOM,
> -			(int)dev_info.min_rx_bufsize);
> -		return -EINVAL;
> +		if (mpl == NULL) {
> +			RTE_ETHDEV_LOG(ERR, "Invalid null mempool pointer\n");
> +			return -EINVAL;
> +		}
> +
> +		if (mpl->private_data_size <
> +				sizeof(struct rte_pktmbuf_pool_private)) {
> +			RTE_ETHDEV_LOG(ERR, "%s private_data_size %d < %d\n",
> +				mpl->name, (int)mpl->private_data_size,
> +				(int)sizeof(struct rte_pktmbuf_pool_private));
> +			return -ENOSPC;
> +		}
> +
> +		mbp_buf_size = rte_pktmbuf_data_room_size(mpl);
> +		length = length ? length : (mbp_buf_size - head_room);
> +		if (mbp_buf_size < length + offset + head_room) {
> +			RTE_ETHDEV_LOG(ERR,
> +				"%s mbuf_data_room_size %u < %u"
> +				" (segment length=%u + segment offset=%u)\n",
> +				mpl->name, mbp_buf_size,
> +				length + offset, length, offset);
> +			return -EINVAL;
> +		}
>   	}
> +	/* Check the minimal buffer size for the single segment only. */

This is the main branch; what do you think about moving the comment to the
beginning of the for loop above, and adding a comment about testing the
multiple-segment case?

Btw, I have a concern that this single/multi segment naming can cause
confusion with multi-segment packets. Can something else, like "split
package", be used instead of "segment"?

> +	if (mp && (mbp_buf_size < dev_info.min_rx_bufsize +
> +				  RTE_PKTMBUF_HEADROOM)) {
> +		RTE_ETHDEV_LOG(ERR,
> +			       "%s mbuf_data_room_size %u < %u "
> +			       "(RTE_PKTMBUF_HEADROOM=%u + "
> +			       "min_rx_bufsize(dev)=%u)\n",
> +			       mp->name, mbp_buf_size,
> +			       RTE_PKTMBUF_HEADROOM + dev_info.min_rx_bufsize,
> +			       RTE_PKTMBUF_HEADROOM, dev_info.min_rx_bufsize);
> +			return -EINVAL;
> +		}
>   
>   	/* Use default specified by driver, if nb_rx_desc is zero */
>   	if (nb_rx_desc == 0) {
> @@ -1914,8 +1967,6 @@ struct rte_eth_dev *
>   			dev->data->min_rx_buf_size = mbp_buf_size;
>   	}
>   
> -	rte_ethdev_trace_rxq_setup(port_id, rx_queue_id, nb_rx_desc, mp,
> -		rx_conf, ret);

Is this removed intentionally?

>   	return eth_err(port_id, ret);
>   }
>   
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 5bcfbb8..9cf0a03 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -970,6 +970,16 @@ struct rte_eth_txmode {
>   };
>   
>   /**
> + * A structure used to configure an RX packet segment to split.
> + */
> +struct rte_eth_rxseg {
> +	struct rte_mempool *mp; /**< Memory pools to allocate segment from */
> +	uint16_t length; /**< Segment data length, configures split point. */
> +	uint16_t offset; /**< Data offset from beginning of mbuf data buffer */
> +	uint32_t reserved; /**< Reserved field */
> +};
> +
> +/**
>    * A structure used to configure an RX ring of an Ethernet port.
>    */
>   struct rte_eth_rxconf {
> @@ -977,13 +987,23 @@ struct rte_eth_rxconf {
>   	uint16_t rx_free_thresh; /**< Drives the freeing of RX descriptors. */
>   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> +	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * The pointer to the array of segment descriptions, each element
> +	 * describes the memory pool, maximal segment data length, initial
> +	 * data offset from the beginning of data buffer in mbuf. This allow
> +	 * to specify the dedicated properties for each segment in the receiving
> +	 * buffer - pool, buffer offset, maximal segment size. The number of
> +	 * segment descriptions in the array is specified by the rx_nseg
> +	 * field.
> +	 */

What do you think about providing a short description here, and moving the
comment above to the "struct rte_eth_rxseg" struct?

> +	struct rte_eth_rxseg *rx_seg;
>   	/**
>   	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>   	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
>   	 * fields on rte_eth_dev_info structure are allowed to be set.
>   	 */
>   	uint64_t offloads;
> -

unrelated

>   	uint64_t reserved_64s[2]; /**< Reserved for future fields */
>   	void *reserved_ptrs[2];   /**< Reserved for future fields */
>   };
> @@ -1260,6 +1280,7 @@ struct rte_eth_conf {
>   #define DEV_RX_OFFLOAD_SCTP_CKSUM	0x00020000
>   #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>   #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
> +#define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>   
>   #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>   				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> @@ -2027,6 +2048,41 @@ int rte_eth_dev_configure(uint16_t port_id, uint16_t nb_rx_queue,
>    *   No need to repeat any bit in rx_conf->offloads which has already been
>    *   enabled in rte_eth_dev_configure() at port level. An offloading enabled
>    *   at port level can't be disabled at queue level.

Can it be possible to put a kind of marker here, like "@rx_seg & @rx_nseg", to 
clarify what are you talking about.

> + *   The configuration structure also contains the pointer to the array
> + *   of the receiving buffer segment descriptions, each element describes
> + *   the memory pool, maximal segment data length, initial data offset from
> + *   the beginning of data buffer in mbuf. This allow to specify the dedicated
> + *   properties for each segment in the receiving buffer - pool, buffer
> + *   offset, maximal segment size. If RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload
> + *   flag is configured the PMD will split the received packets into multiple
> + *   segments according to the specification in the description array:
> + *   - the first network buffer will be allocated from the memory pool,
> + *     specified in the first segment description element, the second
> + *     network buffer - from the pool in the second segment description
> + *     element and so on. If there is no enough elements to describe
> + *     the buffer for entire packet of maximal length the pool from the last
> + *     valid element will be used to allocate the buffers from for the rest
> + *     of segments.
> + *   - the offsets from the segment description elements will provide the
> + *     data offset from the buffer beginning except the first mbuf - for this
> + *     one the offset is added to the RTE_PKTMBUF_HEADROOM to get actual
> + *     offset from the buffer beginning. If there is no enough elements
> + *     to describe the buffer for entire packet of maximal length the offsets
> + *     for the rest of segment will be supposed to be zero.
> + *   - the data length being received to each segment is limited by the
> + *     length specified in the segment description element. The data receiving
> + *     starts with filling up the first mbuf data buffer, if the specified
> + *     maximal segment length is reached and there are data remaining
> + *     (packet is longer than buffer in the first mbuf) the following data
> + *     will be pushed to the next segment up to its own length. If the first
> + *     two segments is not enough to store all the packet data the next
> + *     (third) segment will be engaged and so on. If the length in the segment
> + *     description element is zero the actual buffer size will be deduced
> + *     from the appropriate memory pool properties. If there is no enough
> + *     elements to describe the buffer for entire packet of maximal length
> + *     the buffer size will be deduced from the pool of the last valid
> + *     element for the all remaining segments.
> + *

I think the first thing the comment should clarify is that if @rx_seg is
provided, 'mb_pool' should be NULL, and if the split Rx feature is not used,
"@rx_seg & @rx_nseg" should be NULL and 0.

Also the above is too wordy and hard to follow. For instance, that "@rx_seg &
@rx_nseg" are only taken into account if the application provides the
'RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT' offload should be clearer to see, etc.
Can you try to simplify it, perhaps moving some of the above comments to
"struct rte_eth_rxseg"?

>    * @param mb_pool
>    *   The pointer to the memory pool from which to allocate *rte_mbuf* network
>    *   memory buffers to populate each descriptor of the receive ring.
> diff --git a/lib/librte_ethdev/rte_ethdev_version.map b/lib/librte_ethdev/rte_ethdev_version.map
> index f8a0945..25f7cee 100644
> --- a/lib/librte_ethdev/rte_ethdev_version.map
> +++ b/lib/librte_ethdev/rte_ethdev_version.map
> @@ -232,6 +232,7 @@ EXPERIMENTAL {
>   	rte_eth_fec_get_capability;
>   	rte_eth_fec_get;
>   	rte_eth_fec_set;
> +

unrelated

