[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

Kulasek, TomaszX tomaszx.kulasek at intel.com
Wed Oct 19 17:42:48 CEST 2016


Hi Olivier,

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> Sent: Tuesday, October 18, 2016 16:57
> To: Kulasek, TomaszX <tomaszx.kulasek at intel.com>; dev at dpdk.org
> Cc: Ananyev, Konstantin <konstantin.ananyev at intel.com>;
> thomas.monjalon at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation
> 
> Hi Tomasz,
> 
> I think the principle of tx_prep() is good, it may for instance help to
> remove the function virtio_tso_fix_cksum() from the virtio, and maybe even
> change the mbuf TSO/cksum API.
> 
> I have some questions/comments below, I'm sorry it comes very late.
> 
> On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> > Added API for `rte_eth_tx_prep`
> >
> > uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> > 	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> >
> > Added fields to the `struct rte_eth_desc_lim`:
> >
> > 	uint16_t nb_seg_max;
> > 		/**< Max number of segments per whole packet. */
> >
> > 	uint16_t nb_mtu_seg_max;
> > 		/**< Max number of segments per one MTU */
> 
> Not sure I understand the second one. Is this is case of TSO?
> 
> Is it a usual limitation in different network hardware?
> Can this info be retrieved/used by the application?
> 

Yes, limitation of number of segments may differ depend of TSO/non TSO, e.g. for Fortville NIC. This information is available for application

> >
> > Created `rte_pkt.h` header with common used functions:
> >
> > int rte_validate_tx_offload(struct rte_mbuf *m)
> > 	to validate general requirements for tx offload in packet such a
> > 	flag completness. In current implementation this function is called
> > 	optionaly when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> >
> > int rte_phdr_cksum_fix(struct rte_mbuf *m)
> > 	to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> > 	before hardware tx checksum offload.
> > 	 - for non-TSO tcp/udp packets full pseudo-header checksum is
> > 	   counted and set.
> > 	 - for TSO the IP payload length is not included.
> 
> Why not in rte_net.h?
> 
> 
> > [...]
> >
> > @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t
> queue_id,
> >  	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts,
> > nb_pkts);  }
> >
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet
> device.
> > + *
> > + * The rte_eth_tx_prep() function is invoked to prepare output
> > +packets to be
> > + * transmitted on the output queue *queue_id* of the Ethernet device
> > +designated
> > + * by its *port_id*.
> > + * The *nb_pkts* parameter is the number of packets to be prepared
> > +which are
> > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of
> > +them
> > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > + * For each packet to send, the rte_eth_tx_prep() function performs
> > + * the following operations:
> > + *
> > + * - Check if packet meets devices requirements for tx offloads.
> 
> Do you mean hardware requirements?
> Can the application be aware of these requirements? I mean capability
> flags, or something in dev_infos?

Yes. If some offloads cannot be handled by hardware, it fails. Also if e.g. number of segments is invalid and so on.

> 
> Maybe the comment could be more precise?
> 
> > + * - Check limitations about number of segments.
> > + *
> > + * - Check additional requirements when debug is enabled.
> 
> What kind of additional requirements?
> 

We may assume, that application setts right, e.g. ip header version is required for most of checksum offloads. To not have additional performance overhead, these checks are done only when debug is on.

> > + *
> > + * - Update and/or reset required checksums when tx offload is set for
> packet.
> > + *
> 
> By reading this, I think it may not be clear for the user about what
> should be set in the mbuf. In mbuf API, it is said:
> 
>  * TCP segmentation offload. To enable this offload feature for a
>  * packet to be transmitted on hardware supporting TSO:
>  *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>  *    PKT_TX_TCP_CKSUM)
>  *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
>  *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
>  *    to 0 in the packet
>  *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
>  *  - calculate the pseudo header checksum without taking ip_len in
> account,
>  *    and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
>  *    rte_ipv6_phdr_cksum() that can be used as helpers.
> 
> 
> If I understand well, using tx_prep(), the user will have to do the same
> except writing the IP checksum to 0, and without setting the TCP pseudo
> header checksum, right?
> 

Right, but this header information is still valid for tx_burst operation done without preparation stage.

> 
> > + * The rte_eth_tx_prep() function returns the number of packets ready
> > + to be
> > + * sent. A return value equal to *nb_pkts* means that all packets are
> > + valid and
> > + * ready to be sent.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue through which output packets must
> be
> > + *   sent.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_pkts
> > + *   The address of an array of *nb_pkts* pointers to *rte_mbuf*
> structures
> > + *   which contain the output packets.
> > + * @param nb_pkts
> > + *   The maximum number of packets to process.
> > + * @return
> > + *   The number of packets correct and ready to be sent. The return
> value can be
> > + *   less than the value of the *tx_pkts* parameter when some packet
> doesn't
> > + *   meet devices requirements with rte_errno set appropriately.
> > + */
> 
> Can we add the constraint that invalid packets are left untouched?
> 
> I think most of the time there will be a software fallback in that case,
> so it would be good to ensure that this function does not change the flags
> or the packet data.

In current implementation, if packet is invalid, its data is never modified. Only checks are done. The only exception is when checksum needs to be updated or initialized, but it's done after packet validation.
If we want to use/restore packet in application it didn't should be changed in any way for invalid packets.

> 
> Another thing that could be interesting for the caller is to know the
> reason of the failure. Maybe the different errno types could be detailed
> here. For instance:
> - EINVAL: offload flags are not correctly set (i.e. would fail whatever
>   the hardware)
> - ENOTSUP: the offload feature is not supported by the hardware
> - ...
> 

Ok.

> > +
> > +#ifdef RTE_ETHDEV_TX_PREP
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf
> **tx_pkts,
> > +		uint16_t nb_pkts)
> > +{
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +	if (!dev->tx_pkt_prep)
> > +		return nb_pkts;
> > +
> > +#ifdef RTE_LIBRTE_ETHDEV_DEBUG
> > +	if (queue_id >= dev->data->nb_tx_queues) {
> > +		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
> > +		rte_errno = -EINVAL;
> > +		return 0;
> > +	}
> > +#endif
> 
> Why checking the queue_id but not the port_id?
> 
> Maybe the API comment should also be modified to say that the port_id has
> to be valid, because most ethdev functions do the check.
> 

I can add this check when debug is on to made it more complete, and update an API comment for the default case.

> > +
> > +	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
> > +			tx_pkts, nb_pkts);
> > +}
> > +
> > +#else
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t
> queue_id,
> > +		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts) {
> > +	return nb_pkts;
> > +}
> > +
> > +#endif
> > +
> 
> nit: I wonder if the #else part should be inside the function instead
> (with the proper RTE_SET_USED()), it would avoid to define the prototype
> twice.
> 
> >  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t
> count,
> >  		void *userdata);
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 109e666..cfd6284 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -276,6 +276,15 @@ extern "C" {
> >   */
> >  #define PKT_TX_OUTER_IPV4   (1ULL << 59)
> >
> > +#define PKT_TX_OFFLOAD_MASK (    \
> > +		PKT_TX_IP_CKSUM |        \
> > +		PKT_TX_L4_MASK |         \
> > +		PKT_TX_OUTER_IP_CKSUM |  \
> > +		PKT_TX_TCP_SEG |         \
> > +		PKT_TX_QINQ_PKT |        \
> > +		PKT_TX_VLAN_PKT |        \
> > +		PKT_TX_TUNNEL_MASK)
> > +
> 
> Could you add an API comment?
> 
> > --- /dev/null
> > +++ b/lib/librte_net/rte_pkt.h
> >
> > [...]
> > +/**
> > + * Validate general requirements for tx offload in packet.
> > + */
> 
> The API comment does not have the usual format.
> 
> > +static inline int
> > +rte_validate_tx_offload(struct rte_mbuf *m)
> 
> should be const struct rte_mbuf *m
> 
> > +{
> > +	uint64_t ol_flags = m->ol_flags;
> > +
> > +	/* Does packet set any of available offloads? */
> > +	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> > +		return 0;
> > +
> > +	/* IP checksum can be counted only for IPv4 packet */
> > +	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> > +		return -EINVAL;
> > +
> > +	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> > +		/* IP type not set */
> > +		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> > +			return -EINVAL;
> > +
> > +	if (ol_flags & PKT_TX_TCP_SEG)
> > +		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
> > +		if ((m->tso_segsz == 0) ||
> > +				((ol_flags & PKT_TX_IPV4) && !(ol_flags &
> PKT_TX_IP_CKSUM)))
> > +			return -EINVAL;
> > +
> > +	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> > +	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags &
> PKT_TX_OUTER_IPV4))
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
> 
> It looks this function is only used when RTE_LIBRTE_ETHDEV_DEBUG is set.
> 

Yes, for performance reasons we expect, that application sets these values right. This is the most of "additional checks" mentioned before.

> I'd say this function should go in rte_mbuf.h, because it's purely related
> to mbuf flags, and does not rely on packet data or network headers.
> 
> 
> > +
> > +/**
> > + * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> > +before
> > + * hardware tx checksum.
> > + * For non-TSO tcp/udp packets full pseudo-header checksum is counted
> and set.
> > + * For TSO the IP payload length is not included.
> > + */
> 
> The API comment should be fixed.

Ok.

> 
> > +static inline int
> > +rte_phdr_cksum_fix(struct rte_mbuf *m) {
> > +	struct ipv4_hdr *ipv4_hdr;
> > +	struct ipv6_hdr *ipv6_hdr;
> > +	struct tcp_hdr *tcp_hdr;
> > +	struct udp_hdr *udp_hdr;
> > +	uint64_t ol_flags = m->ol_flags;
> > +	uint64_t inner_l3_offset = m->l2_len;
> > +
> > +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > +
> > +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m-
> >l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr,
> ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO udp */
> > +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr,
> ol_flags);
> > +		}
> > +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > +			(ol_flags & PKT_TX_TCP_SEG)) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m-
> >l3_len);
> > +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +#endif /* _RTE_PKT_H_ */
> >
> 
> The function expects that all the network headers are in the first, and
> that each of them is contiguous.
> 

Yes, I see...

> Also, I had an interesting remark from Stephen [1] on a similar code.
> If the mbuf is a clone, it will modify the data of the direct mbuf, which
> should be read-only. Note that it is likely to happen in a TCP stack,
> because the packet is kept locally in case it has to be retransmitted.
> Cloning a mbuf is more efficient than duplicating it.
> 
> I plan to fix it in virtio code by "uncloning" the headers.
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html
> 
> 
> 
> Regards,
> Olivier

Thanks,
Tomasz


More information about the dev mailing list