[dpdk-dev] [PATCH v5] net/mlx4: support hardware TSO

Matan Azrad matan at mellanox.com
Mon Jul 9 20:18:17 CEST 2018


Hi Moti

I'm continuing the discussion here even though the next version is already out, just to keep the full discussion in one place.

From: Mordechay Haimovsky
> inline
> 
> > From: Matan Azrad
> > Hi Moti
> >
> > Please see some comments below.
> >
> > From: Mordechay Haimovsky
> > > Implement support for hardware TSO.
> > >
> > > Signed-off-by: Moti Haimovsky <motih at mellanox.com>
...
> > > +	do {
> > > +		/* How many dseg entries do we have in the current TXBB? */
> > > +		nb_segs_txbb = (MLX4_TXBB_SIZE -
> > > +				((uintptr_t)dseg & (MLX4_TXBB_SIZE - 1))) >>
> > > +			       MLX4_SEG_SHIFT;
> > > +		switch (nb_segs_txbb) {
> > > +		default:
> > > +			/* Should never happen. */
> > > +			rte_panic("%p: Invalid number of SGEs(%d) for a TXBB",
> > > +				  (void *)txq, nb_segs_txbb);
> > > +			/* rte_panic never returns. */
> >
> > Since this default case cannot happen because of the calculation
> > above, I think we don't need it.
> > Just "break" if the compiler complains about the missing default case.
> >
> Although "default" is not mandatory in switch case statement it is a good
> practice to have it even just for code clarity.
> so I will keep it there.

But the rte_panic call (and the whole default block) is redundant, and we don't need redundant code in our data path.
You can keep a comment there for clarification if you want.
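Just to make it concrete, a minimal sketch of what I mean (same names as in your patch, not compiled):

	do {
		/*
		 * How many dseg entries do we have in the current TXBB?
		 * The calculation above always yields a value in [1, 4],
		 * so no case other than 1..4 is reachable and no default
		 * block is needed.
		 */
		nb_segs_txbb = (MLX4_TXBB_SIZE -
				((uintptr_t)dseg & (MLX4_TXBB_SIZE - 1))) >>
			       MLX4_SEG_SHIFT;
		switch (nb_segs_txbb) {
		case 4:
			/* ... as in your patch ... */
		case 3:
			/* ... */
		case 2:
			/* ... */
		case 1:
			/* ... */
		}
	} while (nb_segs);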
 

> > > +		case 4:
> > > +			/* Memory region key for this memory pool. */
> > > +			lkey = mlx4_tx_mb2mr(txq, sbuf);
> > > +			if (unlikely(lkey == (uint32_t)-1))
> > > +				goto err;
> > > +			dseg->addr =
> > > +				rte_cpu_to_be_64(rte_pktmbuf_mtod_offset(sbuf,
> > > +									 uintptr_t,
> > > +									 sb_of));
> > > +			dseg->lkey = lkey;
> > > +			/*
> > > +			 * This data segment starts at the beginning of a new
> > > +			 * TXBB, so we need to postpone its byte_count writing
> > > +			 * for later.
> > > +			 */
> > > +			pv[*pv_counter].dseg = dseg;
> > > +			/*
> > > +			 * Zero length segment is treated as inline segment
> > > +			 * with zero data.
> > > +			 */
> > > +			data_len = sbuf->data_len - sb_of;
> >
> > Is there a chance that data_len will be negative? Would it wrap around in this case?
> Since we verify ahead of time that all the L2, L3 and L4 headers reside in
> the same fragment, there is no reason for data_len to become negative. This
> is why I use uint16_t, which is the same data type used in struct rte_mbuf
> for representing data_len, and the same as we do in mlx4_tx_burst_segs.
> 
> > Maybe it is better to change it to int16_t and to replace the next
> > check with:
> > data_len > 0 ? data_len : 0x80000000
> >
> I will keep this the way it is for 2 reasons:
> 1. It seems to me more cumbersome than what I wrote.

OK, you're right here; if it cannot be negative we shouldn't change it :)

> 2. Code-consistency-wise, this is how we also wrote it in mlx4_tx_burst_segs,
>      and what's good there is also good here.

I don't agree; this is really a different case from that one. A lot of the assumptions are different, and the code may reflect that.

> > And I think I found a way to remove the sb_of calculations for each
> > segment:
> >
> > Each segment will create the next segment parameters while only the
> > pre loop calculation for the first segment parameters will calculate
> > the header
> > offset:
> >
> > The parameters: data_len and sb_of.
> >
> > So before the loop:
> > sb_of = tinfo->tso_header_size;
> > data_len = sbuf->data_len - sb_of;
> >
> > And inside the loop (after the check of nb_segs):
> > sb_of = 0;
> > data_len = sbuf->data_len; (of the next sbuf)
> >
> > so each segment calculates the next segment's parameters and we don't
> > need the "- sb_of" calculation per segment.
> >
> NICE :)
> 

Sorry for seeing this only now, but we don't even need the "sb_of = 0" per segment:
We can add one more parameter for the next segment:
addr = rte_pktmbuf_mtod_offset(sbuf, uintptr_t, tinfo->tso_header_size)
before the loop, and
addr = rte_pktmbuf_mtod(sbuf, uintptr_t)
inside the loop,

so finally we save 2 cycles per segment :)
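Putting the whole suggestion together, the loop skeleton could look roughly like this (just a sketch, not compiled, using the names from your patch):

	/* Parameters of the first segment: skip the TSO header. */
	addr = rte_pktmbuf_mtod_offset(sbuf, uintptr_t, tinfo->tso_header_size);
	data_len = sbuf->data_len - tinfo->tso_header_size;
	while (sbuf != NULL) {
		/* ... lkey lookup and byte_count write/stash from data_len,
		 * as in your patch ...
		 */
		dseg->addr = rte_cpu_to_be_64(addr);
		/* Prepare the parameters of the next segment. */
		sbuf = sbuf->next;
		if (sbuf != NULL) {
			addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
			data_len = sbuf->data_len;
		}
		dseg++;
	}

No "- sb_of" and no "sb_of = 0" in the loop body anymore.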
...
> > > +static inline volatile struct mlx4_wqe_data_seg *
> > > +mlx4_tx_burst_fill_tso_hdr(struct rte_mbuf *buf,
> > > +			   struct txq *txq,
> > > +			   struct tso_info *tinfo,
> > > +			   volatile struct mlx4_wqe_ctrl_seg *ctrl)
> > > +{
> > > +	volatile struct mlx4_wqe_lso_seg *tseg =
> > > +		(volatile struct mlx4_wqe_lso_seg *)(ctrl + 1);
> > > +	struct mlx4_sq *sq = &txq->msq;
> > > +	struct pv *pv = tinfo->pv;
> > > +	int *pv_counter = &tinfo->pv_counter;
> > > +	int remain_size = tinfo->tso_header_size;
> > > +	char *from = rte_pktmbuf_mtod(buf, char *);
> > > +	uint16_t txbb_avail_space;
> > > +	/* Union to overcome volatile constraints when copying TSO header. */
> > > +	union {
> > > +		volatile uint8_t *vto;
> > > +		uint8_t *to;
> > > +	} thdr = { .vto = (volatile uint8_t *)tseg->header, };
> > > +
> > > +	/*
> > > +	 * TSO data always starts at offset 20 from the beginning of the TXBB
> > > +	 * (16 byte ctrl + 4byte TSO desc). Since each TXBB is 64Byte aligned
> > > +	 * we can write the first 44 TSO header bytes without worry for TxQ
> > > +	 * wrapping or overwriting the first TXBB 32bit word.
> > > +	 */
> > > +	txbb_avail_space = MLX4_TXBB_SIZE -
> > > +			   (sizeof(struct mlx4_wqe_ctrl_seg) +
> > > +			    sizeof(struct mlx4_wqe_lso_seg));
> >
> > I think that a better name is txbb_tail_size.
> I think that txbb_avail_space is good enough, so no change here.

My suggestion is because this size is only the tail size of the TXBB, without the first 4 bytes, so the name may be more fitting.
I can also understand your choice, since "avail" points to the available TXBB space to write (the first 4B are not available now, only later).
I'm not going to argue about it :)
 
> 
> >
> > > +	while (remain_size >= (int)(txbb_avail_space + sizeof(uint32_t))) {
> > > +		/* Copy to end of txbb. */
> > > +		rte_memcpy(thdr.to, from, txbb_avail_space);
> > > +		from += txbb_avail_space;
> > > +		thdr.to += txbb_avail_space;
> > > +		/* New TXBB, Check for TxQ wrap. */
> > > +		if (thdr.to >= sq->eob)
> > > +			thdr.vto = sq->buf;
> > > +		/* New TXBB, stash the first 32bits for later use. */
> > > +		pv[*pv_counter].dst = (volatile uint32_t *)thdr.to;
> > > +		pv[(*pv_counter)++].val = *(uint32_t *)from,
> > > +		from += sizeof(uint32_t);
> > > +		thdr.to += sizeof(uint32_t);
> > > +		remain_size -= (txbb_avail_space + sizeof(uint32_t));
> >
> > You don't need the () here.
> True
> >
> > > +		/* Avail space in new TXBB is TXBB size - 4 */
> > > +		txbb_avail_space = MLX4_TXBB_SIZE - sizeof(uint32_t);
> > > +	}
> > > +	if (remain_size > txbb_avail_space) {
> > > +		rte_memcpy(thdr.to, from, txbb_avail_space);
> > > +		from += txbb_avail_space;
> > > +		thdr.to += txbb_avail_space;
> > > +		remain_size -= txbb_avail_space;
> > > +		/* New TXBB, Check for TxQ wrap. */
> > > +		if (thdr.to >= sq->eob)
> > > +			thdr.vto = sq->buf;
> > > +		pv[*pv_counter].dst = (volatile uint32_t *)thdr.to;
> > > +		rte_memcpy(&pv[*pv_counter].val, from, remain_size);
> > > +		(*pv_counter)++;
> > > +	} else {
> >
> > Here it should be else if (remain_size > 0), to avoid a redundant
> > zero-length copy in the data path.
> true
> >
> > > +		rte_memcpy(thdr.to, from, remain_size);
> > > +	}
> > > +
> > > +	tseg->mss_hdr_size = rte_cpu_to_be_32((buf->tso_segsz << 16) |
> > > +					      tinfo->tso_header_size);
> > > +	/* Calculate data segment location */
> > > +	return (volatile struct mlx4_wqe_data_seg *)
> > > +				((uintptr_t)tseg + tinfo->wqe_tso_seg_size);
> > > }
> > > +
> > > +/**
> > > + * Write data segments and header for TSO uni/multi segment packet.
> > > + *
> > > + * @param buf
> > > + *   Pointer to the first packet mbuf.
> > > + * @param txq
> > > + *   Pointer to Tx queue structure.
> > > + * @param ctrl
> > > + *   Pointer to the WQE control segment.
> > > + *
> > > + * @return
> > > + *   Pointer to the next WQE control segment on success, NULL
> otherwise.
> > > + */
> > > +static volatile struct mlx4_wqe_ctrl_seg *
> > > +mlx4_tx_burst_tso(struct rte_mbuf *buf, struct txq *txq,
> > > +		  volatile struct mlx4_wqe_ctrl_seg *ctrl)
> > > +{
> > > +	volatile struct mlx4_wqe_data_seg *dseg;
> > > +	volatile struct mlx4_wqe_ctrl_seg *ctrl_next;
> > > +	struct mlx4_sq *sq = &txq->msq;
> > > +	struct tso_info tinfo;
> > > +	struct pv *pv;
> > > +	int pv_counter;
> > > +	int ret;
> > > +
> > > +	ret = mlx4_tx_burst_tso_get_params(buf, txq, &tinfo);
> > > +	if (unlikely(ret))
> > > +		goto error;
> > > +	dseg = mlx4_tx_burst_fill_tso_hdr(buf, txq, &tinfo, ctrl);
> > > +	if (unlikely(dseg == NULL))
> > > +		goto error;
> > > +	if ((uintptr_t)dseg >= (uintptr_t)sq->eob)
> > > +		dseg = (volatile struct mlx4_wqe_data_seg *)
> > > +					((uintptr_t)dseg - sq->size);
> > > +	ctrl_next = mlx4_tx_burst_fill_tso_dsegs(buf, txq, &tinfo, dseg, ctrl);
> > > +	if (unlikely(ctrl_next == NULL))
> > > +		goto error;
> > > +	/* Write the first DWORD of each TXBB saved earlier. */
> > > +	pv = tinfo.pv;
> > > +	pv_counter = tinfo.pv_counter;
> > > +	/* Need a barrier here before writing the first TXBB word. */
> > > +	rte_io_wmb();
> >
> > > +	for (--pv_counter; pv_counter >= 0; pv_counter--)
> >
> > Since we don't need the first check, a do-while statement is better.
> > To be fully safe, you can use a likely() check before the memory barrier.
> >
> Will restore the if statement, but will not change the loop, as it is the same as
> in mlx4_tx_burst_segs and I do want to have consistent code.

I don't agree with this statement, as above: different assumptions lead to different optimized code.

Here, and in the mlx4_tx_burst_segs code, we don't need the first check of the for loop, and we don't need redundant checks in our data path.

So both should be updated.

The difference is in the prior check: here it should be likely() and there it should not.
So actually, there the "for" loop can stay as is, but we don't need the first if check of pv_counter, because it is already covered by the for-loop condition.

I suggest optimizing both (with a prior patch for the mlx4_tx_burst_segs optimization).
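Something like this for the TSO flow (a sketch only):

	pv = tinfo.pv;
	pv_counter = tinfo.pv_counter;
	/* A TSO WQE is likely to span more than one TXBB. */
	if (likely(pv_counter)) {
		/* Need a barrier here before writing the first TXBB word. */
		rte_io_wmb();
		do {
			--pv_counter;
			*pv[pv_counter].dst = pv[pv_counter].val;
		} while (pv_counter > 0);
	}

and the same structure, just without the likely(), in mlx4_tx_burst_segs.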

> > > +		*pv[pv_counter].dst = pv[pv_counter].val;
> > > +	ctrl->fence_size = tinfo.fence_size;
> > > +	sq->remain_size -= tinfo.wqe_size;
> > > +	return ctrl_next;


