[dpdk-dev] [RFC PATCH 1/1] net/mlx5: add vectorized Rx/Tx burst for ARM

Yongseok Koh yskoh at mellanox.com
Tue Sep 12 01:28:07 CEST 2017


Hi Nelio,

Sorry for delayed reply. I was on vacation for a week.

On Mon, Sep 04, 2017 at 02:37:05PM +0200, Nélio Laranjeiro wrote:
> Hi Yongseok,
> 
> Some questions/comments below,
> 
> On Fri, Aug 25, 2017 at 11:40:23AM -0700, Yongseok Koh wrote:
> > New Rx/Tx burst functions are added using NEON vector instructions for ARM
> > CPU.
> > 
> > Signed-off-by: Yongseok Koh <yskoh at mellanox.com>
[...]
> > @@ -1542,7 +1542,7 @@ priv_select_rx_function(struct priv *priv)
> >  	if (priv_check_vec_rx_support(priv) > 0) {
> >  		priv_prep_vec_rx_function(priv);
> >  		priv->dev->rx_pkt_burst = mlx5_rx_burst_vec;
> > -		DEBUG("selected RX vectorized function");
> > +		WARN("selected RX vectorized function");
> 
> This should remain in DEBUG level.
My bad, that was just for debugging purpose while I was testing the code. Will
change it back when I send out a patchset for integration.

> >  	} else {
> >  		priv->dev->rx_pkt_burst = mlx5_rx_burst;
> >  	}
> > diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
> > index 608072f7e..01e95b466 100644
> > --- a/drivers/net/mlx5/mlx5_prm.h
> > +++ b/drivers/net/mlx5/mlx5_prm.h
> > @@ -224,6 +224,20 @@ struct mlx5_mpw {
> >  };
> >  
> >  /* CQ element structure - should be equal to the cache line size */
> > +#if 0
> > +struct mlx5_cqe { // 16B
> > +       uint16_t hdr_type_etc;
> > +       uint8_t pkt_info;
> > +       uint8_t sop_drop_qpn; /* flow_tag */
> > +       uint16_t byte_cnt;
> > +       uint16_t vlan_info;
> > +       uint32_t rx_hash_res;
> > +       uint8_t timestamp;
> > +       uint8_t wqe_counter;
> > +       uint8_t rsvd4;
> > +       uint8_t op_own;
> > +};
> 
> Seems this structure will never be used due to the #if 0.
This code was just for SW emulation to emulate the result of 16B CQE size. I
should've trimmed all the debugging/testing code before I sent out this RFC.

[...]
> > @@ -1064,6 +1119,10 @@ rxq_ctrl_setup(struct rte_eth_dev *dev, struct rxq_ctrl *rxq_ctrl,
> >  		      (void *)dev, strerror(ret));
> >  		goto error;
> >  	}
> > +#ifdef SW_EMULATION
> > +	rxq_cqe_comp_en = priv->cqe_comp;
> > +	emulate_rxq_cqe_setup(&tmpl);
> > +#endif
> >  	/* Reuse buffers from original queue if possible. */
> >  	if (rxq_ctrl->rxq.elts_n) {
> >  		assert(1 << rxq_ctrl->rxq.elts_n == desc);
> > @@ -1092,7 +1151,9 @@ rxq_ctrl_setup(struct rte_eth_dev *dev, struct rxq_ctrl *rxq_ctrl,
> >  	/* Update doorbell counter. */
> >  	rxq_ctrl->rxq.rq_ci = desc >> rxq_ctrl->rxq.sges_n;
> >  	rte_wmb();
> > +#ifndef SW_EMULATION
> >  	*rxq_ctrl->rxq.rq_db = htonl(rxq_ctrl->rxq.rq_ci);
> > +#endif
> >  	DEBUG("%p: rxq updated with %p", (void *)rxq_ctrl, (void *)&tmpl);
> >  	assert(ret == 0);
> >  	return 0;
> 
> What is the purpose of this SW_EMULATION?
That's to emulate packet Rx without getting device involved. And it was to
optimize cycle budget from pure SW perspective. This will get removed in a
formal patch. Please understand that it is RFC code.

> > diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > index 7de1d1086..6aae00b77 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > @@ -602,11 +602,12 @@ mlx5_tx_dbrec(struct txq *txq, volatile struct mlx5_wqe *wqe)
> >  	uint64_t *dst = (uint64_t *)((uintptr_t)txq->bf_reg);
> >  	volatile uint64_t *src = ((volatile uint64_t *)wqe);
> >  
> > -	rte_wmb();
> > +	rte_compiler_barrier();
> >  	*txq->qp_db = htonl(txq->wqe_ci);
> >  	/* Ensure ordering between DB record and BF copy. */
> >  	rte_wmb();
> >  	*dst = *src;
> > +	rte_wmb();
> >  }
> 
> Is this better to have the rte_compiler_barrier() instead of the rte_io_wmb()?
You are right. I'm also aware of Shahaf's patch - "net/mlx5: replace memory
barrier type". This code block was to include Shahaf's patch for testing and
I've already included Shahaf's final patch. As his patch's been merged, will
remove this hunk in a formal patch.

[...]
> > +/* Verbs header. */
> > +/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
> > +#ifdef PEDANTIC
> > +#pragma GCC diagnostic ignored "-Wpedantic"
> > +#endif
> > +#include <infiniband/verbs.h>
> > +#include <infiniband/mlx5_hw.h>
> > +#include <infiniband/arch.h>
> > +#ifdef PEDANTIC
> > +#pragma GCC diagnostic error "-Wpedantic"
> > +#endif
>
> Should this patch be included before the upstream re-work?
No, after the rework. Will change it properly then.

[...]
> > +		vst1q_u8(dseg, desc);
> > +#ifdef MLX5_PMD_SOFT_COUNTERS
> > +		tx_byte += DATA_LEN(pkt);
> > +#endif
> > +	}
> > +#ifdef MLX5_PMD_SOFT_COUNTERS
> > +	txq->stats.obytes += tx_byte;
> > +#endif
> > +}
> > +
> > +#if 0
> 
> #if 0?
> 
> It does not help to read an RFC with embed code blocks not even compiled.
Right. That was my mistake in rushing to meet the submission deadline. Sorry for
the mess. Now, I've completed porting the commented part and I'll be able to
send out a formal patchset soon.

[...]
> > +		__asm__ volatile (
> > +		/* A.1 load mCQEs into a 128bit register. */
> > +		"ld1 {v16.16b - v17.16b}, [%[mcq]]\n\t"
> > +		/* B.1 store rearm data to mbuf. */
> > +		"st1 {%[rearm].2d}, [%[e0]]\n\t"
> > +		"add %[e0], %[e0], #16\n\t"
> > +		"st1 {%[rearm].2d}, [%[e1]]\n\t"
> > +		"add %[e1], %[e1], #16\n\t"
> > +		/* C.1 combine data from mCQEs with rx_descriptor_fields1. */
> > +		"tbl v18.16b, {v16.16b}, %[mcqe_shuf_m1].16b\n\t"
> > +		"tbl v19.16b, {v16.16b}, %[mcqe_shuf_m2].16b\n\t"
> > +		"sub v18.8h, v18.8h, %[crc_adj].8h\n\t"
> > +		"sub v19.8h, v19.8h, %[crc_adj].8h\n\t"
> > +		"orr v18.16b, v18.16b, %[rxdf].16b\n\t"
> > +		"orr v19.16b, v19.16b, %[rxdf].16b\n\t"
> > +		/* D.1 store rx_descriptor_fields1. */
> > +		"st1 {v18.2d}, [%[e0]]\n\t"
> > +		"st1 {v19.2d}, [%[e1]]\n\t"
> > +		/* B.1 store rearm data to mbuf. */
> > +		"st1 {%[rearm].2d}, [%[e2]]\n\t"
> > +		"add %[e2], %[e2], #16\n\t"
> > +		"st1 {%[rearm].2d}, [%[e3]]\n\t"
> > +		"add %[e3], %[e3], #16\n\t"
> > +		/* C.1 combine data from mCQEs with rx_descriptor_fields1. */
> > +		"tbl v18.16b, {v17.16b}, %[mcqe_shuf_m1].16b\n\t"
> > +		"tbl v19.16b, {v17.16b}, %[mcqe_shuf_m2].16b\n\t"
> > +		"sub v18.8h, v18.8h, %[crc_adj].8h\n\t"
> > +		"sub v19.8h, v19.8h, %[crc_adj].8h\n\t"
> > +		"orr v18.16b, v18.16b, %[rxdf].16b\n\t"
> > +		"orr v19.16b, v19.16b, %[rxdf].16b\n\t"
> > +		/* D.1 store rx_descriptor_fields1. */
> > +		"st1 {v18.2d}, [%[e2]]\n\t"
> > +		"st1 {v19.2d}, [%[e3]]\n\t"
> > +#ifdef MLX5_PMD_SOFT_COUNTERS
> > +		"tbl %[byte_cnt].8b, {v16.16b - v17.16b}, %[len_shuf_m].8b\n\t"
> > +#endif
> > +		:[byte_cnt]"=&w"(byte_cnt)
> > +		:[mcq]"r"(p), [rxdf]"w"(rxdf), [rearm]"w"(rearm),
> > +		 [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0),
> > +		 [mcqe_shuf_m1]"w"(mcqe_shuf_m1),
> > +		 [mcqe_shuf_m2]"w"(mcqe_shuf_m2),
> > +		 [crc_adj]"w"(crc_adj), [len_shuf_m]"w"(len_shuf_m)
> > +		:"memory", "v16", "v17", "v18", "v19");
> 
> Is not there a better way instead of writing all those assembly instructions,
> maybe by using a set of macros?
[...]
> It follows the same algorithm as for x86 side, unless the fact you seem to
> need a lot of assembly instruction due to the missing instrasics I suppose.
Yes, the main reason for this inline assembly was due to lack of load
instrinsics - vld1q_u64_x3() and inefficiency of a shuffle intrinsic
vqtbl3q_u8() in the gcc-5.4.0-6ubuntu1~16.04.4.

I'm quite sure that using vld1q_u64_x3() is more efficient than three
vld1q_u64() although vld1q_u64_x3() internally executes 3 micro-ops in NEON.
This is used to load the first three 16byte blocks from a CQE.

And in vqtbl3q_u8() of arm_neon.h, it uses a redundant load instruction to load
data into 3 consecutive SIMD registers. This caused hotspots in perf-annotate.

> Do you already have some good feedbacks in performance?
Currently, I'm seeing 45% improvement comparing with regular rx/tx_burst in case
of single core. Also, when I firstly wrote the code, I could get ~20% benefit by
removing the hottest instructions with inline assembly.

I also concerned about difficulty for maintenance and code reading, so I tried
to minimize the assembly part as much as possible and it is limited to
fetching/manipulating CQE data on Rx, which is most critical in performance.

Thanks,
Yongseok


More information about the dev mailing list