[dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine

Yongseok Koh yskoh at mellanox.com
Fri Oct 26 06:22:01 CEST 2018


On Thu, Oct 25, 2018 at 07:37:56AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Tuesday, October 23, 2018 13:06
> > To: Slava Ovsiienko <viacheslavo at mellanox.com>
> > Cc: Shahaf Shuler <shahafs at mellanox.com>; dev at dpdk.org
> > Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation
> > routine
> > 
> > On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
[...]
> > > @@ -2184,13 +2447,16 @@ struct pedit_parser {
> > >   *   Pointer to the list of actions.
> > >   * @param[out] action_flags
> > >   *   Pointer to the detected actions.
> > > + * @param[out] tunnel
> > > + *   Pointer to tunnel encapsulation parameters structure to fill.
> > >   *
> > >   * @return
> > >   *   Maximum size of memory for actions.
> > >   */
> > >  static int
> > >  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> > > -			      uint64_t *action_flags)
> > > +			      uint64_t *action_flags,
> > > +			      void *tunnel)
> > 
> > This func is to get actions and size but you are parsing and filling tunnel info
> > here. It would be better to move parsing to translate() because it anyway has
> > multiple if conditions (same as switch/case) to set TCA_TUNNEL_KEY_ENC_*
> > there.
> Do you mean call of flow_tcf_vxlan_encap_parse(actions, tunnel)?

Yes.

> OK, let's move it to translate stage. Anyway, we need to keep encap structure
> for local/neigh rules.
> 
> > 
> > >  {
> > >  	int size = 0;
> > >  	uint64_t flags = 0;
> > > @@ -2246,6 +2512,29 @@ struct pedit_parser {
> > >  				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID.
> > */
> > >  				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio.
> > */
> > >  			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > */
> > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > +			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > +			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel)
> > +
> > > +				RTE_ALIGN_CEIL /* preceding encap params.
> > */
> > > +				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +				MNL_ALIGNTO);
> > 
> > Is it different from SZ_NLATTR_TYPE_OF(struct
> > mlx5_flow_tcf_vxlan_encap)? Or, use __rte_aligned(MNL_ALIGNTO) instead.
> 
> It is written intentionally in this form. It means that there is struct mlx5_flow_tcf_vxlan_encap 
> at the beginning of buffer. This is not the NL attribute, usage of SZ_NLATTR_TYPE_OF is
> not relevant here. Alignment is needed for the following Netlink message.

Good point. Understood.

> > 
> > > +			flags |= MLX5_ACTION_VXLAN_ENCAP;
> > > +			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > */
> > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > +			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > +			size +=	RTE_ALIGN_CEIL /* preceding decap params.
> > */
> > > +				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +				MNL_ALIGNTO);
> > 
> > Same here.
> > 
> > > +			flags |= MLX5_ACTION_VXLAN_DECAP;
> > > +			break;
> > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> > > @@ -2289,6 +2578,26 @@ struct pedit_parser {  }
> > >
> > >  /**
> > > + * Convert VXLAN VNI to 32-bit integer.
> > > + *
> > > + * @param[in] vni
> > > + *   VXLAN VNI in 24-bit wire format.
> > > + *
> > > + * @return
> > > + *   VXLAN VNI as a 32-bit integer value in network endian.
> > > + */
> > > +static rte_be32_t
> > 
> > make it inline.
> OK. Missed point.
> 
> > 
> > > +vxlan_vni_as_be32(const uint8_t vni[3]) {
> > > +	rte_be32_t ret;
> > 
> > Defining ret as rte_be32_t? The return value of this func which is bswap(ret)
> > is also rte_be32_t??
> Yes. And it is directly stored in the net-endian NL attribute. 
> I've compiled and checked the listing of the function you proposed. It seems to be best, I'll take it.
> 
> > 
> > > +
> > > +	ret = vni[0];
> > > +	ret = (ret << 8) | vni[1];
> > > +	ret = (ret << 8) | vni[2];
> > > +	return RTE_BE32(ret);
> > 
> > Use rte_cpu_to_be_*() instead. But I still don't understand why you shuffle
> > bytes twice. One with shift and or and other by bswap().
> And it works. There are three bytes in very bizarre order (in NL attribute) - 0, vni[0], vni[1], vni[2].
> 
> > 
> > {
> > 	union {
> > 		uint8_t vni[4];
> > 		rte_be32_t dword;
> > 	} ret = {
> > 		.vni = { 0, vni[0], vni[1], vni[2] },
> > 	};
> > 	return ret.dword;
> > }
> > 
> > This will have the same result without extra cost.
> 
> OK. Investigated, it is the best for x86-64. Also I'm going to test it on the ARM 32,
> with various compilers, just curious.
> 
> > 
> > > +}
> > > +
> > > +/**
> > >   * Prepare a flow object for Linux TC flower. It calculates the maximum
> > size of
> > >   * memory required, allocates the memory, initializes Netlink message
> > headers
> > >   * and set unique TC message handle.
> > > @@ -2323,22 +2632,54 @@ struct pedit_parser {
> > >  	struct mlx5_flow *dev_flow;
> > >  	struct nlmsghdr *nlh;
> > >  	struct tcmsg *tcm;
> > > +	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> > > +	uint8_t *sp, *tun = NULL;
> > >
> > >  	size += flow_tcf_get_items_and_size(attr, items, item_flags);
> > > -	size += flow_tcf_get_actions_and_size(actions, action_flags);
> > > -	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> > > +	size += flow_tcf_get_actions_and_size(actions, action_flags,
> > &encap);
> > > +	dev_flow = rte_zmalloc(__func__, size,
> > > +			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> > > +				(size_t)MNL_ALIGNTO));
> > 
> > Why RTE_MAX between the two? Note that it is alignment for start address
> > of the memory and the minimum alignment is cacheline size. On x86, non-
> > zero value less than 64 will have same result as 64.
> 
> OK. Thanks for note.
> It is not expected the structure alignments exceed the cache line size.
> So? Just specify zero?
> > 
> > >  	if (!dev_flow) {
> > >  		rte_flow_error_set(error, ENOMEM,
> > >  				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > >  				   "not enough memory to create E-Switch
> > flow");
> > >  		return NULL;
> > >  	}
> > > -	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> > > +	sp = (uint8_t *)(dev_flow + 1);
> > > +	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> > > +		tun = sp;
> > > +		sp += RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +			MNL_ALIGNTO);
> > 
> > And why should it be aligned? 
> 
> Netlink message should be aligned. It follows the mlx5_flow_tcf_vxlan_encap,
> that's why the pointer is aligned.

Not true. There's no requirement for nl msg buffer alignment. MNL_ALIGNTO is for
mainly size alignment. For example, checkout the source code of
mnl_nlmsg_put_header(void *buf). There's no requirement of aligning the start
address of buf. But, size of any entries (hdr, attr ...) should be aligned to
MNL_ALIGNTO(4).

> 
> As the size of dev_flow might not be aligned, it
> > is meaningless, isn't it? If you think it must be aligned for better performance
> > (not much anyway), you can use __rte_aligned(MNL_ALIGNTO) on the struct
> Hm. Where we can use __rte_aligned? Could you clarify, please.

For example,

struct mlx5_flow_tcf_tunnel_hdr {
	uint32_t type; /**< Tunnel action type. */
	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
	unsigned int ifindex_org; /**< Original dst/src interface */
	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
} __rte_aligned(MNL_ALIGNTO);

A good example is the struct rte_mbuf. If this attribute is used, the size of
the struct will be aligned to the value.

If you still want to make the nl msg aligned,

	dev_flow = rte_zmalloc(..., MNL_ALIGNTO); /* anyway cacheline aligned. */
	tun = RTE_PTR_ALIGN(dev_flow + 1, MNL_ALIGNTO);
	nlh = mnl_nlmsg_put_header(tun);

with adding '__rte_aligned(MNL_ALIGNTO)' to struct mlx5_flow_tcf_vxlan_encap/decap.

Then, nlh will be aligned. You should make sure size is correctly calculated.

> 
> > definition but not for mlx5_flow (it's not only for tcf, have to do it manually).
> > 
> > > +		size -= RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +			MNL_ALIGNTO);
> > 
> > Don't you have to subtract sizeof(struct mlx5_flow) as well? But like I
> > mentioned, if '.nlsize' below isn't needed, you don't need to have this
> > calculation either.
> Yes, it is a bug. Should be fixed. Thank you.
> Let's discuss whether we can keep the nlsize under NDEBUG switch.

I agreed on using NDEBUG for it.

> 
> > 
> > > +		encap.hdr.type =
> > MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> > > +		memcpy(tun, &encap,
> > > +		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
> > > +	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > +		tun = sp;
> > > +		sp += RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +			MNL_ALIGNTO);
> > > +		size -= RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +			MNL_ALIGNTO);
> > > +		encap.hdr.type =
> > MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> > > +		memcpy(tun, &encap,
> > > +		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
> > > +	}
> > > +	nlh = mnl_nlmsg_put_header(sp);
> > >  	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > >  	*dev_flow = (struct mlx5_flow){
> > >  		.tcf = (struct mlx5_flow_tcf){
> > > +			.nlsize = size,
> > >  			.nlh = nlh,
> > >  			.tcm = tcm,
> > > +			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> > > +			.item_flags = *item_flags,
> > > +			.action_flags = *action_flags,
> > >  		},
> > >  	};
> > >  	/*
[...]
> > > @@ -2827,6 +3268,76 @@ struct pedit_parser {
> > >  					(na_vlan_priority) =
> > >  					conf.of_set_vlan_pcp->vlan_pcp;
> > >  			}
> > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > +			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > +			assert(decap.vxlan);
> > > +			assert(dev_flow->tcf.tunnel);
> > > +			dev_flow->tcf.tunnel->ifindex_ptr
> > > +				= (unsigned int *)&tcm->tcm_ifindex;
> > > +			na_act_index =
> > > +				mnl_attr_nest_start(nlh,
> > na_act_index_cur++);
> > > +			assert(na_act_index);
> > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > "tunnel_key");
> > > +			na_act = mnl_attr_nest_start(nlh,
> > TCA_ACT_OPTIONS);
> > > +			assert(na_act);
> > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > +				sizeof(struct tc_tunnel_key),
> > > +				&(struct tc_tunnel_key){
> > > +					.action = TC_ACT_PIPE,
> > > +					.t_action =
> > TCA_TUNNEL_KEY_ACT_RELEASE,
> > > +					});
> > > +			mnl_attr_nest_end(nlh, na_act);
> > > +			mnl_attr_nest_end(nlh, na_act_index);
> > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > +			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > +			assert(encap.vxlan);
> > > +			na_act_index =
> > > +				mnl_attr_nest_start(nlh,
> > na_act_index_cur++);
> > > +			assert(na_act_index);
> > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > "tunnel_key");
> > > +			na_act = mnl_attr_nest_start(nlh,
> > TCA_ACT_OPTIONS);
> > > +			assert(na_act);
> > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > +				sizeof(struct tc_tunnel_key),
> > > +				&(struct tc_tunnel_key){
> > > +					.action = TC_ACT_PIPE,
> > > +					.t_action =
> > TCA_TUNNEL_KEY_ACT_SET,
> > > +					});
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_UDP_DST)
> > > +				mnl_attr_put_u16(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_DST_PORT,
> > > +					 encap.vxlan->udp.dst);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> > > +				mnl_attr_put_u32(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> > > +					 encap.vxlan->ipv4.src);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> > > +				mnl_attr_put_u32(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
> > > +					 encap.vxlan->ipv4.dst);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> > > +				mnl_attr_put(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> > > +					 sizeof(encap.vxlan->ipv6.src),
> > > +					 &encap.vxlan->ipv6.src);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> > > +				mnl_attr_put(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
> > > +					 sizeof(encap.vxlan->ipv6.dst),
> > > +					 &encap.vxlan->ipv6.dst);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> > > +				mnl_attr_put_u32(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_KEY_ID,
> > > +					 vxlan_vni_as_be32
> > > +						(encap.vxlan->vxlan.vni));
> > > +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> > > +			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM,
> > 0); #endif
> > 
> > TCA_TUNNEL_KEY_NO_CSUM is anyway defined like others, then why do
> > you treat it differently with #ifdef/#endif?
> 
> As it was found it is not defined on old kernels, on some our CI machines
> compilation errors occurred.

In your first patch, TCA_TUNNEL_KEY_NO_CSUM is defined if there isn't
HAVE_TC_ACT_TUNNEL_KEY. Actually I'm wondering why it is different from
HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT. It looks like the following is needed -
HAVE_TCA_TUNNEL_KEY_NO_CSUM ??


	#ifdef HAVE_TC_ACT_TUNNEL_KEY

	#include <linux/tc_act/tc_tunnel_key.h>

	#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
	#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
	#endif

	#ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
	#define TCA_TUNNEL_KEY_NO_CSUM 10
	#endif

	#else /* HAVE_TC_ACT_TUNNEL_KEY */


Thanks,
Yongseok



More information about the dev mailing list