[dpdk-dev] [PATCH 2/6] net/mlx5: add framework for switch flow rules

Yongseok Koh yskoh at mellanox.com
Thu Jul 12 20:25:27 CEST 2018


On Thu, Jul 12, 2018 at 12:46:46PM +0200, Adrien Mazarguil wrote:
> On Wed, Jul 11, 2018 at 05:59:18PM -0700, Yongseok Koh wrote:
> > On Wed, Jun 27, 2018 at 08:08:12PM +0200, Adrien Mazarguil wrote:
> > > Because mlx5 switch flow rules are configured through Netlink (TC
> > > interface) and have little in common with Verbs, this patch adds a separate
> > > parser function to handle them.
> > > 
> > > - mlx5_nl_flow_transpose() converts a rte_flow rule to its TC equivalent
> > >   and stores the result in a buffer.
> > > 
> > > - mlx5_nl_flow_brand() gives a unique handle to a flow rule buffer.
> > > 
> > > - mlx5_nl_flow_create() instantiates a flow rule on the device based on
> > >   such a buffer.
> > > 
> > > - mlx5_nl_flow_destroy() performs the reverse operation.
> > > 
> > > These functions are called by the existing implementation when encountering
> > > flow rules which must be offloaded to the switch (currently relying on the
> > > transfer attribute).
> > > 
> > > Signed-off-by: Adrien Mazarguil <adrien.mazarguil at 6wind.com>
> > > Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro at 6wind.com>
> <snip>
> > > diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
> > > index 9241855be..93b245991 100644
> > > --- a/drivers/net/mlx5/mlx5_flow.c
> > > +++ b/drivers/net/mlx5/mlx5_flow.c
> > > @@ -4,6 +4,7 @@
> > >   */
> > >  
> > >  #include <sys/queue.h>
> > > +#include <stdalign.h>
> > >  #include <stdint.h>
> > >  #include <string.h>
> > >  
> > > @@ -271,6 +272,7 @@ struct rte_flow {
> > >  	/**< Store tunnel packet type data to store in Rx queue. */
> > >  	uint8_t key[40]; /**< RSS hash key. */
> > >  	uint16_t (*queue)[]; /**< Destination queues to redirect traffic to. */
> > > +	void *nl_flow; /**< Netlink flow buffer if relevant. */
> > >  };
> > >  
> > >  static const struct rte_flow_ops mlx5_flow_ops = {
> > > @@ -2403,6 +2405,106 @@ mlx5_flow_actions(struct rte_eth_dev *dev,
> > >  }
> > >  
> > >  /**
> > > + * Validate flow rule and fill flow structure accordingly.
> > > + *
> > > + * @param dev
> > > + *   Pointer to Ethernet device.
> > > + * @param[out] flow
> > > + *   Pointer to flow structure.
> > > + * @param flow_size
> > > + *   Size of allocated space for @p flow.
> > > + * @param[in] attr
> > > + *   Flow rule attributes.
> > > + * @param[in] pattern
> > > + *   Pattern specification (list terminated by the END pattern item).
> > > + * @param[in] actions
> > > + *   Associated actions (list terminated by the END action).
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + *   A positive value representing the size of the flow object in bytes
> > > + *   regardless of @p flow_size on success, a negative errno value otherwise
> > > + *   and rte_errno is set.
> > > + */
> > > +static int
> > > +mlx5_flow_merge_switch(struct rte_eth_dev *dev,
> > > +		       struct rte_flow *flow,
> > > +		       size_t flow_size,
> > > +		       const struct rte_flow_attr *attr,
> > > +		       const struct rte_flow_item pattern[],
> > > +		       const struct rte_flow_action actions[],
> > > +		       struct rte_flow_error *error)
> > > +{
> > > +	struct priv *priv = dev->data->dev_private;
> > > +	unsigned int n = mlx5_domain_to_port_id(priv->domain_id, NULL, 0);
> > > +	uint16_t port_list[!n + n];
> > > +	struct mlx5_nl_flow_ptoi ptoi[!n + n + 1];
> > > +	size_t off = RTE_ALIGN_CEIL(sizeof(*flow), alignof(max_align_t));
> > > +	unsigned int i;
> > > +	unsigned int own = 0;
> > > +	int ret;
> > > +
> > > +	/* At least one port is needed when no switch domain is present. */
> > > +	if (!n) {
> > > +		n = 1;
> > > +		port_list[0] = dev->data->port_id;
> > > +	} else {
> > > +		n = mlx5_domain_to_port_id(priv->domain_id, port_list, n);
> > > +		if (n > RTE_DIM(port_list))
> > > +			n = RTE_DIM(port_list);
> > > +	}
> > > +	for (i = 0; i != n; ++i) {
> > > +		struct rte_eth_dev_info dev_info;
> > > +
> > > +		rte_eth_dev_info_get(port_list[i], &dev_info);
> > > +		if (port_list[i] == dev->data->port_id)
> > > +			own = i;
> > > +		ptoi[i].port_id = port_list[i];
> > > +		ptoi[i].ifindex = dev_info.if_index;
> > > +	}
> > > +	/* Ensure first entry of ptoi[] is the current device. */
> > > +	if (own) {
> > > +		ptoi[n] = ptoi[0];
> > > +		ptoi[0] = ptoi[own];
> > > +		ptoi[own] = ptoi[n];
> > > +	}
> > > +	/* An entry with zero ifindex terminates ptoi[]. */
> > > +	ptoi[n].port_id = 0;
> > > +	ptoi[n].ifindex = 0;
> > > +	if (flow_size < off)
> > > +		flow_size = 0;
> > > +	ret = mlx5_nl_flow_transpose((uint8_t *)flow + off,
> > > +				     flow_size ? flow_size - off : 0,
> > > +				     ptoi, attr, pattern, actions, error);
> > > +	if (ret < 0)
> > > +		return ret;
> > 
> > So, there's an assumption that the buffer allocated outside of this API is
> > enough to include all the messages in mlx5_nl_flow_transpose(), right? If
> > flow_size isn't enough, buf_tmp will be used and _transpose() doesn't return
> > error but required size. Sounds confusing, may need to make a change or to have
> > clearer documentation.
> 
> Well, isn't it already documented? Besides these are the usual snprintf()
> semantics used everywhere in these files, I think this was a major topic of
> discussion with Nelio on the flow rework series :)
> 
> buf_tmp[] is internal to mlx5_nl_flow_transpose() and used as a fallback to
> complete a pass when input buffer is not large enough (including
> zero-sized). Having a valid buffer is a constraint imposed by libmnl,
> because we badly want to know how much space will be needed assuming the
> flow rule was successfully processed.
> 
> Without libmnl, the helpers it provides would have been written in a way
> that doesn't require buf_tmp[]. However libmnl is just too convenient to
> pass up, hence this compromise.
> 
> (just to remind onlookers, we want to allocate the minimum amount of memory
> we possibly can for resources needed by each flow rule, and do so through a
> single allocation, end goal being to support millions of flow rules while
> wasting as little memory as possible.)
> 
> <snip>
> > > diff --git a/drivers/net/mlx5/mlx5_nl_flow.c b/drivers/net/mlx5/mlx5_nl_flow.c
> > > index 7a8683b03..1fc62fb0a 100644
> > > --- a/drivers/net/mlx5/mlx5_nl_flow.c
> > > +++ b/drivers/net/mlx5/mlx5_nl_flow.c
> > > @@ -5,7 +5,9 @@
> > >  
> > >  #include <errno.h>
> > >  #include <libmnl/libmnl.h>
> > > +#include <linux/if_ether.h>
> > >  #include <linux/netlink.h>
> > > +#include <linux/pkt_cls.h>
> > >  #include <linux/pkt_sched.h>
> > >  #include <linux/rtnetlink.h>
> > >  #include <stdalign.h>
> > > @@ -14,11 +16,248 @@
> > >  #include <stdlib.h>
> > >  #include <sys/socket.h>
> > >  
> > > +#include <rte_byteorder.h>
> > >  #include <rte_errno.h>
> > >  #include <rte_flow.h>
> > >  
> > >  #include "mlx5.h"
> > >  
> > > +/** Parser state definitions for mlx5_nl_flow_trans[]. */
> > > +enum mlx5_nl_flow_trans {
> > > +	INVALID,
> > > +	BACK,
> > > +	ATTR,
> > > +	PATTERN,
> > > +	ITEM_VOID,
> > > +	ACTIONS,
> > > +	ACTION_VOID,
> > > +	END,
> > > +};
> > > +
> > > +#define TRANS(...) (const enum mlx5_nl_flow_trans []){ __VA_ARGS__, INVALID, }
> > > +
> > > +#define PATTERN_COMMON \
> > > +	ITEM_VOID, ACTIONS
> > > +#define ACTIONS_COMMON \
> > > +	ACTION_VOID, END
> > > +
> > > +/** Parser state transitions used by mlx5_nl_flow_transpose(). */
> > > +static const enum mlx5_nl_flow_trans *const mlx5_nl_flow_trans[] = {
> > > +	[INVALID] = NULL,
> > > +	[BACK] = NULL,
> > > +	[ATTR] = TRANS(PATTERN),
> > > +	[PATTERN] = TRANS(PATTERN_COMMON),
> > > +	[ITEM_VOID] = TRANS(BACK),
> > > +	[ACTIONS] = TRANS(ACTIONS_COMMON),
> > > +	[ACTION_VOID] = TRANS(BACK),
> > > +	[END] = NULL,
> > > +};
> > > +
> > > +/**
> > > + * Transpose flow rule description to rtnetlink message.
> > > + *
> > > + * This function transposes a flow rule description to a traffic control
> > > + * (TC) filter creation message ready to be sent over Netlink.
> > > + *
> > > + * Target interface is specified as the first entry of the @p ptoi table.
> > > + * Subsequent entries enable this function to resolve other DPDK port IDs
> > > + * found in the flow rule.
> > > + *
> > > + * @param[out] buf
> > > + *   Output message buffer. May be NULL when @p size is 0.
> > > + * @param size
> > > + *   Size of @p buf. Message may be truncated if not large enough.
> > > + * @param[in] ptoi
> > > + *   DPDK port ID to network interface index translation table. This table
> > > + *   is terminated by an entry with a zero ifindex value.
> > > + * @param[in] attr
> > > + *   Flow rule attributes.
> > > + * @param[in] pattern
> > > + *   Pattern specification.
> > > + * @param[in] actions
> > > + *   Associated actions.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + *   A positive value representing the exact size of the message in bytes
> > > + *   regardless of the @p size parameter on success, a negative errno value
> > > + *   otherwise and rte_errno is set.
> > > + */
> > > +int
> > > +mlx5_nl_flow_transpose(void *buf,
> > > +		       size_t size,
> > > +		       const struct mlx5_nl_flow_ptoi *ptoi,
> > > +		       const struct rte_flow_attr *attr,
> > > +		       const struct rte_flow_item *pattern,
> > > +		       const struct rte_flow_action *actions,
> > > +		       struct rte_flow_error *error)
> > > +{
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf_tmp[MNL_SOCKET_BUFFER_SIZE];
> > > +	const struct rte_flow_item *item;
> > > +	const struct rte_flow_action *action;
> > > +	unsigned int n;
> > > +	struct nlattr *na_flower;
> > > +	struct nlattr *na_flower_act;
> > > +	const enum mlx5_nl_flow_trans *trans;
> > > +	const enum mlx5_nl_flow_trans *back;
> > > +
> > > +	if (!size)
> > > +		goto error_nobufs;
> > > +init:
> > > +	item = pattern;
> > > +	action = actions;
> > > +	n = 0;
> > > +	na_flower = NULL;
> > > +	na_flower_act = NULL;
> > > +	trans = TRANS(ATTR);
> > > +	back = trans;
> > > +trans:
> > > +	switch (trans[n++]) {
> > > +		struct nlmsghdr *nlh;
> > > +		struct tcmsg *tcm;
> > > +
> > > +	case INVALID:
> > > +		if (item->type)
> > > +			return rte_flow_error_set
> > > +				(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ITEM,
> > > +				 item, "unsupported pattern item combination");
> > > +		else if (action->type)
> > > +			return rte_flow_error_set
> > > +				(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ACTION,
> > > +				 action, "unsupported action combination");
> > > +		return rte_flow_error_set
> > > +			(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 "flow rule lacks some kind of fate action");
> > > +	case BACK:
> > > +		trans = back;
> > > +		n = 0;
> > > +		goto trans;
> > > +	case ATTR:
> > > +		/*
> > > +		 * Supported attributes: no groups, some priorities and
> > > +		 * ingress only. Don't care about transfer as it is the
> > > +		 * caller's problem.
> > > +		 */
> > > +		if (attr->group)
> > > +			return rte_flow_error_set
> > > +				(error, ENOTSUP,
> > > +				 RTE_FLOW_ERROR_TYPE_ATTR_GROUP,
> > > +				 attr, "groups are not supported");
> > > +		if (attr->priority > 0xfffe)
> > > +			return rte_flow_error_set
> > > +				(error, ENOTSUP,
> > > +				 RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY,
> > > +				 attr, "lowest priority level is 0xfffe");
> > > +		if (!attr->ingress)
> > > +			return rte_flow_error_set
> > > +				(error, ENOTSUP,
> > > +				 RTE_FLOW_ERROR_TYPE_ATTR_INGRESS,
> > > +				 attr, "only ingress is supported");
> > > +		if (attr->egress)
> > > +			return rte_flow_error_set
> > > +				(error, ENOTSUP,
> > > +				 RTE_FLOW_ERROR_TYPE_ATTR_INGRESS,
> > > +				 attr, "egress is not supported");
> > > +		if (size < mnl_nlmsg_size(sizeof(*tcm)))
> > > +			goto error_nobufs;
> > > +		nlh = mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type = 0;
> > > +		nlh->nlmsg_flags = 0;
> > > +		nlh->nlmsg_seq = 0;
> > > +		tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > > +		tcm->tcm_family = AF_UNSPEC;
> > > +		tcm->tcm_ifindex = ptoi[0].ifindex;
> > > +		/*
> > > +		 * Let kernel pick a handle by default. A predictable handle
> > > +		 * can be set by the caller on the resulting buffer through
> > > +		 * mlx5_nl_flow_brand().
> > > +		 */
> > > +		tcm->tcm_handle = 0;
> > > +		tcm->tcm_parent = TC_H_MAKE(TC_H_INGRESS, TC_H_MIN_INGRESS);
> > > +		/*
> > > +		 * Priority cannot be zero to prevent the kernel from
> > > +		 * picking one automatically.
> > > +		 */
> > > +		tcm->tcm_info = TC_H_MAKE((attr->priority + 1) << 16,
> > > +					  RTE_BE16(ETH_P_ALL));
> > > +		break;
> > > +	case PATTERN:
> > > +		if (!mnl_attr_put_strz_check(buf, size, TCA_KIND, "flower"))
> > > +			goto error_nobufs;
> > > +		na_flower = mnl_attr_nest_start_check(buf, size, TCA_OPTIONS);
> > > +		if (!na_flower)
> > > +			goto error_nobufs;
> > > +		if (!mnl_attr_put_u32_check(buf, size, TCA_FLOWER_FLAGS,
> > > +					    TCA_CLS_FLAGS_SKIP_SW))
> > > +			goto error_nobufs;
> > > +		break;
> > > +	case ITEM_VOID:
> > > +		if (item->type != RTE_FLOW_ITEM_TYPE_VOID)
> > > +			goto trans;
> > > +		++item;
> > > +		break;
> > > +	case ACTIONS:
> > > +		if (item->type != RTE_FLOW_ITEM_TYPE_END)
> > > +			goto trans;
> > > +		assert(na_flower);
> > > +		assert(!na_flower_act);
> > > +		na_flower_act =
> > > +			mnl_attr_nest_start_check(buf, size, TCA_FLOWER_ACT);
> > > +		if (!na_flower_act)
> > > +			goto error_nobufs;
> > > +		break;
> > > +	case ACTION_VOID:
> > > +		if (action->type != RTE_FLOW_ACTION_TYPE_VOID)
> > > +			goto trans;
> > > +		++action;
> > > +		break;
> > > +	case END:
> > > +		if (item->type != RTE_FLOW_ITEM_TYPE_END ||
> > > +		    action->type != RTE_FLOW_ACTION_TYPE_END)
> > > +			goto trans;
> > > +		if (na_flower_act)
> > > +			mnl_attr_nest_end(buf, na_flower_act);
> > > +		if (na_flower)
> > > +			mnl_attr_nest_end(buf, na_flower);
> > > +		nlh = buf;
> > > +		return nlh->nlmsg_len;
> > > +	}
> > > +	back = trans;
> > > +	trans = mlx5_nl_flow_trans[trans[n - 1]];
> > > +	n = 0;
> > > +	goto trans;
> > > +error_nobufs:
> > > +	if (buf != buf_tmp) {
> > > +		buf = buf_tmp;
> > > +		size = sizeof(buf_tmp);
> > > +		goto init;
> > > +	}
> > 
> > Continuing my comment above.
> > This part is unclear. It looks to me that this func does:
> > 
> > 1) if size is zero, consider it as a testing call to know the amount of memory
> > required.
> 
> Yeah, in fact this one is a shortcut to speed up this specific scenario as
> it happens all the time in the two-pass use case. You can lump it with 2).
> 
> > 2) if size isn't zero but not enough, it stops writing to buf and start over to
> > return the amount of memory required instead of returning error.
> > 3) if size isn't zero and enough, it fills in buf.
> > 
> > Do I correctly understand?
> 
> Yes. Another minor note for 2), the returned buffer is also filled up to the
> point of failure (mimics snprintf()).
> 
> Perhaps the following snippet can better summarize the envisioned approach:
> 
>  int ret = snprintf(NULL, 0, "something", ...);
> 
>  if (ret < 0) {
>      goto court;
>  } else {
>      char buf[ret];
> 
>      snprintf(buf, sizeof(buf), "something", ...); /* Guaranteed. */
>      [...]
>  }

I know you and Nelio mimicked snprintf() but as _merge() isn't a public API for
users but an internal API. I didn't think it should necessarily be like that. I
hoped to have it used either for testing (knowing size) or real translation - 1)
and 3). And no possibility for 2), then 2) would've been handled by assert().
I beleive this could've made the code simpler.

However, as I already acked Nelio's patchset, agreed on the idea and Nelio
already documented the behavior,

Acked-by: Yongseok Koh <yskoh at mellanox.com>

Thanks


More information about the dev mailing list