[dpdk-dev] [PATCH 2/6] net/mlx5: add framework for switch flow rules

Adrien Mazarguil adrien.mazarguil at 6wind.com
Thu Jul 12 12:46:46 CEST 2018


On Wed, Jul 11, 2018 at 05:59:18PM -0700, Yongseok Koh wrote:
> On Wed, Jun 27, 2018 at 08:08:12PM +0200, Adrien Mazarguil wrote:
> > Because mlx5 switch flow rules are configured through Netlink (TC
> > interface) and have little in common with Verbs, this patch adds a separate
> > parser function to handle them.
> > 
> > - mlx5_nl_flow_transpose() converts a rte_flow rule to its TC equivalent
> >   and stores the result in a buffer.
> > 
> > - mlx5_nl_flow_brand() gives a unique handle to a flow rule buffer.
> > 
> > - mlx5_nl_flow_create() instantiates a flow rule on the device based on
> >   such a buffer.
> > 
> > - mlx5_nl_flow_destroy() performs the reverse operation.
> > 
> > These functions are called by the existing implementation when encountering
> > flow rules which must be offloaded to the switch (currently relying on the
> > transfer attribute).
> > 
> > Signed-off-by: Adrien Mazarguil <adrien.mazarguil at 6wind.com>
> > Signed-off-by: Nelio Laranjeiro <nelio.laranjeiro at 6wind.com>
<snip>
> > diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
> > index 9241855be..93b245991 100644
> > --- a/drivers/net/mlx5/mlx5_flow.c
> > +++ b/drivers/net/mlx5/mlx5_flow.c
> > @@ -4,6 +4,7 @@
> >   */
> >  
> >  #include <sys/queue.h>
> > +#include <stdalign.h>
> >  #include <stdint.h>
> >  #include <string.h>
> >  
> > @@ -271,6 +272,7 @@ struct rte_flow {
> >  	/**< Store tunnel packet type data to store in Rx queue. */
> >  	uint8_t key[40]; /**< RSS hash key. */
> >  	uint16_t (*queue)[]; /**< Destination queues to redirect traffic to. */
> > +	void *nl_flow; /**< Netlink flow buffer if relevant. */
> >  };
> >  
> >  static const struct rte_flow_ops mlx5_flow_ops = {
> > @@ -2403,6 +2405,106 @@ mlx5_flow_actions(struct rte_eth_dev *dev,
> >  }
> >  
> >  /**
> > + * Validate flow rule and fill flow structure accordingly.
> > + *
> > + * @param dev
> > + *   Pointer to Ethernet device.
> > + * @param[out] flow
> > + *   Pointer to flow structure.
> > + * @param flow_size
> > + *   Size of allocated space for @p flow.
> > + * @param[in] attr
> > + *   Flow rule attributes.
> > + * @param[in] pattern
> > + *   Pattern specification (list terminated by the END pattern item).
> > + * @param[in] actions
> > + *   Associated actions (list terminated by the END action).
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   A positive value representing the size of the flow object in bytes
> > + *   regardless of @p flow_size on success, a negative errno value otherwise
> > + *   and rte_errno is set.
> > + */
> > +static int
> > +mlx5_flow_merge_switch(struct rte_eth_dev *dev,
> > +		       struct rte_flow *flow,
> > +		       size_t flow_size,
> > +		       const struct rte_flow_attr *attr,
> > +		       const struct rte_flow_item pattern[],
> > +		       const struct rte_flow_action actions[],
> > +		       struct rte_flow_error *error)
> > +{
> > +	struct priv *priv = dev->data->dev_private;
> > +	unsigned int n = mlx5_domain_to_port_id(priv->domain_id, NULL, 0);
> > +	uint16_t port_list[!n + n];
> > +	struct mlx5_nl_flow_ptoi ptoi[!n + n + 1];
> > +	size_t off = RTE_ALIGN_CEIL(sizeof(*flow), alignof(max_align_t));
> > +	unsigned int i;
> > +	unsigned int own = 0;
> > +	int ret;
> > +
> > +	/* At least one port is needed when no switch domain is present. */
> > +	if (!n) {
> > +		n = 1;
> > +		port_list[0] = dev->data->port_id;
> > +	} else {
> > +		n = mlx5_domain_to_port_id(priv->domain_id, port_list, n);
> > +		if (n > RTE_DIM(port_list))
> > +			n = RTE_DIM(port_list);
> > +	}
> > +	for (i = 0; i != n; ++i) {
> > +		struct rte_eth_dev_info dev_info;
> > +
> > +		rte_eth_dev_info_get(port_list[i], &dev_info);
> > +		if (port_list[i] == dev->data->port_id)
> > +			own = i;
> > +		ptoi[i].port_id = port_list[i];
> > +		ptoi[i].ifindex = dev_info.if_index;
> > +	}
> > +	/* Ensure first entry of ptoi[] is the current device. */
> > +	if (own) {
> > +		ptoi[n] = ptoi[0];
> > +		ptoi[0] = ptoi[own];
> > +		ptoi[own] = ptoi[n];
> > +	}
> > +	/* An entry with zero ifindex terminates ptoi[]. */
> > +	ptoi[n].port_id = 0;
> > +	ptoi[n].ifindex = 0;
> > +	if (flow_size < off)
> > +		flow_size = 0;
> > +	ret = mlx5_nl_flow_transpose((uint8_t *)flow + off,
> > +				     flow_size ? flow_size - off : 0,
> > +				     ptoi, attr, pattern, actions, error);
> > +	if (ret < 0)
> > +		return ret;
> 
> So, there's an assumption that the buffer allocated outside of this API is
> enough to include all the messages in mlx5_nl_flow_transpose(), right? If
> flow_size isn't enough, buf_tmp will be used and _transpose() doesn't return
> error but required size. Sounds confusing, may need to make a change or to have
> clearer documentation.

Well, isn't it already documented? Besides these are the usual snprintf()
semantics used everywhere in these files, I think this was a major topic of
discussion with Nelio on the flow rework series :)

buf_tmp[] is internal to mlx5_nl_flow_transpose() and used as a fallback to
complete a pass when input buffer is not large enough (including
zero-sized). Having a valid buffer is a constraint imposed by libmnl,
because we badly want to know how much space will be needed assuming the
flow rule was successfully processed.

Without libmnl, the helpers it provides would have been written in a way
that doesn't require buf_tmp[]. However libmnl is just too convenient to
pass up, hence this compromise.

(just to remind onlookers, we want to allocate the minimum amount of memory
we possibly can for resources needed by each flow rule, and do so through a
single allocation, end goal being to support millions of flow rules while
wasting as little memory as possible.)

<snip>
> > diff --git a/drivers/net/mlx5/mlx5_nl_flow.c b/drivers/net/mlx5/mlx5_nl_flow.c
> > index 7a8683b03..1fc62fb0a 100644
> > --- a/drivers/net/mlx5/mlx5_nl_flow.c
> > +++ b/drivers/net/mlx5/mlx5_nl_flow.c
> > @@ -5,7 +5,9 @@
> >  
> >  #include <errno.h>
> >  #include <libmnl/libmnl.h>
> > +#include <linux/if_ether.h>
> >  #include <linux/netlink.h>
> > +#include <linux/pkt_cls.h>
> >  #include <linux/pkt_sched.h>
> >  #include <linux/rtnetlink.h>
> >  #include <stdalign.h>
> > @@ -14,11 +16,248 @@
> >  #include <stdlib.h>
> >  #include <sys/socket.h>
> >  
> > +#include <rte_byteorder.h>
> >  #include <rte_errno.h>
> >  #include <rte_flow.h>
> >  
> >  #include "mlx5.h"
> >  
> > +/** Parser state definitions for mlx5_nl_flow_trans[]. */
> > +enum mlx5_nl_flow_trans {
> > +	INVALID,
> > +	BACK,
> > +	ATTR,
> > +	PATTERN,
> > +	ITEM_VOID,
> > +	ACTIONS,
> > +	ACTION_VOID,
> > +	END,
> > +};
> > +
> > +#define TRANS(...) (const enum mlx5_nl_flow_trans []){ __VA_ARGS__, INVALID, }
> > +
> > +#define PATTERN_COMMON \
> > +	ITEM_VOID, ACTIONS
> > +#define ACTIONS_COMMON \
> > +	ACTION_VOID, END
> > +
> > +/** Parser state transitions used by mlx5_nl_flow_transpose(). */
> > +static const enum mlx5_nl_flow_trans *const mlx5_nl_flow_trans[] = {
> > +	[INVALID] = NULL,
> > +	[BACK] = NULL,
> > +	[ATTR] = TRANS(PATTERN),
> > +	[PATTERN] = TRANS(PATTERN_COMMON),
> > +	[ITEM_VOID] = TRANS(BACK),
> > +	[ACTIONS] = TRANS(ACTIONS_COMMON),
> > +	[ACTION_VOID] = TRANS(BACK),
> > +	[END] = NULL,
> > +};
> > +
> > +/**
> > + * Transpose flow rule description to rtnetlink message.
> > + *
> > + * This function transposes a flow rule description to a traffic control
> > + * (TC) filter creation message ready to be sent over Netlink.
> > + *
> > + * Target interface is specified as the first entry of the @p ptoi table.
> > + * Subsequent entries enable this function to resolve other DPDK port IDs
> > + * found in the flow rule.
> > + *
> > + * @param[out] buf
> > + *   Output message buffer. May be NULL when @p size is 0.
> > + * @param size
> > + *   Size of @p buf. Message may be truncated if not large enough.
> > + * @param[in] ptoi
> > + *   DPDK port ID to network interface index translation table. This table
> > + *   is terminated by an entry with a zero ifindex value.
> > + * @param[in] attr
> > + *   Flow rule attributes.
> > + * @param[in] pattern
> > + *   Pattern specification.
> > + * @param[in] actions
> > + *   Associated actions.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   A positive value representing the exact size of the message in bytes
> > + *   regardless of the @p size parameter on success, a negative errno value
> > + *   otherwise and rte_errno is set.
> > + */
> > +int
> > +mlx5_nl_flow_transpose(void *buf,
> > +		       size_t size,
> > +		       const struct mlx5_nl_flow_ptoi *ptoi,
> > +		       const struct rte_flow_attr *attr,
> > +		       const struct rte_flow_item *pattern,
> > +		       const struct rte_flow_action *actions,
> > +		       struct rte_flow_error *error)
> > +{
> > +	alignas(struct nlmsghdr)
> > +	uint8_t buf_tmp[MNL_SOCKET_BUFFER_SIZE];
> > +	const struct rte_flow_item *item;
> > +	const struct rte_flow_action *action;
> > +	unsigned int n;
> > +	struct nlattr *na_flower;
> > +	struct nlattr *na_flower_act;
> > +	const enum mlx5_nl_flow_trans *trans;
> > +	const enum mlx5_nl_flow_trans *back;
> > +
> > +	if (!size)
> > +		goto error_nobufs;
> > +init:
> > +	item = pattern;
> > +	action = actions;
> > +	n = 0;
> > +	na_flower = NULL;
> > +	na_flower_act = NULL;
> > +	trans = TRANS(ATTR);
> > +	back = trans;
> > +trans:
> > +	switch (trans[n++]) {
> > +		struct nlmsghdr *nlh;
> > +		struct tcmsg *tcm;
> > +
> > +	case INVALID:
> > +		if (item->type)
> > +			return rte_flow_error_set
> > +				(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ITEM,
> > +				 item, "unsupported pattern item combination");
> > +		else if (action->type)
> > +			return rte_flow_error_set
> > +				(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ACTION,
> > +				 action, "unsupported action combination");
> > +		return rte_flow_error_set
> > +			(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +			 "flow rule lacks some kind of fate action");
> > +	case BACK:
> > +		trans = back;
> > +		n = 0;
> > +		goto trans;
> > +	case ATTR:
> > +		/*
> > +		 * Supported attributes: no groups, some priorities and
> > +		 * ingress only. Don't care about transfer as it is the
> > +		 * caller's problem.
> > +		 */
> > +		if (attr->group)
> > +			return rte_flow_error_set
> > +				(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ATTR_GROUP,
> > +				 attr, "groups are not supported");
> > +		if (attr->priority > 0xfffe)
> > +			return rte_flow_error_set
> > +				(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY,
> > +				 attr, "lowest priority level is 0xfffe");
> > +		if (!attr->ingress)
> > +			return rte_flow_error_set
> > +				(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ATTR_INGRESS,
> > +				 attr, "only ingress is supported");
> > +		if (attr->egress)
> > +			return rte_flow_error_set
> > +				(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ATTR_INGRESS,
> > +				 attr, "egress is not supported");
> > +		if (size < mnl_nlmsg_size(sizeof(*tcm)))
> > +			goto error_nobufs;
> > +		nlh = mnl_nlmsg_put_header(buf);
> > +		nlh->nlmsg_type = 0;
> > +		nlh->nlmsg_flags = 0;
> > +		nlh->nlmsg_seq = 0;
> > +		tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > +		tcm->tcm_family = AF_UNSPEC;
> > +		tcm->tcm_ifindex = ptoi[0].ifindex;
> > +		/*
> > +		 * Let kernel pick a handle by default. A predictable handle
> > +		 * can be set by the caller on the resulting buffer through
> > +		 * mlx5_nl_flow_brand().
> > +		 */
> > +		tcm->tcm_handle = 0;
> > +		tcm->tcm_parent = TC_H_MAKE(TC_H_INGRESS, TC_H_MIN_INGRESS);
> > +		/*
> > +		 * Priority cannot be zero to prevent the kernel from
> > +		 * picking one automatically.
> > +		 */
> > +		tcm->tcm_info = TC_H_MAKE((attr->priority + 1) << 16,
> > +					  RTE_BE16(ETH_P_ALL));
> > +		break;
> > +	case PATTERN:
> > +		if (!mnl_attr_put_strz_check(buf, size, TCA_KIND, "flower"))
> > +			goto error_nobufs;
> > +		na_flower = mnl_attr_nest_start_check(buf, size, TCA_OPTIONS);
> > +		if (!na_flower)
> > +			goto error_nobufs;
> > +		if (!mnl_attr_put_u32_check(buf, size, TCA_FLOWER_FLAGS,
> > +					    TCA_CLS_FLAGS_SKIP_SW))
> > +			goto error_nobufs;
> > +		break;
> > +	case ITEM_VOID:
> > +		if (item->type != RTE_FLOW_ITEM_TYPE_VOID)
> > +			goto trans;
> > +		++item;
> > +		break;
> > +	case ACTIONS:
> > +		if (item->type != RTE_FLOW_ITEM_TYPE_END)
> > +			goto trans;
> > +		assert(na_flower);
> > +		assert(!na_flower_act);
> > +		na_flower_act =
> > +			mnl_attr_nest_start_check(buf, size, TCA_FLOWER_ACT);
> > +		if (!na_flower_act)
> > +			goto error_nobufs;
> > +		break;
> > +	case ACTION_VOID:
> > +		if (action->type != RTE_FLOW_ACTION_TYPE_VOID)
> > +			goto trans;
> > +		++action;
> > +		break;
> > +	case END:
> > +		if (item->type != RTE_FLOW_ITEM_TYPE_END ||
> > +		    action->type != RTE_FLOW_ACTION_TYPE_END)
> > +			goto trans;
> > +		if (na_flower_act)
> > +			mnl_attr_nest_end(buf, na_flower_act);
> > +		if (na_flower)
> > +			mnl_attr_nest_end(buf, na_flower);
> > +		nlh = buf;
> > +		return nlh->nlmsg_len;
> > +	}
> > +	back = trans;
> > +	trans = mlx5_nl_flow_trans[trans[n - 1]];
> > +	n = 0;
> > +	goto trans;
> > +error_nobufs:
> > +	if (buf != buf_tmp) {
> > +		buf = buf_tmp;
> > +		size = sizeof(buf_tmp);
> > +		goto init;
> > +	}
> 
> Continuing my comment above.
> This part is unclear. It looks to me that this func does:
> 
> 1) if size is zero, consider it as a testing call to know the amount of memory
> required.

Yeah, in fact this one is a shortcut to speed up this specific scenario as
it happens all the time in the two-pass use case. You can lump it with 2).

> 2) if size isn't zero but not enough, it stops writing to buf and start over to
> return the amount of memory required instead of returning error.
> 3) if size isn't zero and enough, it fills in buf.
> 
> Do I correctly understand?

Yes. Another minor note for 2), the returned buffer is also filled up to the
point of failure (mimics snprintf()).

Perhaps the following snippet can better summarize the envisioned approach:

 int ret = snprintf(NULL, 0, "something", ...);

 if (ret < 0) {
     goto court;
 } else {
     char buf[ret];

     snprintf(buf, sizeof(buf), "something", ...); /* Guaranteed. */
     [...]
 }

-- 
Adrien Mazarguil
6WIND


More information about the dev mailing list