[dpdk-dev] [PATCH v3 2/2] ethdev: add hierarchical scheduler API

Jerin Jacob jerin.jacob at caviumnetworks.com
Fri Apr 7 15:20:27 CEST 2017


-----Original Message-----
> Date: Sat, 4 Mar 2017 01:10:20 +0000
> From: Cristian Dumitrescu <cristian.dumitrescu at intel.com>
> To: dev at dpdk.org
> CC: thomas.monjalon at 6wind.com, jerin.jacob at caviumnetworks.com,
>  balasubramanian.manoharan at cavium.com, hemant.agrawal at nxp.com,
>  shreyansh.jain at nxp.com
> Subject: [PATCH v3 2/2] ethdev: add hierarchical scheduler API
> X-Mailer: git-send-email 2.5.0
> 

Thanks Cristian for v3.

From the Cavium HW perspective, v3 is in relatively good shape for us to consume,
except for the two points mentioned below.

1) We strongly believe that the application explicitly needs to give the level id
while creating the topology (i.e. rte_tm_node_add()). The reasons are:

- On the capability side, we are already exposing nr_levels etc.
- I think _all_ the HW implementations expect nodes to be connected from
level n to level n-1. IMO, it is better to express that in the API.
- SW implementations that don't care about the specific level id for the
  connection can ignore the level id passed by the application.
  Let the definition be "level" aware.

2) There are a lot of capabilities in the TM definition. I don't have a strong
opinion here, as the TM stuff is in the control path.

So except for point (1), generally we are fine with v3.

Detailed comments below,

> +
> +#ifndef __INCLUDE_RTE_TM_H__
> +#define __INCLUDE_RTE_TM_H__
> +
> +/**
> + * @file
> + * RTE Generic Traffic Manager API
> + *
> + * This interface provides the ability to configure the traffic manager in a
> + * generic way. It includes features such as: hierarchical scheduling,
> + * traffic shaping, congestion management, packet marking, etc.
> + */

Fix missing API documentation doxygen hooks.

Files: doc/api/doxy-api-index.md and doc/api/doxy-api.conf.
Ref:
http://dpdk.org/browse/dpdk/commit/?id=71f2384328651dced05eceee87119a71f0cf16a7 


> +
> +#include <stdint.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/** Ethernet framing overhead
> + *
> + * Overhead fields per Ethernet frame:
> + * 1. Preamble:                                            7 bytes;
> + * 2. Start of Frame Delimiter (SFD):                      1 byte;
> + * 3. Inter-Frame Gap (IFG):                              12 bytes.
> + */
> +#define RTE_TM_ETH_FRAMING_OVERHEAD                  20

This definition is not used anywhere.

> +/**
> + * Color
> + */
> +enum rte_tm_color {
> +	RTE_TM_GREEN = 0, /**< Green */

Explicit zero assignment is not required.

> +	RTE_TM_YELLOW, /**< Yellow */
> +	RTE_TM_RED, /**< Red */
> +	RTE_TM_COLORS /**< Number of colors */
> +};
> +
> +/**
> + * Node statistics counter type
> + */
> +enum rte_tm_stats_type {
> +	/**< Number of packets scheduled from current node. */
> +	RTE_TM_STATS_N_PKTS = 1 << 0,
> +
> +	RTE_TM_STATS_N_PKTS_QUEUED = 1 << 8,
> +
> +	/**< Number of bytes currently waiting in the packet queue of current
> +	 * leaf node.
> +	 */
> +	RTE_TM_STATS_N_BYTES_QUEUED = 1 << 9,

I think, for bitmasks, it is better to add the ULL suffix.
Example:
        RTE_TM_STATS_N_BYTES_QUEUED = 1ULL << 9,
> +};

I think the above definitions are used as "uint64_t stats_mask" in the
remaining sections. How about changing to "enum rte_tm_stats_type stats_mask"
instead of "uint64_t stats_mask"?

> +
> +/**
> + * Traffic manager dynamic updates
> + */
> +enum rte_tm_dynamic_update_type {
> +	/**< Dynamic parent node update. The new parent node is located on same
> +	 * hierarchy level as the former parent node. Consequently, the node
> +	 * whose parent is changed preserves its hierarchy level.
> +	 */
> +	/**< Dynamic update of the set of enabled stats counter types. */
> +	RTE_TM_UPDATE_NODE_STATS = 1 << 5,
> +
> +	/**< Dynamic update of congestion management mode for leaf nodes. */
> +	RTE_TM_UPDATE_NODE_CMAN = 1 << 6,
> +};

Same as above comment on enum.

> +struct rte_tm_level_capabilities {

IMO, a level can be either leaf or non-leaf. If so, the following struct makes
more sense to me:

        int is_leaf;
        uint32_t n_nodes_max;
        union {
                struct rte_tm_node_capabilities nonleaf;
                struct rte_tm_node_capabilities leaf;
        };

> +	/**< Maximum number of nodes for the current hierarchy level. */
> +	uint32_t n_nodes_max;
> +
> +	/**< Maximum number of non-leaf nodes for the current hierarchy level.
> +	 * The value of 0 indicates that current level only supports leaf
> +	 * nodes. The maximum value is *n_nodes_max*.
> +	 */
> +	uint32_t n_nodes_nonleaf_max;
> +
> +	/**< Maximum number of leaf nodes for the current hierarchy level. The
> +	 * value of 0 indicates that current level only supports non-leaf
> +	 * nodes. The maximum value is *n_nodes_max*.
> +	 */
> +	uint32_t n_nodes_leaf_max;
> +
> +	/**< Summary of node-level capabilities across all the non-leaf nodes
> +	 * of the current hierarchy level. Valid only when
> +	 * *n_nodes_nonleaf_max* is greater than 0.
> +	 */
> +	struct rte_tm_node_capabilities nonleaf;
> +
> +	/**< Summary of node-level capabilities across all the leaf nodes of
> +	 * the current hierarchy level. Valid only when *n_nodes_leaf_max* is
> +	 * greater than 0.
> +	 */
> +	struct rte_tm_node_capabilities leaf;
> +};
> +
> +/**
> + * Traffic manager capabilities
> + */
> +struct rte_tm_capabilities {
> +	/**< Maximum number of nodes. */
> +	uint32_t n_nodes_max;
> +
> +	/**< Set of supported dynamic update operations
> +	 * (see enum rte_tm_dynamic_update_type).
> +	 */
> +	uint64_t dynamic_update_mask;

IMO, it is better to change this to
enum rte_tm_dynamic_update_type dynamic_update_mask;
instead of
uint64_t dynamic_update_mask;

> +
> +	/**< Summary of node-level capabilities across all non-leaf nodes. */
> +	struct rte_tm_node_capabilities nonleaf;
> +
> +	/**< Summary of node-level capabilities across all leaf nodes. */
> +	struct rte_tm_node_capabilities leaf;
> +};
> +
> +/**
> + * Congestion management (CMAN) mode
> + *
> + * This is used for controlling the admission of packets into a packet queue or
> + * group of packet queues on congestion. On request of writing a new packet
> + * into the current queue while the queue is full, the *tail drop* algorithm
> + * drops the new packet while leaving the queue unmodified, as opposed to *head
> + * drop* algorithm, which drops the packet at the head of the queue (the oldest
> + * packet waiting in the queue) and admits the new packet at the tail of the
> + * queue.
> + *
> + * The *Random Early Detection (RED)* algorithm works by proactively dropping
> + * more and more input packets as the queue occupancy builds up. When the queue
> + * is full or almost full, RED effectively works as *tail drop*. The *Weighted
> + * RED* algorithm uses a separate set of RED thresholds for each packet color.
> + */
> +enum rte_tm_cman_mode {
> +	RTE_TM_CMAN_TAIL_DROP = 0, /**< Tail drop */

Explicit zero assignment may not be required.

> +	RTE_TM_CMAN_HEAD_DROP, /**< Head drop */
> +	RTE_TM_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
> +};
> +

> +struct rte_tm_node_params {
> +	/**< Shaper profile for the private shaper. The absence of the private
> +	 * shaper for the current node is indicated by setting this parameter
> +	 * to RTE_TM_SHAPER_PROFILE_ID_NONE.
> +	 */
> +	uint32_t shaper_profile_id;
> +
> +	/**< User allocated array of valid shared shaper IDs. */
> +	uint32_t *shared_shaper_id;
> +
> +	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
> +	uint32_t n_shared_shapers;
> +
> +	/**< Mask of statistics counter types to be enabled for this node. This
> +	 * needs to be a subset of the statistics counter types available for
> +	 * the current node. Any statistics counter type not included in this
> +	 * set is to be disabled for the current node.
> +	 */
> +	uint64_t stats_mask;

How about changing to "enum rte_tm_stats_type" instead of uint64_t?

> +
> +	union {
> +		/**< Parameters only valid for non-leaf nodes. */
> +		struct {
> +			/**< For each priority, indicates whether the children
> +			 * nodes sharing the same priority are to be scheduled
> +			 * by WFQ or by WRR. When NULL, it indicates that WFQ
> +			 * is to be used for all priorities. When non-NULL, it
> +			 * points to a pre-allocated array of *n_priority*
> +			 * elements, with a non-zero value element indicating
> +			 * WFQ and a zero value element for WRR.
> +			 */
> +			int *scheduling_mode_per_priority;
> +
> +			/**< Number of priorities. */
> +			uint32_t n_priorities;
> +		} nonleaf;


Since we are adding all the node "connecting" parameters in rte_tm_node_add(),
how about adding WFQ vs WRR as a boolean value in rte_tm_node_add() to maintain
consistency?

How about a new error type in "enum rte_tm_error_type" to specify the
connection error when the requested mode (WFQ or WRR) is not supported?

> +
> +/**
> + * Traffic manager get number of leaf nodes
> + *
> + * Each leaf node sits on on top of a TX queue of the current Ethernet port.
> + * Therefore, the set of leaf nodes is predefined, their number is always equal
> + * to N (where N is the number of TX queues configured for the current port)
> + * and their IDs are 0 .. (N-1).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param n_leaf_nodes
> + *   Number of leaf nodes for the current port.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int
> +rte_tm_get_leaf_nodes(uint8_t port_id,
> +	uint32_t *n_leaf_nodes,
> +	struct rte_tm_error *error);

In order to keep consistency with the rest of the API, IMO the API
name can be changed to rte_tm_leaf_nodes_get().

> +
> +/**
> + * Traffic manager node type (i.e. leaf or non-leaf) get
> + *
> + * The leaf nodes have predefined IDs in the range of 0 .. (N-1), where N is
> + * the number of TX queues of the current Ethernet port. The non-leaf nodes
> + * have their IDs generated by the application outside of the above range,
> + * which is reserved for leaf nodes.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID value. Needs to be valid.
> + * @param is_leaf

Change to "@param[out] is_leaf" to indicate that the parameter is an output.
I guess that scheme is missing throughout the header file. It would be good to have.


> + *   Set to non-zero value when node is leaf and to zero otherwise (non-leaf).
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int
> +rte_tm_node_type_get(uint8_t port_id,
> +	uint32_t node_id,
> +	int *is_leaf,
> +	struct rte_tm_error *error);
> +
> +/**
> + * Traffic manager capabilities get
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param cap

Missing [out]. See above.

> + *   Traffic manager capabilities. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int
> +rte_tm_capabilities_get(uint8_t port_id,
> +	struct rte_tm_capabilities *cap,
> +	struct rte_tm_error *error);
> +

> +int
> +rte_tm_node_add(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_tm_node_params *params,
> +	struct rte_tm_error *error);

See the first comment in the beginning of the file.

> +
> + * Traffic manager node parent update
> + *
> + * Restriction for root node: its parent cannot be changed.

IMO, it would be nice to mention here the corresponding
"enum rte_tm_dynamic_update_type" flag required for this API's support. Maybe
in the common code itself we can check that and return an error if the
implementation does not provide the capability.

This is applicable to all the update APIs.

> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param parent_node_id
> + *   Node ID for the new parent. Needs to be valid.
> + * @param priority
> + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> + *   running on the parent of the current node for scheduling this child node.
> + * @param weight
> + *   Node weight. The node weight is relative to the weight sum of all siblings
> + *   that have the same priority. The lowest weight is zero. Used by the
> + *   WFQ/WRR algorithm running on the parent of the current node for scheduling
> + *   this child node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int
> +rte_tm_node_parent_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_tm_error *error);
> +

