[dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler

Stephen Hemminger stephen at networkplumber.org
Tue Dec 6 20:51:24 CET 2016


On Wed, 30 Nov 2016 18:16:50 +0000
Cristian Dumitrescu <cristian.dumitrescu at intel.com> wrote:

> This RFC proposes an ethdev-based abstraction layer for a Quality of Service (QoS)
> hierarchical scheduler. The goal of the abstraction layer is to provide a simple
> generic API that is agnostic of the underlying HW, SW or mixed HW-SW complex
> implementation.
> 
> Q1: What is the benefit of having an abstraction layer for the QoS hierarchical
> scheduler?
> A1: There is growing industry interest in handling various HW-based, SW-based
> or mixed hierarchical scheduler implementations through a unified DPDK API.
> 
> Q2: Which devices are targeted by this abstraction layer?
> A2: All current and future devices that expose a hierarchical scheduler feature
> under DPDK, including NICs, FPGAs, ASICs, SoCs and SW libraries.
> 
> Q3: Which scheduler hierarchies are supported by the API?
> A3: The intent is that any scheduler hierarchy can be described and covered by
> the current API. Of course, functional correctness, accuracy and performance
> levels depend on the specific implementations of this API.
> 
> Q4: Why put this abstraction layer into ethdev as opposed to a new device type
> (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc?
> A4: Packets are sent to the Ethernet device using the ethdev API
> rte_eth_tx_burst() function, with the hierarchical scheduling taking place
> automatically (i.e. no SW intervention) in HW implementations. In essence,
> hierarchical scheduling is performed as part of the packet TX operation.
> The hierarchical scheduler is typically the last stage before packet TX and it
> is tightly integrated with the TX stage. The hierarchical scheduler is just
> another offload feature of the Ethernet device, which needs to be accommodated
> by the ethdev API similar to any other offload feature (such as RSS, DCB,
> flow director, etc).
> Once the decision to schedule a specific packet has been taken, the packet
> cannot be dropped and has to be sent over the wire as is; otherwise, what takes
> place on the wire is not what was planned at scheduling time, and the
> scheduling is no longer accurate. (Note: some devices allow prepending headers
> to the packet after the scheduling stage at the expense of sending correction
> requests back to the scheduler, but this only strengthens the bond between
> scheduling and TX.)
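> For illustration, a minimal TX sketch (standard ethdev API only; for HW
> implementations there is no scheduler-specific call on the data path):
> 
>     #include <rte_ethdev.h>
>     #include <rte_mbuf.h>
> 
>     /* Enqueue a burst of packets for TX. With a HW hierarchical scheduler,
>      * the scheduling decision is taken inside the device as part of TX. */
>     static void
>     app_tx(uint8_t port_id, uint16_t queue_id,
>            struct rte_mbuf **pkts, uint16_t n_pkts)
>     {
>         uint16_t n_sent = rte_eth_tx_burst(port_id, queue_id, pkts, n_pkts);
>         uint16_t i;
> 
>         /* Free the packets that could not be enqueued in this burst. */
>         for (i = n_sent; i < n_pkts; i++)
>             rte_pktmbuf_free(pkts[i]);
>     }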
> 
> Q5: Given that the packet scheduling takes place automatically for pure HW
> implementations, how does packet scheduling take place for poll-mode SW
> implementations?
> A5: The API-provided function rte_sched_run() is designed to take care of this.
> For HW implementations, this function typically does nothing. For SW
> implementations, this function is typically expected to dequeue packets from
> the hierarchical scheduler and write them to the Ethernet device TX queue, to
> periodically flush any enqueue-side buffers into the hierarchical scheduler
> for burst-oriented implementations, etc.
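> A hypothetical service loop for a poll-mode SW implementation is sketched
> below; the rte_sched_run() prototype is not fixed at this point, so the sketch
> assumes it takes the port ID as its only argument. The loop would typically be
> launched on a dedicated lcore, e.g. via rte_eal_remote_launch():
> 
>     /* Dedicated lcore servicing the SW hierarchical scheduler. */
>     static int
>     sched_service_loop(void *arg)
>     {
>         uint8_t port_id = *(uint8_t *)arg;
> 
>         for ( ; ; ) {
>             /* No-op for pure HW implementations. For SW implementations:
>              * dequeue packets from the hierarchy, write them to the
>              * Ethernet TX queue, flush enqueue-side buffers, etc. */
>             rte_sched_run(port_id);
>         }
> 
>         return 0;
>     }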
> 
> Q6: Which scheduling algorithms are supported?
> A6: The fundamental scheduling algorithms that are supported are Strict Priority
> (SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at
> the level of each node of the scheduling hierarchy, regardless of the node
> level/position in the tree. The SP algorithm is used to schedule between sibling
> nodes with different priority, while WFQ is used to schedule between groups of
> siblings that have the same priority.
> Algorithms such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR
> (DWRR), etc are considered approximations of the ideal WFQ and are therefore
> assimilated to WFQ, although an associated implementation-dependent accuracy,
> performance and resource usage trade-off might exist.
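> The combination of SP and WFQ at a node can be illustrated with the simplified
> selection logic below. This is purely illustrative and not part of the proposed
> API; WFQ is approximated here by a packet-based WRR using per-round,
> weight-proportional credits (weights assumed to be >= 1):
> 
>     #include <limits.h>
> 
>     /* Illustrative per-child state at one parent node. */
>     struct child {
>         int priority;          /* lower value = higher priority (SP) */
>         unsigned int weight;   /* relative weight within its priority (WFQ) */
>         unsigned int credits;  /* packets left to serve in the current round */
>         int active;            /* non-zero if the child has packets pending */
>     };
> 
>     /* Pick the next child to serve: SP across priority levels, then a
>      * packet-based WRR (a WFQ approximation) among same-priority siblings. */
>     static struct child *
>     pick_next_child(struct child *children, int n_children)
>     {
>         int i, prio = INT_MAX;
> 
>         /* SP: find the highest priority (lowest value) with pending traffic. */
>         for (i = 0; i < n_children; i++)
>             if (children[i].active && children[i].priority < prio)
>                 prio = children[i].priority;
>         if (prio == INT_MAX)
>             return NULL; /* nothing to schedule */
> 
>         for ( ; ; ) {
>             /* Serve any child of this priority that still has credits. */
>             for (i = 0; i < n_children; i++) {
>                 struct child *c = &children[i];
> 
>                 if (c->active && c->priority == prio && c->credits > 0) {
>                     c->credits--; /* charge one packet */
>                     return c;
>                 }
>             }
>             /* Round exhausted: replenish credits in proportion to weights. */
>             for (i = 0; i < n_children; i++)
>                 if (children[i].active && children[i].priority == prio)
>                     children[i].credits = children[i].weight;
>         }
>     }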
> 
> Q7: Which congestion management algorithms are supported?
> A7: Tail drop, head drop and Weighted Random Early Detection (WRED). They are
> available for every leaf node in the hierarchy, subject to the specific
> implementation supporting them.
> 
> Q8: Is traffic shaping supported?
> A8: Yes, a number of shapers (rate limiters) can be supported for each node in
> the hierarchy (the built-in limit is currently set to 4 per node). Each
> shaper can be private to a node (used only by that node) or shared between
> multiple nodes.
> 
> Q9: What is the purpose of having shaper profiles and WRED profiles?
> A9: In most implementations, many shapers typically share the same configuration
> parameters, so defining shaper profiles simplifies the configuration task. The
> same considerations apply to WRED contexts and profiles.
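> A possible configuration sketch is shown below; note that the profile parameter
> structure and the exact function signatures are assumptions, as they are not
> spelled out in this summary:
> 
>     #include <stdint.h>
> 
>     /* Illustrative only: the structure and signatures below are assumptions. */
>     struct shaper_profile_params {
>         uint64_t rate_bytes_per_sec;  /* committed rate */
>         uint64_t burst_bytes;         /* committed burst size */
>     };
> 
>     static int
>     setup_shaping(uint8_t port_id)
>     {
>         struct shaper_profile_params sp_params = {
>             .rate_bytes_per_sec = 1250000, /* 10 Mbit/s */
>             .burst_bytes = 8192,
>         };
>         int shaper_profile_id, shaper_id;
> 
>         /* One profile, potentially shared by many shapers. */
>         shaper_profile_id = rte_eth_sched_shaper_profile_add(port_id, &sp_params);
> 
>         /* A shaper instantiated from the profile; it can be private to one
>          * node or shared between multiple nodes. */
>         shaper_id = rte_eth_sched_shaper_add(port_id, shaper_profile_id);
> 
>         return shaper_id;
>     }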
> 
> Q10: How is the scheduling hierarchy defined and created?
> A10: The scheduler hierarchy tree is set up by creating new nodes and
> connecting them to other existing nodes, which thus become parent nodes. The
> unique ID that is assigned to each node when the node is created is further
> used to update the node configuration or to connect child nodes to it. The
> leaf nodes of the scheduler hierarchy are each attached to one of the Ethernet
> device TX queues.
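> For example, a minimal hierarchy with a single leaf attached to TX queue 0
> might be created as sketched below; the rte_eth_sched_node_add() argument list
> is an assumption, with only the node ID / parent ID relationship implied by
> this summary (node parameters are passed as NULL for brevity):
> 
>     /* Illustrative node IDs chosen by the application. */
>     enum {
>         NODE_ID_ROOT = 0,
>         NODE_ID_SUBPORT,
>         NODE_ID_LEAF_TXQ0,
>     };
> 
>     static void
>     build_hierarchy(uint8_t port_id)
>     {
>         /* Root node: no parent. */
>         rte_eth_sched_node_add(port_id, NODE_ID_ROOT, -1, NULL);
> 
>         /* Intermediate node, child of the root. */
>         rte_eth_sched_node_add(port_id, NODE_ID_SUBPORT, NODE_ID_ROOT, NULL);
> 
>         /* Leaf node, child of the intermediate node, attached to TX queue 0. */
>         rte_eth_sched_node_add(port_id, NODE_ID_LEAF_TXQ0, NODE_ID_SUBPORT,
>                                NULL);
>     }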
> 
> Q11: Are on-the-fly changes of the scheduling hierarchy allowed by the API?
> A11: Yes. The actual changes take place subject to the specific implementation
> supporting them; otherwise, an error code is returned.
> 
> Q12: What is the typical function call sequence to set up and run the Ethernet
> device scheduler?
> A12: The typical simplified function call sequence is listed below:
> i) Configure the Ethernet device and its TX queues: rte_eth_dev_configure(),
> rte_eth_tx_queue_setup()
> ii) Create WRED profiles and WRED contexts, shaper profiles and shapers:
> rte_eth_sched_wred_profile_add(), rte_eth_sched_wred_context_add(),
> rte_eth_sched_shaper_profile_add(), rte_eth_sched_shaper_add()
> iii) Create the scheduler hierarchy nodes and tree: rte_eth_sched_node_add()
> iv) Freeze the start-up hierarchy and ask the device whether it supports it:
> rte_eth_sched_node_add()
> v) Start the Ethernet port: rte_eth_dev_start()
> vi) Run-time scheduler hierarchy updates: rte_eth_sched_node_add(),
> rte_eth_sched_node_<attribute>_set()
> vii) Run-time packet enqueue into the hierarchical scheduler: rte_eth_tx_burst()
> viii) Run-time support for SW poll-mode implementations (see the answer to Q5):
> rte_sched_run()
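> A condensed sketch of steps i) to v) follows; the ethdev calls are the existing
> API, while the scheduler-related calls are left as placeholder comments since
> their argument lists are not defined in this summary (RX queue setup omitted
> for brevity):
> 
>     #include <rte_ethdev.h>
> 
>     static int
>     port_sched_setup(uint8_t port_id)
>     {
>         struct rte_eth_conf dev_conf = { 0 };
>         int ret;
> 
>         /* i) Configure the Ethernet device with 1 RX and 1 TX queue. */
>         ret = rte_eth_dev_configure(port_id, 1, 1, &dev_conf);
>         if (ret < 0)
>             return ret;
>         ret = rte_eth_tx_queue_setup(port_id, 0, 512,
>                                      rte_eth_dev_socket_id(port_id), NULL);
>         if (ret < 0)
>             return ret;
> 
>         /* ii) WRED and shaper profiles/contexts:
>          *     rte_eth_sched_wred_profile_add(), rte_eth_sched_wred_context_add(),
>          *     rte_eth_sched_shaper_profile_add(), rte_eth_sched_shaper_add() */
> 
>         /* iii)-iv) Build and freeze the start-up hierarchy:
>          *     rte_eth_sched_node_add() */
> 
>         /* v) Start the port; packets are then enqueued with rte_eth_tx_burst()
>          *    and, for SW implementations, rte_sched_run() is called
>          *    periodically. */
>         return rte_eth_dev_start(port_id);
>     }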
> 
> Q13: What options does the user have when the Ethernet port does not support
> the scheduling hierarchy required by the user?
> A13: The following options are available to the user:
> i) abort
> ii) try out a new hierarchy (e.g. with fewer leaf nodes), if acceptable
> iii) wrap the Ethernet device into a new type of Ethernet device that has a SW
> front-end implementing the hierarchical scheduler (e.g. the existing DPDK
> librte_sched library); instantiate the new device type on-the-fly and check
> whether the hierarchy requirements can be met by the new device.
> 
> 
> Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu at intel.com>

This seems to be more of an abstraction of existing QoS.
Why not something like Linux Qdisc or FreeBSD DummyNet/PF/ALTQ, where the QoS
components are stackable objects? And why not make it the same as existing
OS abstractions? Rather than reinventing the wheel, which seems to be DPDK
Standard Procedure, could an existing abstraction be used?


