[dpdk-dev] DPDK future qos work

Robertson, Alan alan.robertson at intl.att.com
Wed Dec 19 17:58:13 CET 2018
Hi Cristian and Jasvinder,

AT&T has done similar work in these areas, and we have provided patches which
we hope can be used as a basis for future development.
In addition to these patches we have developed other features, such as multiple
WRED profiles per queue, and can provide patches if you feel they would be of
benefit to DPDK. We've provided comments inline (see AT&T>).

>
> Hi guys,
>
> Here is a list of incremental features and improvements we are considering
> to prototype and add to the DPDK hierarchical scheduler SW library.
> This list is driven by our findings as well as feedback from various users.
> Please take a look and feel free to add more features to this list or comment
> on the features below. Of course, all these items are subject to preserving
> the functional correctness, existing accuracy and performance of the current
> implementation.
>
> 1. Pipe level: Increase number of traffic classes (TCs). Allow a more
> flexible mapping of the pipe queues to traffic classes. Do not allocate
> memory for queues that are not needed.
> a) Currently, each pipe has 16 queues that are hardwired into 4 TCs
> scheduled with strict priority (SP), and each TC has exactly 4 queues
> that are scheduled with Weighted Fair Queuing (WFQ). Specifically,
> TC0 = [Queue 0 .. Queue 3], TC1 = [Queue 4 .. Queue 7],
> TC2 = [Queue 8 .. Queue 11], TC3 = [Queue 12 .. Queue 15].
> b) The plan is to support up to 16 TCs. All the high priority TCs
> (TC1, TC2, ...) will have exactly 1 queue, while the lowest priority TC,
> called Best Effort (BE), has 1, 4 or 8 queues. This is justified by the
> fact that typically all the high priority TCs are fully provisioned (small
> to medium traffic rates), while most of the traffic fits into the BE class,
> which is usually greatly oversubscribed.
> c) This leads to the following valid options for mapping pipe queues to TCs:
>             i. BE class has 1 queue => Max number of TCs is 16
>             ii. BE class has 4 queues => Max number of TCs is 13
>             iii. BE class has 8 queues => Max number of TCs is 9
> d) In order to keep implementation complexity under control, it is required
> that all pipes from the same subport share the same mapping of pipe queues
> to TCs.
> e) Currently, all the 16 pipe queues have to be configured (and memory
> allocated for them internally), even if not all of them are needed.
> Going forward, it shall be allowed to use fewer than 16 queues per pipe when
> not all the 16 queues are needed, and no memory shall be allocated for the
> queues that are not needed.

AT&T> This is very similar to the patches we've provided to parameterize
the queue allocation in a TC. AT&T has a standard deployment model which
needs 3 or 4 levels of prioritization, with the BE traffic requiring 5 queues
using WRR. The first 2 or 3 levels of prioritization each use a single queue:
one for network management traffic, one for real-time traffic and one for
local control traffic. This fits very closely with what you describe here.
Since we needed to provide 5 queues in a single TC for the BE traffic, we've
increased the queue allocation to 8 queues per TC to keep it a power of 2,
as the current implementation does. Since we use YANG as the configuration
interface, we can't reduce ranges because that breaks backwards compatibility.
If the proposed implementation can be parameterized in such a way that AT&T
can increase a queue range where necessary, then it will meet our
requirements and give us a reduced memory footprint.

Will the number of TCs and queues per pipe be the same for all pipes in a
subport or port, or can they vary?
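
To make the queue arithmetic in 1c) concrete, here is a minimal sketch. It
assumes 16 queues per pipe, one queue per high-priority TC, and the BE TC
placed last. The helper names max_tcs() and queue_to_tc() are hypothetical
and are not part of rte_sched.

```c
#include <stdint.h>

#define PIPE_QUEUES_MAX 16 /* queues per pipe, as in the proposal */

/*
 * Hypothetical helper: maximum number of TCs for a given BE queue count,
 * assuming every non-BE TC owns exactly one queue.
 * Returns 0 if the BE queue count does not fit in the pipe.
 */
static uint32_t max_tcs(uint32_t be_queues)
{
    if (be_queues == 0 || be_queues > PIPE_QUEUES_MAX)
        return 0;
    return PIPE_QUEUES_MAX - be_queues + 1;
}

/*
 * Hypothetical helper: map a pipe queue index to its TC index.
 * TCs 0 .. n_tcs-2 own one queue each; the BE TC (index n_tcs-1) owns
 * every remaining queue. Precondition: n_tcs >= 1.
 */
static uint32_t queue_to_tc(uint32_t queue, uint32_t n_tcs)
{
    return (queue < n_tcs - 1) ? queue : n_tcs - 1;
}
```

With these assumptions, max_tcs(1), max_tcs(4) and max_tcs(8) give the 16, 13
and 9 listed in 1c). A deployment with 3 single-queue TCs plus 5 BE queues
would use queue_to_tc(q, 4) over queues 0..7. Note that a BE class of 5
queues is outside the 1, 4 or 8 options in the proposal.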

> 2. Subport level: Allow different subports of the same port to have
> different configuration in terms of number of pipes, pipe queue sizes,
> pipe queue mapping to traffic classes, etc.
> a) In order to keep the implementation complexity under control, it is
> required that all pipes within the same subport share the same configuration
> for these parameters.
> b) Internal implications: each subport will likely need to have its
> own bitmap data structure.

AT&T> We have provided a patch to allow each subport to have different
queue sizes per TC. This requirement came about because AT&T uses a
different VLAN per customer (in our implementation a VLAN is a subport),
and each customer can purchase a different amount of bandwidth. Each traffic
class has a latency guarantee, so the queue lengths are configured accordingly.
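
As a rough illustration of how a queue length can follow from a latency
guarantee: a full queue drains in (queue bytes / rate) seconds, so the queue
is sized to at most rate * latency bytes. The sketch below is not taken from
our patches. The helper names, the fixed reference packet size and the
rounding down to a power of 2 are all assumptions.

```c
#include <stdint.h>

/* Largest power of 2 that is <= v, for v >= 1. */
static uint32_t floor_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p <= v / 2)
        p *= 2;
    return p;
}

/*
 * Hypothetical helper: queue size in packets such that a full queue of
 * pkt_bytes-sized packets drains within latency_us at rate_bytes_per_sec.
 * Rounds down to a power of 2 (an assumption) so the bound still holds.
 * Returns 0 if not even one packet fits within the latency budget.
 */
static uint32_t qsize_for_latency(uint64_t rate_bytes_per_sec,
                                  uint32_t latency_us, uint32_t pkt_bytes)
{
    uint64_t bytes, pkts;

    if (pkt_bytes == 0)
        return 0;
    bytes = rate_bytes_per_sec * latency_us / 1000000;
    pkts = bytes / pkt_bytes;
    if (pkts == 0)
        return 0;
    if (pkts > UINT32_MAX)
        pkts = UINT32_MAX;
    return floor_pow2((uint32_t)pkts);
}
```

For example, a customer on 10 Mbit/s (1250000 bytes/s) with a 10 ms bound and
1500-byte packets gets 8 packets per queue, while a 100 Mbit/s customer with
the same bound gets 64, which is why a single port-wide queue size does not
fit both.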

> 3. Redistribution of unused pipe BW to other pipes within the same subport:
> Enable the existing oversubscription mechanism by default.
> a) Currently, this mechanism needs to be explicitly enabled at build time.
> b) This change is subject to the performance impact not being significant.

AT&T> What will happen with the current maxrate implementation?
Will it still be there as a compile-time option, or will this be tunable at
runtime? We'd probably want to be able to keep the current implementation
so that existing deployments don't change behaviour during a future upgrade.

> 4. Pipe TC level: Improve shaper accuracy.
> a) The current pipe TC rate limiting mechanism is not robust and it can
> result in deadlock for certain configurations. Currently, the pipe TC
> credits are periodically cleared and re-initialized to a fixed value
> (period is configurable), which can result in deadlock if number of pipe
> TC credits is smaller than the MTU.
> b) The plan is to move the pipe TC rate limiting from the scheduler dequeue
> operation (shaping) to the scheduler enqueue operation (metering), by using
> one token bucket per pipe TC. Basically, packets that exceed the pipe TC
> rate will be detected and dropped earlier rather than later, which should
> be beneficial from the perspective of not spending cycles on packets that
> are later going to be dropped anyway.
> c) Internal implications: Number of token buckets is multiplied 16 times.
> Need to improve the token bucket performance (e.g. by using branchless
> code) in order to get back some of the performance.

AT&T> We've seen performance issues with the shaper. The first issue was
that large frames can block the shaper if the burst size is less than the
maximum MTU. To get round this we increase the burst to accommodate a maximum
MTU and scale the TC accordingly to keep the rate the same. We've also noticed
that the timer slips, resulting in reduced throughput. We've also enabled
borrowing against the token bucket, so if there's > 1 byte of credit a packet
will be sent. To make sure there's no oversubscription, the token bucket
is decremented to a negative value.
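
The two workarounds above can be sketched roughly as follows. This is an
illustration only, not the code from our patches: the struct and function
names are hypothetical, "TC" is read as the credit refill period, and the
"> 1 byte" threshold is taken literally.

```c
#include <stdbool.h>
#include <stdint.h>

#define BORROW_THRESHOLD 1 /* send while credits > 1 byte (read literally) */

/* Hypothetical token bucket whose credit count may go negative. */
struct borrow_tb {
    int64_t credits; /* bytes; negative while a borrowed packet is repaid */
    int64_t burst;   /* upper bound on credits, in bytes */
};

/* Add refill credits, capped at the burst size. */
static void tb_refill(struct borrow_tb *tb, int64_t bytes)
{
    tb->credits += bytes;
    if (tb->credits > tb->burst)
        tb->credits = tb->burst;
}

/*
 * Send if any credit above the threshold is left, even when the packet is
 * larger than the remaining credit. The full length is always charged, so
 * the deficit is repaid by later refills and the long-term rate holds.
 */
static bool tb_try_send(struct borrow_tb *tb, uint32_t pkt_len)
{
    if (tb->credits <= BORROW_THRESHOLD)
        return false;
    tb->credits -= pkt_len;
    return true;
}

/*
 * Stretch the refill period so that one refill carries at least one MTU
 * at an unchanged rate. Precondition: rate_bytes_per_sec > 0.
 */
static uint64_t period_us_for_mtu(uint64_t rate_bytes_per_sec,
                                  uint64_t period_us, uint32_t mtu)
{
    uint64_t credits = rate_bytes_per_sec * period_us / 1000000;

    if (credits >= mtu)
        return period_us;
    return ((uint64_t)mtu * 1000000 + rate_bytes_per_sec - 1) /
           rate_bytes_per_sec;
}
```

With 100 bytes of credit, a 1500-byte frame is sent and the bucket drops to
-1400. Nothing further is sent until refills bring the credit back above the
threshold. At 1 Mbit/s (125000 bytes/s) a 10 ms period carries only 1250
bytes, so the period is stretched to 12 ms to carry a 1500-byte MTU.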

What you describe about metering effectively changes the implementation
into a conditional policer. Other implementations support policers, but
normally only for real-time traffic. This has the advantage of giving
more accurate latency guarantees, since the queue size is now effectively
measured in bytes. NB: item 2 above, with the configurable queue limit per
subport, allows better queue length configurability, which helps alleviate
the issue and possibly reduces the need for a policer. Will this only
be targeted at real-time queues? Note that this will change the
behaviour of BE traffic, since it will be policed instead of shaped;
shaping allows bursts to be smoothed, thereby avoiding drops.
Bursty traffic doesn't work well with policers, since a policer really needs
to accommodate a token bucket of 2 BCs: the adjacent router won't have
a synchronized TC, which means it can send 2 BCs' worth in one of the
local router's TC intervals.
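
A small sketch of the 2 BC point, assuming a plain single-bucket policer that
drops any packet it has no tokens for. The names are hypothetical and this is
not the proposed DPDK metering code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-bucket policer: no queue, non-conforming packets drop. */
struct policer {
    int64_t tokens; /* bytes currently available */
    int64_t depth;  /* bucket depth in bytes */
};

/* Returns true if the packet conforms, false if it is dropped. */
static bool police(struct policer *p, uint32_t len)
{
    if (p->tokens < (int64_t)len)
        return false;
    p->tokens -= len;
    return true;
}

/*
 * Offer n_pkts back-to-back packets with no refill in between, as when the
 * upstream router sends the end of one of its intervals and the start of
 * the next inside a single local interval. Returns the number accepted.
 */
static uint32_t offer_burst(struct policer *p, uint32_t n_pkts, uint32_t len)
{
    uint32_t accepted = 0;
    uint32_t i;

    for (i = 0; i < n_pkts; i++)
        if (police(p, len))
            accepted++;
    return accepted;
}
```

With BC = 15000 bytes and 1500-byte packets, an upstream burst of 2 BCs (20
packets) loses half its packets against a bucket of depth BC, and none
against a bucket of depth 2 BCs. A shaper would instead queue the excess and
send it in the following interval.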

> Best regards,
> Your faithful DPDK QoS implementers,
> Cristian and Jasvinder