[PATCH v1 00/10] net/mlx5: per-queue Tx rate limiting via packet pacing
Stephen Hemminger
stephen at networkplumber.org
Tue Mar 10 15:20:04 CET 2026
On Tue, 10 Mar 2026 10:20:04 +0100
Vincent Jardin <vjardin at free.fr> wrote:
> This series adds per-queue Tx rate limiting to the mlx5 PMD using
> the HW packet pacing (PP) rate table.
>
> The ConnectX-6 Dx and later NICs expose a per-SQ
> packet_pacing_rate_limit_index that can be changed on a live SQ
> via modify_bitmask without queue teardown. The kernel mlx5 driver
> refcounts PP contexts internally, so queues configured at the same
> rate share a single HW rate table entry.
>
> The series is structured as follows:
>
> 1. Doc fix for stale packet pacing documentation
> 2-3. common/mlx5: query PP capabilities and extend SQ modify
> 4-6. net/mlx5: per-queue PP infrastructure, rate_limit callback,
> burst pacing devargs (tx_burst_bound, tx_typical_pkt_sz)
> 7. net/mlx5: testpmd command to query per-queue rate state
> 8. ethdev: add rte_eth_get_queue_rate_limit() symmetric getter
> 9. net/mlx5: share PP rate table entries across queues
> 10. net/mlx5: rate table capacity query API
>
> Usage with testpmd:
> set port 0 queue 0 rate 1000
> set port 0 queue 1 rate 5000
> set port 0 queue 0 rate 0 # disable
> mlx5 port 0 txq 0 rate show # query
>
> Tested on ConnectX-6 Dx only.
>
> Vincent Jardin (11):
> doc/nics/mlx5: fix stale packet pacing documentation
> common/mlx5: query packet pacing rate table capabilities
> common/mlx5: extend SQ modify to support rate limit update
> net/mlx5: add per-queue packet pacing infrastructure
> net/mlx5: support per-queue rate limiting
> net/mlx5: add burst pacing devargs
> net/mlx5: add testpmd command to query per-queue rate limit
> ethdev: add getter for per-queue Tx rate limit
> mailmap: update Vincent Jardin email address
> net/mlx5: share pacing rate table entries across queues
> net/mlx5: add rate table capacity query API
>
> .mailmap | 3 +-
> doc/guides/nics/mlx5.rst | 125 +++++++++++++++++------
> drivers/common/mlx5/mlx5_devx_cmds.c | 20 ++++
> drivers/common/mlx5/mlx5_devx_cmds.h | 14 ++-
> drivers/net/mlx5/mlx5.c | 46 +++++++++
> drivers/net/mlx5/mlx5.h | 13 +++
> drivers/net/mlx5/mlx5_testpmd.c | 93 +++++++++++++++++
> drivers/net/mlx5/mlx5_tx.c | 89 +++++++++++++++++
> drivers/net/mlx5/mlx5_tx.h | 5 +
> drivers/net/mlx5/mlx5_txpp.c | 75 ++++++++++++++
> drivers/net/mlx5/mlx5_txq.c | 144 +++++++++++++++++++++++++++
> drivers/net/mlx5/rte_pmd_mlx5.h | 57 +++++++++++
> lib/ethdev/ethdev_driver.h | 7 ++
> lib/ethdev/rte_ethdev.c | 28 ++++++
> lib/ethdev/rte_ethdev.h | 24 +++++
> 15 files changed, 710 insertions(+), 33 deletions(-)
>
Lots to digest here, so I did a first-pass review using AI.
Review: [PATCH v1 00/10] mlx5 per-queue packet pacing rate limiting
Submitter: Vincent Jardin <vjardin at free.fr>
================================================================
Patch 4/10: net/mlx5: add per-queue packet pacing infrastructure
================================================================
Error: Integer overflow in Mbps-to-kbps conversion
mlx5_txq_alloc_pp_rate_limit() computes:
rate_kbps = rate_mbps * 1000;
Both rate_mbps and rate_kbps are uint32_t. If rate_mbps > 4,294,967
(roughly 4.3 Tbps), the multiplication overflows and silently wraps,
programming a wrong rate into the HW rate table. While 4 Tbps is
beyond current HW, the function has no upper-bound validation against
hca_attr.qos.packet_pacing_max_rate. At minimum, validate rate_mbps
against the HCA max rate before the multiply, or widen to uint64_t:
uint64_t rate_kbps = (uint64_t)rate_mbps * 1000;
and check it fits in the 32-bit PRM field.
Warning: No validation of rate_mbps against HCA min/max rate
The HCA reports packet_pacing_min_rate and packet_pacing_max_rate
(queried in patch 2). mlx5_txq_alloc_pp_rate_limit() does not check
the requested rate against these bounds. A rate below the HW minimum
will likely be silently rounded or rejected by FW with an opaque
error. Validating early with a clear log message would be more
helpful.
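The two issues above can be fixed together. A minimal sketch, assuming the HCA bounds are available as plain kbps values (the helper name and the min_kbps/max_kbps parameters are illustrative, not the real hca_attr layout):

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert Mbps to kbps in 64 bits, then validate
 * against the HCA-reported bounds and the 32-bit PRM field before
 * programming the rate table. min_kbps/max_kbps stand in for the
 * values queried from hca_attr.qos in patch 2 (names assumed).
 */
static int
pp_rate_mbps_to_kbps(uint32_t rate_mbps, uint64_t min_kbps,
		     uint64_t max_kbps, uint32_t *out_kbps)
{
	uint64_t kbps = (uint64_t)rate_mbps * 1000; /* no 32-bit wrap */

	if (kbps > UINT32_MAX || kbps > max_kbps || kbps < min_kbps)
		return -EINVAL; /* reject early with a clear error */
	*out_kbps = (uint32_t)kbps;
	return 0;
}
```

This rejects out-of-range rates before any FW command is issued, so the caller can log a clear message instead of surfacing an opaque FW status.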
================================================================
Patch 5/10: net/mlx5: support per-queue rate limiting
================================================================
Error: PP index leaked when mlx5_devx_cmd_modify_sq() fails on rate set
In mlx5_set_queue_rate_limit() for the tx_rate > 0 path:
ret = mlx5_txq_alloc_pp_rate_limit(sh, &txq_ctrl->rl, tx_rate);
if (ret)
return ret;
...
ret = mlx5_devx_cmd_modify_sq(sq_devx, &sq_attr);
if (ret) {
...
mlx5_txq_free_pp_rate_limit(&txq_ctrl->rl);
return ret;
}
This looks correct on error -- good. However, note that
mlx5_txq_alloc_pp_rate_limit() calls mlx5_txq_free_pp_rate_limit()
internally at the top ("Free previous allocation if any"), meaning
the OLD PP index is freed BEFORE the new one is allocated, and before
the SQ is modified. If the new allocation succeeds but modify_sq
fails, the SQ still has the OLD pp_id programmed, but that old PP
context was already freed. The SQ is now referencing a freed PP index
until the next successful set or queue teardown.
Suggested fix: Do not free the old PP context inside
mlx5_txq_alloc_pp_rate_limit(). Instead, allocate the new PP into a
temporary mlx5_txq_rate_limit, modify the SQ, and only on success
free the old PP and swap in the new one. On failure, free the new PP
and leave the old one intact.
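A toy model of the suggested ordering (alloc_pp/free_pp/modify_sq are stand-ins, not the real mlx5 API; this only demonstrates that the old PP context must outlive a failed modify):

```c
#include <stdint.h>

/* Toy model: live_pp[i] tracks whether PP index i is allocated, and
 * sq_pp_index is the index currently programmed on the SQ. The real
 * driver would use DevX objects; all names here are illustrative. */
static int live_pp[8];
static uint32_t sq_pp_index;
static int modify_should_fail;	/* knob simulating a FW failure */

static int alloc_pp(uint32_t idx) { live_pp[idx] = 1; return 0; }
static void free_pp(uint32_t idx) { live_pp[idx] = 0; }

static int modify_sq(uint32_t idx)
{
	if (modify_should_fail)
		return -1;
	sq_pp_index = idx;
	return 0;
}

/* Allocate the new PP first, repoint the SQ, and only then release
 * the old PP, so a failed modify never leaves the SQ referencing a
 * freed context. Assumes new_idx differs from the current index. */
static int set_rate(uint32_t *cur_idx, int *cur_valid, uint32_t new_idx)
{
	if (alloc_pp(new_idx))
		return -1;
	if (modify_sq(new_idx)) {
		free_pp(new_idx);	/* roll back; old PP stays live */
		return -1;
	}
	if (*cur_valid)
		free_pp(*cur_idx);	/* success: drop the old reference */
	*cur_idx = new_idx;
	*cur_valid = 1;
	return 0;
}
```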
Warning: mlx5_set_queue_rate_limit return value inconsistency
The disable path (tx_rate == 0) returns the raw ret from
mlx5_devx_cmd_modify_sq() without setting rte_errno, while the
capability-check and validation paths return -rte_errno. The ethdev
layer expects negative errno values. Verify that
mlx5_devx_cmd_modify_sq() returns negative errno consistently.
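One way to make the convention uniform across all paths, as a sketch (whether mlx5_devx_cmd_modify_sq() can ever return a positive value needs checking first; the wrapper name is hypothetical):

```c
#include <errno.h>

/* Hypothetical wrapper mapping a raw driver return value onto the
 * negative-errno convention the ethdev layer expects. */
static int
to_neg_errno(int drv_ret)
{
	if (drv_ret == 0)
		return 0;
	if (drv_ret < 0)
		return drv_ret;	/* already a negative errno */
	return -EIO;		/* positive/unknown code: map to one errno */
}
```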
================================================================
Patch 7/10: net/mlx5: add testpmd command to query per-queue rate
================================================================
Error: Inverted return value check for rte_eth_tx_queue_is_valid()
In rte_pmd_mlx5_txq_rate_limit_query():
if (rte_eth_tx_queue_is_valid(port_id, queue_id))
return -EINVAL;
Judging by its name, rte_eth_tx_queue_is_valid() is a predicate:
non-zero when the queue is valid, zero when it is not. This check
therefore returns -EINVAL when the queue IS valid, and proceeds when
it is NOT valid -- the logic is inverted. Should be:
if (!rte_eth_tx_queue_is_valid(port_id, queue_id))
return -EINVAL;
Without this fix, the query always fails on valid queues and may
dereference invalid memory on invalid queues.
Warning: verify SQ object type between set and query paths
mlx5_set_queue_rate_limit() (updated in patch 7) passes
txq_ctrl->obj->sq_obj.sq to modify_sq, and
rte_pmd_mlx5_txq_rate_limit_query() passes the same field (via
sq_out) to the query. The existing codebase has
mlx5_devx_cmd_query_sq(); confirm it accepts this object and that
the sq_obj.sq vs sq distinction is handled the same way on both
paths.
================================================================
Patch 8/10: ethdev: add getter for per-queue Tx rate limit
================================================================
Warning: New ethdev API needs broader discussion
Adding a new eth_dev_ops callback (get_queue_rate_limit) changes the
ethdev driver interface. This typically requires:
- An RFC or discussion on the mailing list before the patch
- Agreement from ethdev maintainers (Thomas, Andrew, Ferruh)
- Consideration of whether this belongs as a generic ethdev API or
should remain PMD-specific
The patch adds it only to mlx5; other PMDs that support
set_queue_rate_limit (ixgbe, i40e, ice) would need implementations
too, or applications will get inconsistent -ENOTSUP.
Warning: Missing release notes for new ethdev API
rte_eth_get_queue_rate_limit() is a new public experimental API in
lib/ethdev. This needs an entry in doc/guides/rel_notes/ for the
26.07 release.
Info: RTE_EXPORT_EXPERIMENTAL_SYMBOL on new mlx5 PMD functions
rte_pmd_mlx5_txq_rate_limit_query (patch 7) has the export macro,
and rte_pmd_mlx5_pp_rate_table_query (patch 10) appears to as well
-- good.
================================================================
Patch 10/10: net/mlx5: add rate table capacity query API
================================================================
Warning: large fixed-size stack array in rte_pmd_mlx5_pp_rate_table_query()
The function declares:
uint16_t seen[RTE_MAX_QUEUES_PER_PORT];
RTE_MAX_QUEUES_PER_PORT is typically 1024, so this is a 2KB stack
allocation. While not enormous, this is inside a function that could
be called from any context. The O(n*m) dedup loop (for each queue,
scan seen[] for duplicates) is also inefficient for large queue
counts. Consider using a bitmap or a small hash set, or just
counting unique pp_ids from the shared context rather than scanning
all queues.
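A bitmap version of the dedup, as a sketch (the function name and the uint16_t index width are assumptions about how PP indices are stored):

```c
#include <stdint.h>
#include <string.h>

/* O(n) unique count over 16-bit PP indices. The 8 KiB bitmap is kept
 * out of the stack frame (static here; the driver could instead
 * allocate it once per shared context). */
static unsigned int
count_unique_pp(const uint16_t *pp_ids, unsigned int n)
{
	static uint64_t bitmap[(UINT16_MAX + 1) / 64];
	unsigned int i, uniq = 0;

	memset(bitmap, 0, sizeof(bitmap));
	for (i = 0; i < n; i++) {
		uint64_t bit = UINT64_C(1) << (pp_ids[i] & 63);

		if (!(bitmap[pp_ids[i] >> 6] & bit)) {
			bitmap[pp_ids[i] >> 6] |= bit;
			uniq++;
		}
	}
	return uniq;
}
```

Counting unique pp_ids from the shared context's refcount list, if one exists, would avoid scanning queues altogether.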
Warning: "used" count only reflects this port's queues
The function counts unique PP indices across priv->txqs_n queues
for one port, but the HW rate table is shared across all ports on
the same device (shared context). If multiple ports share the same
PCI device, the "used" count will underreport. Consider iterating
over all ports on the same sh, or documenting that the count is
per-port only.
================================================================
General series observations
================================================================
Info: Patch 1 (doc fix) has a Fixes tag referencing the original
scheduling devargs commit, which is appropriate. However, patches
2-10 are new features and should not have Fixes tags (they don't,
which is correct).
Info: The series adds three new experimental PMD APIs and one new
ethdev API. All new APIs in headers have __rte_experimental on
their own line, which is correct. The export macros use the 26.07
version tag.
Info: The cover letter describes the series structure but says
little about design rationale or testing beyond "Tested on
ConnectX-6 Dx only". For a 10-patch series adding a significant new
feature, reviewers would expect more detail on the overall design
and on test coverage.