[PATCH v4 05/10] net/mlx5: support per-queue rate limiting
Vincent Jardin
vjardin at free.fr
Sun Mar 22 14:46:24 CET 2026
Wire rte_eth_set_queue_rate_limit() to the mlx5 PMD. The callback
allocates a per-queue PP index with the requested data rate, then
modifies the live SQ via modify_bitmask bit 0 to apply the new
packet_pacing_rate_limit_index — no queue teardown required.
Setting tx_rate=0 clears the PP index on the SQ and frees it.
Capability check uses hca_attr.qos.packet_pacing directly (not
dev_cap.txpp_en which requires Clock Queue prerequisites). This
allows per-queue rate limiting without the tx_pp devarg.
The callback rejects hairpin queues and queues whose SQ is not
yet created.
testpmd usage (no testpmd changes needed):
set port 0 queue 0 rate 1000
set port 0 queue 1 rate 5000
set port 0 queue 0 rate 0 # disable
Supported hardware:
- ConnectX-6 Dx: full support, per-SQ rate via HW rate table
- ConnectX-7/8: full support, coexists with wait-on-time scheduling
- BlueField-2/3: full support as DPU rep ports
Not supported:
- ConnectX-5: packet_pacing exists but dynamic SQ modify may not
work on all firmware versions
- ConnectX-4 Lx and earlier: no packet_pacing capability
Signed-off-by: Vincent Jardin <vjardin at free.fr>
---
doc/guides/nics/features/mlx5.ini | 1 +
doc/guides/nics/mlx5.rst | 54 ++++++++++++++
drivers/net/mlx5/mlx5.c | 2 +
drivers/net/mlx5/mlx5_tx.h | 2 +
drivers/net/mlx5/mlx5_txq.c | 118 ++++++++++++++++++++++++++++++
5 files changed, 177 insertions(+)
diff --git a/doc/guides/nics/features/mlx5.ini b/doc/guides/nics/features/mlx5.ini
index 4f9c4c309b..3b3eda28b8 100644
--- a/doc/guides/nics/features/mlx5.ini
+++ b/doc/guides/nics/features/mlx5.ini
@@ -30,6 +30,7 @@ Inner RSS = Y
SR-IOV = Y
VLAN filter = Y
Flow control = Y
+Rate limitation = Y
CRC offload = Y
VLAN offload = Y
L3 checksum offload = Y
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 6bb8c07353..c72a60f084 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -580,6 +580,60 @@ for an additional list of options shared with other mlx5 drivers.
(with ``tx_pp``) and ConnectX-7+ (wait-on-time) scheduling modes.
The default value is zero.
+.. _mlx5_per_queue_rate_limit:
+
+Per-Queue Tx Rate Limiting
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The mlx5 PMD supports per-queue Tx rate limiting via the standard ethdev
+APIs ``rte_eth_set_queue_rate_limit()`` and ``rte_eth_get_queue_rate_limit()``.
+
+This feature uses the hardware packet pacing mechanism to enforce a data
+rate on individual Tx queues without tearing down the queue. The rate is
+specified in Mbps.
+
+**Requirements:**
+
+- ConnectX-6 Dx or later with ``packet_pacing`` HCA capability.
+- The DevX path must be used (default). The legacy Verbs path
+ (``dv_flow_en=0``) does not support dynamic SQ modification and
+ returns ``-EINVAL``.
+- The queue must be started (SQ in RDY state) before setting a rate.
+
+**Supported hardware:**
+
+- ConnectX-6 Dx: per-SQ rate via HW rate table.
+- ConnectX-7/8: full support, coexists with wait-on-time scheduling.
+- BlueField-2/3: full support as DPU rep ports.
+
+**Not supported:**
+
+- ConnectX-5: ``packet_pacing`` exists but dynamic SQ modify may not
+ work on all firmware versions.
+- ConnectX-4 Lx and earlier: no ``packet_pacing`` capability.
+
+**Rate table sharing:**
+
+The hardware rate table has a limited number of entries (typically 128 on
+ConnectX-6 Dx). When multiple queues are configured with identical rate
+parameters, the kernel mlx5 driver shares a single rate table entry across
+them. Each queue still has its own independent SQ and enforces the rate
+independently; queues are never merged. The rate cap applies per queue:
+if two queues share the same 1000 Mbps entry, each can send up to
+1000 Mbps on its own; they do not share a combined budget.
+
+This sharing is transparent and only affects table capacity: 128 entries
+can serve thousands of queues as long as many use the same rate. Queues
+with different rates consume separate entries.
+
+**Usage with testpmd:**
+
+.. code-block:: console
+
+ testpmd> set port 0 queue 0 rate 1000
+ testpmd> show port 0 queue 0 rate
+ testpmd> set port 0 queue 0 rate 0
+
- ``tx_vec_en`` parameter [int]
A nonzero value enables Tx vector with ConnectX-5 NICs and above.
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e795948187..e718f0fa8c 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2621,6 +2621,7 @@ const struct eth_dev_ops mlx5_dev_ops = {
.map_aggr_tx_affinity = mlx5_map_aggr_tx_affinity,
.rx_metadata_negotiate = mlx5_flow_rx_metadata_negotiate,
.get_restore_flags = mlx5_get_restore_flags,
+ .set_queue_rate_limit = mlx5_set_queue_rate_limit,
};
/* Available operations from secondary process. */
@@ -2714,6 +2715,7 @@ const struct eth_dev_ops mlx5_dev_ops_isolate = {
.count_aggr_ports = mlx5_count_aggr_ports,
.map_aggr_tx_affinity = mlx5_map_aggr_tx_affinity,
.get_restore_flags = mlx5_get_restore_flags,
+ .set_queue_rate_limit = mlx5_set_queue_rate_limit,
};
/**
diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
index 51f330454a..975ff57acd 100644
--- a/drivers/net/mlx5/mlx5_tx.h
+++ b/drivers/net/mlx5/mlx5_tx.h
@@ -222,6 +222,8 @@ struct mlx5_txq_ctrl *mlx5_txq_get(struct rte_eth_dev *dev, uint16_t idx);
int mlx5_txq_release(struct rte_eth_dev *dev, uint16_t idx);
int mlx5_txq_releasable(struct rte_eth_dev *dev, uint16_t idx);
int mlx5_txq_verify(struct rte_eth_dev *dev);
+int mlx5_set_queue_rate_limit(struct rte_eth_dev *dev, uint16_t queue_idx,
+ uint32_t tx_rate);
int mlx5_txq_get_sqn(struct mlx5_txq_ctrl *txq);
void mlx5_txq_alloc_elts(struct mlx5_txq_ctrl *txq_ctrl);
void mlx5_txq_free_elts(struct mlx5_txq_ctrl *txq_ctrl);
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index 3356c89758..ce08363ca9 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -1363,6 +1363,124 @@ mlx5_txq_release(struct rte_eth_dev *dev, uint16_t idx)
return 0;
}
+/**
+ * Set per-queue packet pacing rate limit.
+ *
+ * @param dev
+ * Pointer to Ethernet device.
+ * @param queue_idx
+ * TX queue index.
+ * @param tx_rate
+ * TX rate in Mbps, 0 to disable rate limiting.
+ *
+ * @return
+ * 0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_set_queue_rate_limit(struct rte_eth_dev *dev, uint16_t queue_idx,
+ uint32_t tx_rate)
+{
+ struct mlx5_priv *priv = dev->data->dev_private;
+ struct mlx5_dev_ctx_shared *sh = priv->sh;
+ struct mlx5_txq_ctrl *txq_ctrl;
+ struct mlx5_devx_obj *sq_devx;
+ struct mlx5_devx_modify_sq_attr sq_attr = { 0 };
+ struct mlx5_txq_rate_limit new_rate_limit = { 0 };
+ int ret;
+
+ if (!sh->cdev->config.hca_attr.qos.packet_pacing) {
+ DRV_LOG(ERR, "Port %u packet pacing not supported.",
+ dev->data->port_id);
+ rte_errno = ENOTSUP;
+ return -rte_errno;
+ }
+ if (priv->txqs == NULL || (*priv->txqs)[queue_idx] == NULL) {
+ DRV_LOG(ERR, "Port %u Tx queue %u not configured.",
+ dev->data->port_id, queue_idx);
+ rte_errno = EINVAL;
+ return -rte_errno;
+ }
+ txq_ctrl = container_of((*priv->txqs)[queue_idx],
+ struct mlx5_txq_ctrl, txq);
+ if (txq_ctrl->is_hairpin) {
+ DRV_LOG(ERR, "Port %u Tx queue %u is hairpin.",
+ dev->data->port_id, queue_idx);
+ rte_errno = EINVAL;
+ return -rte_errno;
+ }
+ if (txq_ctrl->obj == NULL) {
+ DRV_LOG(ERR, "Port %u Tx queue %u not initialized.",
+ dev->data->port_id, queue_idx);
+ rte_errno = EINVAL;
+ return -rte_errno;
+ }
+ /*
+ * For non-hairpin queues the SQ DevX object lives in
+ * obj->sq_obj.sq (used by DevX/HWS mode), while hairpin
+ * queues use obj->sq directly. These are different members
+ * of a union inside mlx5_txq_obj.
+ */
+ sq_devx = txq_ctrl->obj->sq_obj.sq;
+ if (sq_devx == NULL) {
+ DRV_LOG(ERR, "Port %u Tx queue %u SQ not ready.",
+ dev->data->port_id, queue_idx);
+ rte_errno = EINVAL;
+ return -rte_errno;
+ }
+ if (dev->data->tx_queue_state[queue_idx] !=
+ RTE_ETH_QUEUE_STATE_STARTED) {
+ DRV_LOG(ERR,
+ "Port %u Tx queue %u is not started, start the queue before setting a rate.",
+ dev->data->port_id, queue_idx);
+ rte_errno = EINVAL;
+ return -rte_errno;
+ }
+ if (tx_rate == 0) {
+ /* Disable rate limiting. */
+ if (txq_ctrl->rate_limit.pp_id == 0)
+ return 0; /* Already disabled. */
+ sq_attr.sq_state = MLX5_SQC_STATE_RDY;
+ sq_attr.state = MLX5_SQC_STATE_RDY;
+ sq_attr.rl_update = 1;
+ sq_attr.packet_pacing_rate_limit_index = 0;
+ ret = mlx5_devx_cmd_modify_sq(sq_devx, &sq_attr);
+ if (ret) {
+ DRV_LOG(ERR,
+ "Port %u Tx queue %u failed to clear rate.",
+ dev->data->port_id, queue_idx);
+ rte_errno = -ret;
+ return ret;
+ }
+ mlx5_txq_free_pp_rate_limit(&txq_ctrl->rate_limit);
+ DRV_LOG(DEBUG, "Port %u Tx queue %u rate limit disabled.",
+ dev->data->port_id, queue_idx);
+ return 0;
+ }
+ /* Allocate a new PP index for the requested rate into a temp. */
+ ret = mlx5_txq_alloc_pp_rate_limit(sh, &new_rate_limit, tx_rate);
+ if (ret)
+ return ret;
+ /* Modify live SQ to use the new PP index. */
+ sq_attr.sq_state = MLX5_SQC_STATE_RDY;
+ sq_attr.state = MLX5_SQC_STATE_RDY;
+ sq_attr.rl_update = 1;
+ sq_attr.packet_pacing_rate_limit_index = new_rate_limit.pp_id;
+ ret = mlx5_devx_cmd_modify_sq(sq_devx, &sq_attr);
+ if (ret) {
+ DRV_LOG(ERR, "Port %u Tx queue %u failed to set rate %u Mbps.",
+ dev->data->port_id, queue_idx, tx_rate);
+ mlx5_txq_free_pp_rate_limit(&new_rate_limit);
+ rte_errno = -ret;
+ return ret;
+ }
+ /* SQ updated — release old PP context, install new one. */
+ mlx5_txq_free_pp_rate_limit(&txq_ctrl->rate_limit);
+ txq_ctrl->rate_limit = new_rate_limit;
+ DRV_LOG(DEBUG, "Port %u Tx queue %u rate set to %u Mbps (PP idx %u).",
+ dev->data->port_id, queue_idx, tx_rate, txq_ctrl->rate_limit.pp_id);
+ return 0;
+}
+
/**
* Verify if the queue can be released.
*
--
2.43.0