[PATCH v1 01/10] doc/nics/mlx5: fix stale packet pacing documentation
Slava Ovsiienko
viacheslavo at nvidia.com
Wed Mar 11 13:26:02 CET 2026
Nice clarification, Vincent.
Thank you.
Acked-by: Viacheslav Ovsiienko <viacheslavo at nvidia.com>
> -----Original Message-----
> From: Vincent Jardin <vjardin at free.fr>
> Sent: Tuesday, March 10, 2026 11:20 AM
> To: dev at dpdk.org
> Cc: Raslan Darawsheh <rasland at nvidia.com>; NBU-Contact-Thomas Monjalon
> (EXTERNAL) <thomas at monjalon.net>; andrew.rybchenko at oktetlabs.ru;
> Dariusz Sosnowski <dsosnowski at nvidia.com>; Slava Ovsiienko
> <viacheslavo at nvidia.com>; Bing Zhao <bingz at nvidia.com>; Ori Kam
> <orika at nvidia.com>; Suanming Mou <suanmingm at nvidia.com>; Matan Azrad
> <matan at nvidia.com>; Vincent Jardin <vjardin at free.fr>
> Subject: [PATCH v1 01/10] doc/nics/mlx5: fix stale packet pacing documentation
>
> The Tx Scheduling section incorrectly stated that timestamps can only be put on
> the first packet in a burst. The driver actually checks every packet's ol_flags for
> the timestamp dynamic flag and inserts a dedicated WAIT WQE per
> timestamped packet. The eMPW path also breaks batches when a timestamped
> packet is encountered.
>
> Additionally, the ConnectX-7+ wait-on-time capability was only briefly
> mentioned in the tx_pp parameter section with no explanation of how it differs
> from the ConnectX-6 Dx Clock Queue approach.
>
> This patch:
> - Removes the stale first-packet-only limitation
> - Documents both scheduling mechanisms (ConnectX-6 Dx Clock Queue and
> ConnectX-7+ wait-on-time) with separate requirements tables
> - Clarifies that tx_pp is specific to ConnectX-6 Dx
> - Fixes tx_skew applicability to cover both hardware generations
> - Updates the Send Scheduling Counters intro to reflect that timestamp
> validation counters also apply to ConnectX-7+ wait-on-time mode
>
> Fixes: 8f848f32fc24 ("net/mlx5: introduce send scheduling devargs")
>
> Signed-off-by: Vincent Jardin <vjardin at free.fr>
> ---
> doc/guides/nics/mlx5.rst | 109 ++++++++++++++++++++++++++++-----------
> 1 file changed, 78 insertions(+), 31 deletions(-)
>
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
> index 2529c2f4c8..5b097dbc90 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -553,27 +553,32 @@ for an additional list of options shared with other mlx5 drivers.
>
> - ``tx_pp`` parameter [int]
>
> + This parameter applies to **ConnectX-6 Dx** only.
> If a nonzero value is specified the driver creates all necessary internal
> - objects to provide accurate packet send scheduling on mbuf timestamps.
> + objects (Clock Queue and Rearm Queue) to provide accurate packet send
> + scheduling on mbuf timestamps using a cross-channel approach.
> The positive value specifies the scheduling granularity in nanoseconds,
> the packet send will be accurate up to specified digits. The allowed range is
> from 500 to 1 million of nanoseconds. The negative value specifies the module
> of granularity and engages the special test mode the check the schedule rate.
> By default (if the ``tx_pp`` is not specified) send scheduling on timestamps
> - feature is disabled.
> + feature is disabled on ConnectX-6 Dx.
>
> - Starting with ConnectX-7 the capability to schedule traffic directly
> - on timestamp specified in descriptor is provided,
> - no extra objects are needed anymore and scheduling capability
> - is advertised and handled regardless ``tx_pp`` parameter presence.
> + Starting with **ConnectX-7** the hardware provides a native
> + wait-on-time capability that inserts the scheduling delay directly in the WQE descriptor.
> + No Clock Queue or Rearm Queue is needed and the ``tx_pp`` parameter
> + is not required. The driver automatically advertises send scheduling
> + support when the HCA wait-on-time capability is detected. The
> + ``tx_skew`` parameter can still be used on ConnectX-7 and above to compensate for wire delay.
>
> - ``tx_skew`` parameter [int]
>
> The parameter adjusts the send packet scheduling on timestamps and represents
> the average delay between beginning of the transmitting descriptor processing
> by the hardware and appearance of actual packet data on the wire. The value
> - should be provided in nanoseconds and is valid only if ``tx_pp`` parameter is
> - specified. The default value is zero.
> + should be provided in nanoseconds and applies to both ConnectX-6 Dx
> + (with ``tx_pp``) and ConnectX-7+ (wait-on-time) scheduling modes.
> + The default value is zero.
>
> - ``tx_vec_en`` parameter [int]
>
> @@ -883,9 +888,13 @@ Send Scheduling Counters
>
> The mlx5 PMD provides a comprehensive set of counters designed for
> debugging and diagnostics related to packet scheduling during transmission.
> -These counters are applicable only if the port was configured with the ``tx_pp`` devarg
> -and reflect the status of the PMD scheduling infrastructure
> -based on Clock and Rearm Queues, used as a workaround on ConnectX-6 DX NICs.
> +The first group of counters (prefixed ``tx_pp_``) reflects the status
> +of the Clock Queue and Rearm Queue infrastructure used on ConnectX-6 Dx
> +and is applicable only if the port was configured with the ``tx_pp`` devarg.
> +The timestamp validation counters
> +(``tx_pp_timestamp_past_errors``, ``tx_pp_timestamp_future_errors``,
> +``tx_pp_timestamp_order_errors``) are also reported on ConnectX-7 and
> +above in wait-on-time mode, without requiring ``tx_pp``.
>
> ``tx_pp_missed_interrupt_errors``
> Indicates that the Rearm Queue interrupt was not serviced on time.
> @@ -1960,31 +1969,54 @@ Limitations
> Tx Scheduling
> ~~~~~~~~~~~~~
>
> -When PMD sees the ``RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME`` set on the packet
> -being sent it tries to synchronize the time of packet appearing on
> -the wire with the specified packet timestamp. If the specified one
> -is in the past it should be ignored, if one is in the distant future
> -it should be capped with some reasonable value (in range of seconds).
> -These specific cases ("too late" and "distant future") can be optionally
> -reported via device xstats to assist applications to detect the
> -time-related problems.
> -
> -The timestamp upper "too-distant-future" limit
> -at the moment of invoking the Tx burst routine
> -can be estimated as ``tx_pp`` option (in nanoseconds) multiplied by 2^23.
> +When the PMD sees ``RTE_MBUF_DYNFLAG_TX_TIMESTAMP_NAME`` set on a
> +packet being sent it inserts a dedicated WAIT WQE to synchronize the
> +time of the packet appearing on the wire with the specified timestamp.
> +Every packet in a burst that carries the timestamp dynamic flag is
> +individually scheduled -- there is no restriction to the first packet only.
> +
> +If the specified timestamp is in the past, the packet is sent immediately.
> +If it is in the distant future it should be capped with some reasonable
> +value (in range of seconds). These specific cases ("too late" and
> +"distant future") can be optionally reported via device xstats to
> +assist applications to detect time-related problems.
> +
> +The eMPW (enhanced Multi-Packet Write) data path automatically breaks
> +the batch when a timestamped packet is encountered, ensuring each
> +scheduled packet gets its own WAIT WQE.
> +
> +Two hardware mechanisms are supported:
> +
> +**ConnectX-6 Dx -- Clock Queue (cross-channel)**
> + The driver creates a Clock Queue and a Rearm Queue that together
> + provide a time reference for scheduling. This mode requires the
> + :ref:`tx_pp <mlx5_tx_pp_param>` devarg. The timestamp upper
> + "too-distant-future" limit at the moment of invoking the Tx burst
> + routine can be estimated as ``tx_pp`` (in nanoseconds) multiplied
> + by 2^23.
> +
> +**ConnectX-7 and above -- wait-on-time**
> + The hardware supports placing the scheduling delay directly inside
> + the WQE descriptor. No Clock Queue or Rearm Queue is needed and the
> + ``tx_pp`` devarg is **not** required. The driver automatically
> + advertises send scheduling support when the HCA wait-on-time
> + capability is detected.
> +
> Please note, for the testpmd txonly mode, the limit is deduced from the
> expression::
>
> (n_tx_descriptors / burst_size + 1) * inter_burst_gap
>
> -There is no any packet reordering according timestamps is supposed,
> -neither within packet burst, nor between packets, it is an entirely
> -application responsibility to generate packets and its timestamps
> -in desired order.
> +There is no packet reordering according to timestamps, neither within a
> +packet burst, nor between packets. It is entirely the application's
> +responsibility to generate packets and their timestamps in the desired
> +order.
>
> Requirements
> ^^^^^^^^^^^^
>
> +ConnectX-6 Dx (Clock Queue mode):
> +
> ========= =============
> Minimum Version
> ========= =============
> @@ -1996,20 +2028,35 @@ rdma-core
> DPDK 20.08
> ========= =============
>
> +ConnectX-7 and above (wait-on-time mode):
> +
> +========= =============
> +Minimum Version
> +========= =============
> +hardware ConnectX-7
> +========= =============
> +
> Firmware configuration
> ^^^^^^^^^^^^^^^^^^^^^^
>
> Runtime configuration
> ^^^^^^^^^^^^^^^^^^^^^
>
> -To provide the packet send scheduling on mbuf timestamps the ``tx_pp``
> -parameter should be specified.
> +**ConnectX-6 Dx**: the :ref:`tx_pp <mlx5_tx_pp_param>` parameter must
> +be specified to enable send scheduling on mbuf timestamps.
> +
> +**ConnectX-7+**: no devarg is required. Send scheduling is
> +automatically enabled when the HCA reports the wait-on-time capability.
> +
> +On both hardware generations the ``tx_skew`` parameter can be used to
> +compensate for the delay between descriptor processing and actual wire
> +time.
>
> Limitations
> ^^^^^^^^^^^
>
> -#. The timestamps can be put only in the first packet
> - in the burst providing the entire burst scheduling.
> +#. On ConnectX-6 Dx (Clock Queue mode) timestamps too far in the future
> + are capped (see the ``tx_pp`` x 2^23 limit above).
>
>
> .. _mlx5_tx_inline:
> --
> 2.43.0