[dpdk-dev] [PATCH 3/3] doc: update mlx guides

Adrien Mazarguil adrien.mazarguil at 6wind.com
Tue Jul 25 09:16:39 CEST 2017


Hi Shahaf,

On Mon, Jul 24, 2017 at 03:36:37PM +0300, Shahaf Shuler wrote:
> Update the guides with:
>    * New supported features.
>    * Supported OFED and FW versions.
>    * Quick start guide.
>    * Performance tunning guide.
> 
> Signed-off-by: Shahaf Shuler <shahafs at mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranjeiro at 6wind.com>

Thanks, QSG and performance tuning are especially useful. I have several
comments though (mostly nits), please see below.

> ---
>  doc/guides/nics/mlx4.rst | 161 +++++++++++++++++++++++++++++++---
>  doc/guides/nics/mlx5.rst | 220 +++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 343 insertions(+), 38 deletions(-)
> 
> diff --git a/doc/guides/nics/mlx4.rst b/doc/guides/nics/mlx4.rst
> index f1f26d4f9..23e14e52a 100644
> --- a/doc/guides/nics/mlx4.rst
> +++ b/doc/guides/nics/mlx4.rst
> @@ -1,5 +1,6 @@
>  ..  BSD LICENSE
>      Copyright 2012-2015 6WIND S.A.
> +    Copyright 2015 Mellanox.

I know several files got this wrong, but the trailing period is not
necessary here: on the previous line it is actually part of the
"6WIND S.A." name. By the way, I intend to submit a patch soon to fix this
in existing files, with additional clean up on top.

>  
>      Redistribution and use in source and binary forms, with or without
>      modification, are permitted provided that the following conditions
> @@ -76,6 +77,7 @@ Compiling librte_pmd_mlx4 causes DPDK to be linked against libibverbs.
>  Features
>  --------
>  
> +- Multi arch support: x86 and Power8.

Isn't "POWER8" always written in all caps? Also see next comment.

>  - RSS, also known as RCA, is supported. In this mode the number of
>    configured RX queues must be a power of two.
>  - VLAN filtering is supported.
> @@ -87,16 +89,7 @@ Features
>  - Inner L3/L4 (IP, TCP and UDP) TX/RX checksum offloading and validation.
>  - Outer L3 (IP) TX/RX checksum offloading and validation for VXLAN frames.
>  - Secondary process TX is supported.
> -
> -Limitations
> ------------
> -
> -- RSS hash key cannot be modified.
> -- RSS RETA cannot be configured
> -- RSS always includes L3 (IPv4/IPv6) and L4 (UDP/TCP). They cannot be
> -  dissociated.
> -- Hardware counters are not implemented (they are software counters).
> -- Secondary process RX is not supported.
> +- Rx interrupts.
>  
>  Configuration
>  -------------
> @@ -244,8 +237,8 @@ DPDK and must be installed separately:
>  
>  Currently supported by DPDK:
>  
> -- Mellanox OFED **4.0-2.0.0.0**.
> -- Firmware version **2.40.7000**.
> +- Mellanox OFED **4.1**.
> +- Firmware version **2.36.5000** and above.
>  - Supported architectures:  **x86_64** and **POWER8**.

So x86_64 and POWER8 then? (not "x86" as in "32 bit")

Actually I'm not sure architecture support can be considered a PMD feature
given that DPDK itself inevitably supports a larger set. I suggest dropping
the change made to the "Features" section above.

>  
>  Getting Mellanox OFED
> @@ -273,6 +266,150 @@ Supported NICs
>  
>  * Mellanox(R) ConnectX(R)-3 Pro 40G MCX354A-FCC_Ax (2*40G)
>  
> +Quick Start guide
> +------------------
> +
> +1. Download latest Mellanox OFED. For more info check the  `prerequisites`_.
> +
> +2. Install the required libraries and kernel modules either by installing
> +   only the required set, or by installing the entire Mellanox OFED:
> +
> +   For Bare metal use:
> +
> +   .. code-block:: console
> +
> +        ./mlnxofedinstall
> +
> +   For SR-IOV Hypervisors use:
> +
> +   .. code-block:: console
> +
> +        ./mlnxofedinstall --enable-sriov -hypervisor
> +
> +   For SR-IOV Virtual machine use:
> +
> +   .. code-block:: console
> +
> +        ./mlnxofedinstall --guest
> +
> +3. Verify the firmware is the correct one:
> +
> +   .. code-block:: console
> +
> +        ibv_devinfo
> +
> +4. Set all ports links to ethernet, follow instruction on the screen:

ethernet => Ethernet

> +
> +   .. code-block:: console
> +
> +        connectx_port_config
> +

You might want to describe the manual method as well:

 PCI=0001:02:03.4
 echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port0"
 echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port1"

(actually I think this is what connectx_port_config does internally)
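
A rough sketch of how the manual method could be scripted, in case it
helps. The function name set_mlx4_ports_eth and the optional sysfs root
argument are made up for illustration (the latter only exists so the loop
can be dry-run against a scratch directory); the mlx4_port<N> attributes are
the ones shown above. Not run against real hardware:

```shell
# Hypothetical helper: switch every mlx4 port of one PCI device to Ethernet.
# $1: PCI address, e.g. 0001:02:03.4
# $2: sysfs root (assumption for dry runs, defaults to /sys)
set_mlx4_ports_eth() {
    pci=$1
    root=${2:-/sys}
    dev="$root/bus/pci/devices/$pci"
    found=0
    for port in "$dev"/mlx4_port*; do
        # Skip the unexpanded glob when the device has no such attribute.
        [ -e "$port" ] || continue
        echo eth > "$port" || return 1
        found=1
    done
    # Report failure when no port attribute was found at all.
    [ "$found" -eq 1 ]
}
```

It would be called as "set_mlx4_ports_eth 0001:02:03.4" and returns nonzero
when the device exposes no mlx4_port<N> attribute.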

> +5. In case of bare metal or Hypervisor, config the optimized steering mode
> +   by adding the following line to ``/etc/modprobe.d/mlx4_core.conf``:
> +
> +   .. code-block:: console
> +
> +        options mlx4_core log_num_mgm_entry_size=-7
> +
> +   .. note::
> +
> +        If VLAN filtering is used, set log_num_mgm_entry_size=-1.
> +        Performance degradation can occur on this case

Missing period.

> +
> +6. Restart the driver:
> +
> +   .. code-block:: console
> +
> +        /etc/init.d/openibd restart
> +   or:
> +
> +   .. code-block:: console
> +
> +        service openibd restart
> +
> +7. Enable MLX4 PMD on the ``.config`` file:
> +
> +    .. code-block:: console
> +
> +        CONFIG_RTE_LIBRTE_MLX4_PMD=y
> +

Looks like this duplicates the note about CONFIG_RTE_LIBRTE_MLX4_PMD in the
first section of this document. Maybe it should be removed.

> +8. Compile DPDK and you are ready to go:
> +
> +    .. code-block:: console
> +
> +        make config T=<cpu arch, compiler, ..>
> +        make

How about linking to the relevant build documentation instead of providing
an example? Otherwise we'll have to maintain it.

> +
> +

Extra line (I think). The style in this file uses only one empty line to
separate sections.

> +Limitations and known issues
> +----------------------------
> +
> +- RSS hash key cannot be modified.
> +- RSS RETA cannot be configured
> +- RSS always includes L3 (IPv4/IPv6) and L4 (UDP/TCP). They cannot be
> +  dissociated.
> +- Hardware counters are not implemented (they are software counters).
> +- Secondary process RX is not supported.
> +

I suggest leaving this section unchanged and in its original spot to make
the diff shorter.

> +Performance tunning
> +-------------------

tunning => tuning

> +
> +1. Verify the optimized steering mode is configured

Missing period or colon?

> +
> +  .. code-block:: console
> +
> +        cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size
> +
> +2. Use environment variable MLX4_INLINE_RECV_SIZE=64 to get maximum
> +   performance for 64B messages.
> +
> +3. Use the CPU near local NUMA node to which the PCIe adapter is connected,
> +   for better performance. For Virtual Machines (VM), verify that the right CPU

"Virtual Machines (VM)" => either "virtual machines" or "VMs", I think the
reader understands what they are at this point.

> +   and NUMA node are pinned for the VM according to the above. Run

And you should remove "for the VM".

> +
> +   .. code-block:: console
> +
> +        lstopo-no-graphics
> +
> +   to identify the NUMA node to which the PCIe adapter is connected.
> +
> +4. If more than one adapter is used, and root complex capabilities enables
> +   to put both adapters on the same NUMA node without PCI bandwidth degredation,

degredation => degradation

> +   it is recommended to locate both adapters on the same NUMA node.
> +   This in order to forward packets from one to the other without
> +   NUMA performance penalty.
> +
> +5. Disable pause frames

Missing period or colon.

> +
> +   .. code-block:: console
> +
> +        ethtool -A <netdev> rx off tx off
> +
> +6. Verify IO non-posted prefetch is disabled by default. This can be checked
> +   via the BIOS configuration. Please contact you server provider for more
> +   information about the settings.
> +
> +.. hint::
> +
> +        On Some machines, depends on the machine intergrator, it is beneficial

Some => some
intergrator => integrator

> +        to set the PCI max read request parameter to 1K. This can be
> +        done in the following way:
> +
> +        To query the read request size use:
> +
> +        .. code-block:: console
> +
> +                setpci -s <NIC PCI address> 68.w
> +
> +        If the output is different than 3XXX, set it by:
> +
> +        .. code-block:: console
> +
> +                setpci -s <NIC PCI address> 68.w=3XXX
> +
> +        The XXX can be different on different systems. Make sure to configure
> +        according to the setpci output.
> +
>  Usage example
>  -------------
>  
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
> index a68b7adc0..8accd754b 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -1,5 +1,6 @@
>  ..  BSD LICENSE
>      Copyright 2015 6WIND S.A.
> +    Copyright 2015 Mellanox.

Same nit about the period.

>  
>      Redistribution and use in source and binary forms, with or without
>      modification, are permitted provided that the following conditions
> @@ -64,6 +65,9 @@ physical memory (or memory that does not belong to the current process).
>  This capability allows the PMD to coexist with kernel network interfaces
>  which remain functional, although they stop receiving unicast packets as
>  long as they share the same MAC address.
> +This means legacy linux control tools (for example:  ethtool, ifconfig and

Extra space before "ethtool".

> +more) can operate on the same network interfaces that owned by the DPDK
> +application.
>  
>  Enabling librte_pmd_mlx5 causes DPDK applications to be linked against
>  libibverbs.
> @@ -71,6 +75,7 @@ libibverbs.
>  Features
>  --------
>  
> +- Multi arch support: x86, Power8, ARMv8.

I think this line should not be added, for the same reasons as mlx4.

>  - Multiple TX and RX queues.
>  - Support for scattered TX and RX frames.
>  - IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
> @@ -92,14 +97,8 @@ Features
>  - RSS hash result is supported.
>  - Hardware TSO.
>  - Hardware checksum TX offload for VXLAN and GRE.
> -
> -Limitations
> ------------
> -
> -- Inner RSS for VXLAN frames is not supported yet.
> -- Port statistics through software counters only.
> -- Hardware checksum RX offloads for VXLAN inner header are not supported yet.
> -- Secondary process RX is not supported.

Limitations should stay here for a shorter diff.

> +- RX interrupts
> +- Statistics query including Basic, Extended and per queue.
>  
>  Configuration
>  -------------
> @@ -156,13 +155,12 @@ Run-time configuration
>  - ``rxq_cqe_comp_en`` parameter [int]
>  
>    A nonzero value enables the compression of CQE on RX side. This feature
> -  allows to save PCI bandwidth and improve performance at the cost of a
> -  slightly higher CPU usage.  Enabled by default.
> +  allows to save PCI bandwidth and improve performance. Enabled by default.
>  
>    Supported on:
>  
> -  - x86_64 with ConnectX4 and ConnectX4 LX
> -  - Power8 with ConnectX4 LX
> +  - x86_64 with ConnectX-4, ConnectX-4LX and ConnectX-5.
> +  - Power8 and ARMv8 with ConnectX-4LX and ConnectX-5.

Power8 => POWER8, and how about "ConnectX-4LX" => "ConnectX-4 LX"?

>  
>  - ``txq_inline`` parameter [int]
>  
> @@ -170,17 +168,26 @@ Run-time configuration
>    Can improve PPS performance when PCI back pressure is detected and may be
>    useful for scenarios involving heavy traffic on many queues.
>  
> -  It is not enabled by default (set to 0) since the additional software
> -  logic necessary to handle this mode can lower performance when back
> +  Since the additional software logic necessary to handle this mode this

How about:

 Because additional software logic is necessary to handle this mode, this

> +  option should be used with care, as it can lower performance when back
>    pressure is not expected.
>  
>  - ``txqs_min_inline`` parameter [int]
>  
>    Enable inline send only when the number of TX queues is greater or equal
>    to this value.
> -
>    This option should be used in combination with ``txq_inline`` above.

Removing the empty line causes both lines to be coalesced into a single
paragraph. If that's the intent, you should move the contents of the second
line to the end of the first one.

>  
> +  On ConnectX-4/ConnectX-4LX:

How about "ConnectX-4, ConnectX-4 LX and ConnectX-5 without Enhanced MPW"?

> +
> +        - disabled by default. in case ``txq_inline`` is set recommendation is 4.

How about:

 - Disabled by default.
 - In case ``txq_inline`` is set, recommendation is 4.

> +
> +  On ConnectX-5:

"On ConnectX-5 with Enhanced MPW enabled"

> +
> +        - when Enhanced MPW is enabled, it is set to 8 by default.

How about:

 - Set to 8 by default.

> +        - otherwise disabled by default. in case ``txq_inline`` is set
> +          use same values as ConnectX-4/ConnectX-4LX.

With the above changes, no need for such duplication.

> +
>  - ``txq_mpw_en`` parameter [int]
>  
>    A nonzero value enables multi-packet send (MPS) for ConnectX-4 Lx and
> @@ -221,9 +228,7 @@ Run-time configuration
>  
>    A nonzero value enables hardware TSO.
>    When hardware TSO is enabled, packets marked with TCP segmentation
> -  offload will be divided into segments by the hardware.
> -
> -  Disabled by default.
> +  offload will be divided into segments by the hardware. Disabled by default.

Is coalescing on purpose?

>  
>  Prerequisites
>  -------------
> @@ -279,13 +284,13 @@ DPDK and must be installed separately:
>  
>  Currently supported by DPDK:
>  
> -- Mellanox OFED version: **4.0-2.0.0.0**
> +- Mellanox OFED version: **4.1**.
>  - firmware version:
>  
> -  - ConnectX-4: **12.18.2000**
> -  - ConnectX-4 Lx: **14.18.2000**
> -  - ConnectX-5: **16.19.1200**
> -  - ConnectX-5 Ex: **16.19.1200**
> +  - ConnectX-4: **12.20.1010** and above.
> +  - ConnectX-4 Lx: **14.20.1010** and above.
> +  - ConnectX-5: **16.20.1010** and above.
> +  - ConnectX-5 Ex: **16.20.1010** and above.
>  
>  Getting Mellanox OFED
>  ~~~~~~~~~~~~~~~~~~~~~
> @@ -330,10 +335,103 @@ Supported NICs
>  * Mellanox(R) ConnectX(R)-5 100G MCX556A-ECAT (2x100G)
>  * Mellanox(R) ConnectX(R)-5 Ex EN 100G MCX516A-CDAT (2x100G)
>  
> -Known issues
> -------------
> +Quick Start guide
> +------------------

"Quick Start guide" => either "Quick start guide" or "Quick Start Guide"

> +
> +1. Download latest Mellanox OFED. For more info check the  `prerequisites`_.
> +
> +
> +2. Install the required libraries and kernel modules either by installing
> +   only the required set, or by installing the entire Mellanox OFED:
> +
> +   .. code-block:: console
> +
> +        ./mlnxofedinstall
> +
> +3. Verify the firmware is the correct one:
> +
> +   .. code-block:: console
> +
> +        ibv_devinfo
> +
> +4. Verify all ports links are set to Ethernet:
> +
> +   .. code-block:: console
> +
> +        mlxconfig -d <mst device> query | grep LINK_TYPE
> +        LINK_TYPE_P1                        ETH(2)
> +        LINK_TYPE_P2                        ETH(2)
> +
> +   If the Links are not in the current protocol move the to Ethernet:

Links => links
the => them

"the current protocol" is rather unclear, how about:

 Link types may have to be configured to Ethernet:

> +
> +   .. code-block:: console
> +
> +        mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3
> +
> +        * LINK_TYPE_P1=<1|2|3> , 1=Infiniband 2=Ethernet 3=VPI(auto-sense)
> +
> +   For Hypervisors verify SR-IOV is enabled on the NIC:

Hypervisors => hypervisors

> +
> +   .. code-block:: console
> +
> +        mlxconfig -d <mst device> query | grep SRIOV_EN
> +        SRIOV_EN                            True(1)
> +
> +   If Needed, set enable the set the relevant fields:

Needed => needed

>  
> -* **Flow pattern without any specific vlan will match for vlan packets as well.**
> +   .. code-block:: console
> +
> +        mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
> +        mlxfwreset -d <mst device> reset
> +
> +5. Restart the driver:
> +
> +   .. code-block:: console
> +
> +        /etc/init.d/openibd restart
> +   or:
> +
> +   .. code-block:: console
> +
> +        service openibd restart
> +
> +   If port link protocol was changed need to reset the fw as well:

How about:

 If link type was changed, firmware must be reset as well:

> +
> +   .. code-block:: console
> +
> +        mlxfwreset -d <mst device> reset
> +
> +   For Hypervisors, after reset write the sysfs number of Virtual Functions

Hypervisors => hypervisors
Virtual Functions => virtual functions (why all the caps?)

> +   needed for the PF.

<< Inserting an empty line might make sense here.

> +   The following is an example of a standard Linux kernel generated file that
> +   is available in the new kernels:

You did not provide a specific kernel version. It's a rather old feature
actually, and since it is documented for almost all other PMDs, how about:

 To dynamically instantiate a given number of virtual functions (VFs):
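
Not something for the guide itself, but for reference, a minimal sketch of
that write with a sanity check against the sriov_totalvfs attribute found in
the same directory. The helper name set_num_vfs is made up for illustration
and this has not been run against real hardware:

```shell
# Hypothetical helper: request a number of VFs on one PF through sysfs.
# $1: PF device directory, e.g. /sys/class/infiniband/mlx5_0/device
# $2: number of VFs wanted
set_num_vfs() {
    dev=$1
    want=$2
    # sriov_totalvfs holds the maximum number of VFs the PF can expose.
    max=$(cat "$dev/sriov_totalvfs") || return 1
    if [ "$want" -gt "$max" ]; then
        echo "requested $want VFs, device supports at most $max" >&2
        return 1
    fi
    # The kernel rejects a change from one nonzero value to another,
    # so reset to 0 before writing the new count.
    echo 0 > "$dev/sriov_numvfs" || return 1
    echo "$want" > "$dev/sriov_numvfs"
}
```

For example: "set_num_vfs /sys/class/infiniband/mlx5_0/device 4".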

> +
> +   .. code-block:: console
> +
> +        echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
> +
> +

Extra empty line.

> +6. Enable MLX5 PMD in the ``.config`` file :
> +
> +    .. code-block:: console
> +
> +        CONFIG_RTE_LIBRTE_MLX5_PMD=y
> +
> +7. Compile DPDK and you are ready to go:
> +
> +    .. code-block:: console
> +
> +        make config T=<cpu arch, compiler, ..>
> +        make

Same comments for 6. and 7. as their mlx4 counterparts.

> +
> +Limitations and Known issues
> +----------------------------
> +
> +- Inner RSS for VXLAN frames is not supported yet.
> +- Port statistics through software counters only.
> +- Hardware checksum RX offloads for VXLAN inner header are not supported yet.
> +- Secondary process RX is not supported.
> +- Flow pattern without any specific vlan will match for vlan packets as well:

I suggest leaving this section in its original spot.

>  
>    When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
>    Meaning, the flow rule::
> @@ -350,6 +448,76 @@ Known issues
>  
>    Will match any ipv4 packet (VLAN included).
>  
> +Performance tunning
> +-------------------

tunning => tuning

> +
> +1. Configure aggressive CQE Zipping for maximum performance

Missing period or colon.

> +
> +  .. code-block:: console
> +
> +        mlxconfig -d <mst device> s CQE_COMPRESSION=1
> +
> +  To set it back to the default CQE Zipping mode use

Missing period or colon.

> +
> +  .. code-block:: console
> +
> +        mlxconfig -d <mst device> s CQE_COMPRESSION=0
> +
> +2. In case of Virtualization:

Virtualization => virtualization

> +
> +   - Make sure that Hypervisor kernel is 3.16 or newer.

Hypervisor => hypervisor

> +   - Configure boot with "iommu=pt".

How about `` `` instead of ""?

> +   - Use 1G huge pages.
> +   - Make sure to allocate a VM on huge pages.
> +   - Make sure to set CPU pinning.
> +
> +3. Use the CPU near local NUMA node to which the PCIe adapter is connected,
> +   for better performance. For Virtual Machines (VM), verify that the right CPU

"Virtual Machines (VM)" => either "virtual machines" or "VMs", I think the
reader understands what they are at this point.

> +   and NUMA node are pinned for the VM according to the above. Run

And you should remove "for the VM".

> +
> +   .. code-block:: console
> +
> +        lstopo-no-graphics
> +
> +   to identify the NUMA node to which the PCIe adapter is connected.
> +
> +4. If more than one adapter is used, and root complex capabilities enables
> +   to put both adapters on the same NUMA node without PCI bandwidth degredation,

degredation => degradation

> +   it is recommended to locate both adapters on the same NUMA node.
> +   This in order to forward packets from one to the other without
> +   NUMA performance penalty.
> +
> +5. Disable pause frames

Missing period or colon.

> +
> +   .. code-block:: console
> +
> +        ethtool -A <netdev> rx off tx off
> +
> +6. Verify IO non-posted prefetch is disabled by default. This can be checked
> +   via the BIOS configuration. Please contact you server provider for more
> +   information about the settings.
> +
> +.. hint::
> +
> +        On Some machines, depends on the machine intergrator, it is beneficial

Some => some
intergrator => integrator

> +        to set the PCI max read request parameter to 1K. This can be
> +        done in the following way:
> +
> +        To query the read request size use:
> +
> +        .. code-block:: console
> +
> +                setpci -s <NIC PCI address> 68.w
> +
> +        If the output is different than 3XXX, set it by:
> +
> +        .. code-block:: console
> +
> +                setpci -s <NIC PCI address> 68.w=3XXX
> +
> +        The XXX can be different on different systems. Make sure to configure
> +        according to the setpci output.
> +
>  Notes for testpmd
>  -----------------
>  
> -- 
> 2.12.0
> 

-- 
Adrien Mazarguil
6WIND

