[dpdk-dev] [PATCH 3/3] doc: update mlx guides
Adrien Mazarguil
adrien.mazarguil at 6wind.com
Tue Jul 25 09:16:39 CEST 2017
Hi Shahaf,
On Mon, Jul 24, 2017 at 03:36:37PM +0300, Shahaf Shuler wrote:
> Update the guides with:
> * New supported features.
> * Supported OFED and FW versions.
> * Quick start guide.
> * Performance tunning guide.
>
> Signed-off-by: Shahaf Shuler <shahafs at mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranjeiro at 6wind.com>
Thanks, QSG and performance tuning are especially useful. I have several
comments though (mostly nits), please see below.
> ---
> doc/guides/nics/mlx4.rst | 161 +++++++++++++++++++++++++++++++---
> doc/guides/nics/mlx5.rst | 220 +++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 343 insertions(+), 38 deletions(-)
>
> diff --git a/doc/guides/nics/mlx4.rst b/doc/guides/nics/mlx4.rst
> index f1f26d4f9..23e14e52a 100644
> --- a/doc/guides/nics/mlx4.rst
> +++ b/doc/guides/nics/mlx4.rst
> @@ -1,5 +1,6 @@
> .. BSD LICENSE
> Copyright 2012-2015 6WIND S.A.
> + Copyright 2015 Mellanox.
I know several files got this wrong, but the ending period is unnecessary
here; for once, the period on the previous line is actually part of the
"6WIND S.A." name. By the way, I intend to submit a patch soon to fix it in
existing files, with additional clean-up on top.
>
> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions
> @@ -76,6 +77,7 @@ Compiling librte_pmd_mlx4 causes DPDK to be linked against libibverbs.
> Features
> --------
>
> +- Multi arch support: x86 and Power8.
Isn't "POWER8" always written all caps? Also see next comment.
> - RSS, also known as RCA, is supported. In this mode the number of
> configured RX queues must be a power of two.
> - VLAN filtering is supported.
> @@ -87,16 +89,7 @@ Features
> - Inner L3/L4 (IP, TCP and UDP) TX/RX checksum offloading and validation.
> - Outer L3 (IP) TX/RX checksum offloading and validation for VXLAN frames.
> - Secondary process TX is supported.
> -
> -Limitations
> ------------
> -
> -- RSS hash key cannot be modified.
> -- RSS RETA cannot be configured
> -- RSS always includes L3 (IPv4/IPv6) and L4 (UDP/TCP). They cannot be
> - dissociated.
> -- Hardware counters are not implemented (they are software counters).
> -- Secondary process RX is not supported.
> +- Rx interrupts.
>
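By the way, since the power-of-two RX queue constraint above regularly trips
people up, a quick sanity check might be worth adding to the guide (a sketch
only, the queue count is an example):

```shell
# Sketch: mlx4 RSS requires the number of configured RX queues
# to be a power of two.
n=8   # example queue count
if [ "$n" -gt 0 ] && [ $(( n & (n - 1) )) -eq 0 ]; then
    echo "$n RX queues: OK for mlx4 RSS"
else
    echo "$n RX queues: not a power of two"
fi
```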
> Configuration
> -------------
> @@ -244,8 +237,8 @@ DPDK and must be installed separately:
>
> Currently supported by DPDK:
>
> -- Mellanox OFED **4.0-2.0.0.0**.
> -- Firmware version **2.40.7000**.
> +- Mellanox OFED **4.1**.
> +- Firmware version **2.36.5000** and above.
> - Supported architectures: **x86_64** and **POWER8**.
So x86_64 and POWER8 then? (not "x86" as in "32 bit")
Actually I'm not sure architecture support can be considered a PMD feature
given that DPDK itself inevitably supports a larger set. I suggest dropping
the change made to the "Features" section above.
>
> Getting Mellanox OFED
> @@ -273,6 +266,150 @@ Supported NICs
>
> * Mellanox(R) ConnectX(R)-3 Pro 40G MCX354A-FCC_Ax (2*40G)
>
> +Quick Start guide
> +------------------
> +
> +1. Download latest Mellanox OFED. For more info check the `prerequisites`_.
> +
> +2. Install the required libraries and kernel modules either by installing
> + only the required set, or by installing the entire Mellanox OFED:
> +
> + For Bare metal use:
> +
> + .. code-block:: console
> +
> + ./mlnxofedinstall
> +
> + For SR-IOV Hypervisors use:
> +
> + .. code-block:: console
> +
> + ./mlnxofedinstall --enable-sriov -hypervisor
> +
> + For SR-IOV Virtual machine use:
> +
> + .. code-block:: console
> +
> + ./mlnxofedinstall --guest
> +
> +3. Verify the firmware is the correct one:
> +
> + .. code-block:: console
> +
> + ibv_devinfo
> +
> +4. Set all ports links to ethernet, follow instruction on the screen:
ethernet => Ethernet
> +
> + .. code-block:: console
> +
> + connectx_port_config
> +
You might want to describe the manual method as well:
PCI=0001:02:03.4
echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port0"
echo eth > "/sys/bus/pci/devices/$PCI/mlx4_port1"
(actually I think this is what connectx_port_config does internally)
> +5. In case of bare metal or Hypervisor, config the optimized steering mode
> + by adding the following line to ``/etc/modprobe.d/mlx4_core.conf``:
> +
> + .. code-block:: console
> +
> + options mlx4_core log_num_mgm_entry_size=-7
> +
> + .. note::
> +
> + If VLAN filtering is used, set log_num_mgm_entry_size=-1.
> + Performance degradation can occur on this case
Missing period.
> +
> +6. Restart the driver:
> +
> + .. code-block:: console
> +
> + /etc/init.d/openibd restart
> + or:
> +
> + .. code-block:: console
> +
> + service openibd restart
> +
> +7. Enable MLX4 PMD on the ``.config`` file:
> +
> + .. code-block:: console
> +
> + CONFIG_RTE_LIBRTE_MLX4_PMD=y
> +
Looks like this duplicates the note about CONFIG_RTE_LIBRTE_MLX4_PMD in the
first section of this document. Maybe it should be removed.
> +8. Compile DPDK and you are ready to go:
> +
> + .. code-block:: console
> +
> + make config T=<cpu arch, compiler, ..>
> + make
How about linking to the relevant build documentation instead of providing
an example? Otherwise we'll have to maintain it.
> +
> +
Extra line (I think). The style in this file uses only one empty line to
separate sections.
> +Limitations and known issues
> +----------------------------
> +
> +- RSS hash key cannot be modified.
> +- RSS RETA cannot be configured
> +- RSS always includes L3 (IPv4/IPv6) and L4 (UDP/TCP). They cannot be
> + dissociated.
> +- Hardware counters are not implemented (they are software counters).
> +- Secondary process RX is not supported.
> +
I suggest leaving this section unchanged and in its original spot to make
the diff shorter.
> +Performance tunning
> +-------------------
tunning => tuning
> +
> +1. Verify the optimized steering mode is configured
Missing period or colon?
> +
> + .. code-block:: console
> +
> + cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size
> +
> +2. Use environment variable MLX4_INLINE_RECV_SIZE=64 to get maximum
> + performance for 64B messages.
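Maybe also show how the variable is passed, since it must end up in the
application's environment (a sketch; the application path is only a
placeholder):

```shell
# Sketch: export the variable before launching the DPDK application
# (the application path below is a placeholder).
export MLX4_INLINE_RECV_SIZE=64
echo "MLX4_INLINE_RECV_SIZE=$MLX4_INLINE_RECV_SIZE"
# ./build/app/testpmd -- -i
```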
> +
> +3. Use the CPU near local NUMA node to which the PCIe adapter is connected,
> + for better performance. For Virtual Machines (VM), verify that the right CPU
"Virtual Machines (VM)" => either "virtual machines" of "VMs", I think the
reader understands what they are at this point.
> + and NUMA node are pinned for the VM according to the above. Run
And you should remove "for the VM".
> +
> + .. code-block:: console
> +
> + lstopo-no-graphics
> +
> + to identify the NUMA node to which the PCIe adapter is connected.
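An alternative that avoids the hwloc dependency might be worth mentioning as
well, i.e. reading the node directly from sysfs (a sketch; the PCI address is
a placeholder, and -1 means the node is unknown):

```shell
# Sketch: query the NUMA node of the adapter from sysfs.
# The PCI address is a placeholder; -1 is printed when the node is
# unknown (or, in this sketch, when the device path does not exist).
PCI=0000:82:00.0
cat "/sys/bus/pci/devices/$PCI/numa_node" 2>/dev/null || echo -1
```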
> +
> +4. If more than one adapter is used, and root complex capabilities enables
> + to put both adapters on the same NUMA node without PCI bandwidth degredation,
degredation => degradation
Also in the next sentence, "This in order to forward" => "This is in order
to forward".
> + it is recommended to locate both adapters on the same NUMA node.
> + This in order to forward packets from one to the other without
> + NUMA performance penalty.
> +
> +5. Disable pause frames
Missing period or colon.
> +
> + .. code-block:: console
> +
> + ethtool -A <netdev> rx off tx off
> +
> +6. Verify IO non-posted prefetch is disabled by default. This can be checked
> + via the BIOS configuration. Please contact you server provider for more
> + information about the settings.
> +
> +.. hint::
> +
> + On Some machines, depends on the machine intergrator, it is beneficial
Some => some
intergrator => integrator
Also a few lines up, "contact you server provider" => "contact your server
provider".
> + to set the PCI max read request parameter to 1K. This can be
> + done in the following way:
> +
> + To query the read request size use:
> +
> + .. code-block:: console
> +
> + setpci -s <NIC PCI address> 68.w
> +
> + If the output is different than 3XXX, set it by:
> +
> + .. code-block:: console
> +
> + setpci -s <NIC PCI address> 68.w=3XXX
> +
> + The XXX can be different on different systems. Make sure to configure
> + according to the setpci output.
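It might also help to explain what 3XXX means, since readers tend to copy it
blindly: bits 14:12 of that register encode the maximum read request size as
128 << field, so 3 gives the 1K mentioned above. A sketch (the register value
is an example):

```shell
# Sketch: decode the Max_Read_Request_Size field from a sample 68.w value.
# 0x3936 is an illustrative register value; bits 14:12 encode the size
# as 128 << field, so field 3 corresponds to 1024 bytes (1K).
val=0x3936
field=$(( (val >> 12) & 0x7 ))
echo "max read request: $(( 128 << field )) bytes"
```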
> +
> Usage example
> -------------
>
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
> index a68b7adc0..8accd754b 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -1,5 +1,6 @@
> .. BSD LICENSE
> Copyright 2015 6WIND S.A.
> + Copyright 2015 Mellanox.
Same nit about the period.
>
> Redistribution and use in source and binary forms, with or without
> modification, are permitted provided that the following conditions
> @@ -64,6 +65,9 @@ physical memory (or memory that does not belong to the current process).
> This capability allows the PMD to coexist with kernel network interfaces
> which remain functional, although they stop receiving unicast packets as
> long as they share the same MAC address.
> +This means legacy linux control tools (for example: ethtool, ifconfig and
Extra space before "ethtool". Also "legacy linux" => "legacy Linux", and on
the next line, "that owned by" => "that are owned by".
> +more) can operate on the same network interfaces that owned by the DPDK
> +application.
>
> Enabling librte_pmd_mlx5 causes DPDK applications to be linked against
> libibverbs.
> @@ -71,6 +75,7 @@ libibverbs.
> Features
> --------
>
> +- Multi arch support: x86, Power8, ARMv8.
I think this line should not be added, for the same reasons as mlx4.
> - Multiple TX and RX queues.
> - Support for scattered TX and RX frames.
> - IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
> @@ -92,14 +97,8 @@ Features
> - RSS hash result is supported.
> - Hardware TSO.
> - Hardware checksum TX offload for VXLAN and GRE.
> -
> -Limitations
> ------------
> -
> -- Inner RSS for VXLAN frames is not supported yet.
> -- Port statistics through software counters only.
> -- Hardware checksum RX offloads for VXLAN inner header are not supported yet.
> -- Secondary process RX is not supported.
Limitations should stay here for a shorter diff.
> +- RX interrupts
> +- Statistics query including Basic, Extended and per queue.
>
> Configuration
> -------------
> @@ -156,13 +155,12 @@ Run-time configuration
> - ``rxq_cqe_comp_en`` parameter [int]
>
> A nonzero value enables the compression of CQE on RX side. This feature
> - allows to save PCI bandwidth and improve performance at the cost of a
> - slightly higher CPU usage. Enabled by default.
> + allows to save PCI bandwidth and improve performance. Enabled by default.
>
> Supported on:
>
> - - x86_64 with ConnectX4 and ConnectX4 LX
> - - Power8 with ConnectX4 LX
> + - x86_64 with ConnectX-4, ConnectX-4LX and ConnectX-5.
> + - Power8 and ARMv8 with ConnectX-4LX and ConnectX-5.
Power8 => POWER8, and how about "ConnectX-4LX" => "ConnectX-4 LX"?
>
> - ``txq_inline`` parameter [int]
>
> @@ -170,17 +168,26 @@ Run-time configuration
> Can improve PPS performance when PCI back pressure is detected and may be
> useful for scenarios involving heavy traffic on many queues.
>
> - It is not enabled by default (set to 0) since the additional software
> - logic necessary to handle this mode can lower performance when back
> + Since the additional software logic necessary to handle this mode this
How about:
Because additional software logic is necessary to handle this mode, this
> + option should be used with care, as it can lower performance when back
> pressure is not expected.
>
> - ``txqs_min_inline`` parameter [int]
>
> Enable inline send only when the number of TX queues is greater or equal
> to this value.
> -
> This option should be used in combination with ``txq_inline`` above.
Removing the empty line causes both lines to be coalesced into a single
paragraph; if that's the intent, you should move the contents of the second
line to the end of the first one.
>
> + On ConnectX-4/ConnectX-4LX:
How about "ConnectX-4, ConnectX-4 LX and ConnectX-5 without Enhanced MPW"?
> +
> + - disabled by default. in case ``txq_inline`` is set recommendation is 4.
How about:
- Disabled by default.
- In case ``txq_inline`` is set, recommendation is 4.
> +
> + On ConnectX-5:
"On ConnectX-5 with Enhanced MPW enabled"
> +
> + - when Enhanced MPW is enabled, it is set to 8 by default.
How about:
- Set to 8 by default.
> + - otherwise disabled by default. in case ``txq_inline`` is set
> + use same values as ConnectX-4/ConnectX-4LX.
With the above changes, no need for such duplication.
> +
> - ``txq_mpw_en`` parameter [int]
>
> A nonzero value enables multi-packet send (MPS) for ConnectX-4 Lx and
> @@ -221,9 +228,7 @@ Run-time configuration
>
> A nonzero value enables hardware TSO.
> When hardware TSO is enabled, packets marked with TCP segmentation
> - offload will be divided into segments by the hardware.
> -
> - Disabled by default.
> + offload will be divided into segments by the hardware. Disabled by default.
Is coalescing on purpose?
>
> Prerequisites
> -------------
> @@ -279,13 +284,13 @@ DPDK and must be installed separately:
>
> Currently supported by DPDK:
>
> -- Mellanox OFED version: **4.0-2.0.0.0**
> +- Mellanox OFED version: **4.1**.
> - firmware version:
>
> - - ConnectX-4: **12.18.2000**
> - - ConnectX-4 Lx: **14.18.2000**
> - - ConnectX-5: **16.19.1200**
> - - ConnectX-5 Ex: **16.19.1200**
> + - ConnectX-4: **12.20.1010** and above.
> + - ConnectX-4 Lx: **14.20.1010** and above.
> + - ConnectX-5: **16.20.1010** and above.
> + - ConnectX-5 Ex: **16.20.1010** and above.
>
> Getting Mellanox OFED
> ~~~~~~~~~~~~~~~~~~~~~
> @@ -330,10 +335,103 @@ Supported NICs
> * Mellanox(R) ConnectX(R)-5 100G MCX556A-ECAT (2x100G)
> * Mellanox(R) ConnectX(R)-5 Ex EN 100G MCX516A-CDAT (2x100G)
>
> -Known issues
> -------------
> +Quick Start guide
> +------------------
"Quick Start guide" => either "Quick start guide" or "Quick Start Guide"
> +
> +1. Download latest Mellanox OFED. For more info check the `prerequisites`_.
> +
> +
> +2. Install the required libraries and kernel modules either by installing
> + only the required set, or by installing the entire Mellanox OFED:
> +
> + .. code-block:: console
> +
> + ./mlnxofedinstall
> +
> +3. Verify the firmware is the correct one:
> +
> + .. code-block:: console
> +
> + ibv_devinfo
> +
> +4. Verify all ports links are set to Ethernet:
> +
> + .. code-block:: console
> +
> + mlxconfig -d <mst device> query | grep LINK_TYPE
> + LINK_TYPE_P1 ETH(2)
> + LINK_TYPE_P2 ETH(2)
> +
> + If the Links are not in the current protocol move the to Ethernet:
Links => links
the => them
"the current protocol" is rather unclear, how about:
Link types may have to be configured to Ethernet:
> +
> + .. code-block:: console
> +
> + mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3
> +
> + * LINK_TYPE_P1=<1|2|3> , 1=Infiniband 2=Ethernet 3=VPI(auto-sense)
> +
> + For Hypervisors verify SR-IOV is enabled on the NIC:
Hypervisors => hypervisors
> +
> + .. code-block:: console
> +
> + mlxconfig -d <mst device> query | grep SRIOV_EN
> + SRIOV_EN True(1)
> +
> + If Needed, set enable the set the relevant fields:
Needed => needed
Also "set enable the set the relevant fields" looks garbled, how about
"enable the relevant fields"?
>
> -* **Flow pattern without any specific vlan will match for vlan packets as well.**
> + .. code-block:: console
> +
> + mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
> + mlxfwreset -d <mst device> reset
> +
> +5. Restart the driver:
> +
> + .. code-block:: console
> +
> + /etc/init.d/openibd restart
> + or:
> +
> + .. code-block:: console
> +
> + service openibd restart
> +
> + If port link protocol was changed need to reset the fw as well:
How about:
If link type was changed, firmware must be reset as well:
> +
> + .. code-block:: console
> +
> + mlxfwreset -d <mst device> reset
> +
> + For Hypervisors, after reset write the sysfs number of Virtual Functions
Hypervisors => hypervisors
Virtual Functions => virtual functions (why all the caps?)
> + needed for the PF.
<< Inserting an empty line might make sense here.
> + The following is an example of a standard Linux kernel generated file that
> + is available in the new kernels:
You did not provide a specific kernel version. It's a rather old feature
actually, and since it is documented for almost all other PMDs, how about:
To dynamically instantiate a given number of virtual functions (VFs):
> +
> + .. code-block:: console
> +
> + echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
> +
> +
Extra empty line.
> +6. Enable MLX5 PMD in the ``.config`` file :
> +
> + .. code-block:: console
> +
> + CONFIG_RTE_LIBRTE_MLX5_PMD=y
> +
> +7. Compile DPDK and you are ready to go:
> +
> + .. code-block:: console
> +
> + make config T=<cpu arch, compiler, ..>
> + make
Same comments for 6. and 7. as their mlx4 counterparts.
> +
> +Limitations and Known issues
> +----------------------------
> +
> +- Inner RSS for VXLAN frames is not supported yet.
> +- Port statistics through software counters only.
> +- Hardware checksum RX offloads for VXLAN inner header are not supported yet.
> +- Secondary process RX is not supported.
> +- Flow pattern without any specific vlan will match for vlan packets as well:
I suggest leaving this section in its original spot.
>
> When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wild card.
> Meaning, the flow rule::
> @@ -350,6 +448,76 @@ Known issues
>
> Will match any ipv4 packet (VLAN included).
>
> +Performance tunning
> +-------------------
tunning => tuning
> +
> +1. Configure aggressive CQE Zipping for maximum performance
Missing period or colon.
> +
> + .. code-block:: console
> +
> + mlxconfig -d <mst device> s CQE_COMPRESSION=1
> +
> + To set it back to the default CQE Zipping mode use
Missing period or colon.
> +
> + .. code-block:: console
> +
> + mlxconfig -d <mst device> s CQE_COMPRESSION=0
> +
> +2. In case of Virtualization:
Virtualization => virtualization
> +
> + - Make sure that Hypervisor kernel is 3.16 or newer.
Hypervisor => hypervisor
> + - Configure boot with "iommu=pt".
How about `` `` instead of ""?
> + - Use 1G huge pages.
> + - Make sure to allocate a VM on huge pages.
> + - Make sure to set CPU pinning.
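Since "iommu=pt" and 1G huge pages both end up on the kernel command line, a
combined example might save readers a lookup (a sketch; the option values are
illustrative and distribution-specific):

```shell
# Sketch of /etc/default/grub for a hypervisor host (values illustrative):
# iommu=pt for pass-through performance, plus sixteen 1G huge pages.
GRUB_CMDLINE_LINUX="iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=16"
echo "$GRUB_CMDLINE_LINUX"
# Then regenerate the grub config (e.g. update-grub) and reboot.
```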
> +
> +3. Use the CPU near local NUMA node to which the PCIe adapter is connected,
> + for better performance. For Virtual Machines (VM), verify that the right CPU
"Virtual Machines (VM)" => either "virtual machines" of "VMs", I think the
reader understands what they are at this point.
> + and NUMA node are pinned for the VM according to the above. Run
And you should remove "for the VM".
> +
> + .. code-block:: console
> +
> + lstopo-no-graphics
> +
> + to identify the NUMA node to which the PCIe adapter is connected.
> +
> +4. If more than one adapter is used, and root complex capabilities enables
> + to put both adapters on the same NUMA node without PCI bandwidth degredation,
degredation => degradation
> + it is recommended to locate both adapters on the same NUMA node.
> + This in order to forward packets from one to the other without
> + NUMA performance penalty.
> +
> +5. Disable pause frames
Missing period or colon.
> +
> + .. code-block:: console
> +
> + ethtool -A <netdev> rx off tx off
> +
> +6. Verify IO non-posted prefetch is disabled by default. This can be checked
> + via the BIOS configuration. Please contact you server provider for more
> + information about the settings.
> +
> +.. hint::
> +
> + On Some machines, depends on the machine intergrator, it is beneficial
Some => some
intergrator => integrator
> + to set the PCI max read request parameter to 1K. This can be
> + done in the following way:
> +
> + To query the read request size use:
> +
> + .. code-block:: console
> +
> + setpci -s <NIC PCI address> 68.w
> +
> + If the output is different than 3XXX, set it by:
> +
> + .. code-block:: console
> +
> + setpci -s <NIC PCI address> 68.w=3XXX
> +
> + The XXX can be different on different systems. Make sure to configure
> + according to the setpci output.
> +
> Notes for testpmd
> -----------------
>
> --
> 2.12.0
>
--
Adrien Mazarguil
6WIND