[dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC

Jesper Wramberg jesper.wramberg at gmail.com
Mon Nov 2 11:59:31 CET 2015


Hi again,

Thank you for your input. I have now switched to using the raw_ethernet_bw
tool as the transmitter and testpmd as the receiver. An immediate result: the
raw_ethernet_bw tool achieves TX performance very similar to that of my DPDK
transmitter.
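
For completeness, the receiver side is testpmd in rx-only forwarding mode,
launched roughly as follows (the core mask, memory channels and PCI address
below are placeholders for my setup, not my exact command line):

./testpmd -c 0xc00 -n 4 -w 0000:82:00.0 -- -i
testpmd> set fwd rxonly
testpmd> start
testpmd> show port stats all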


(note that both CPU 10 and mlx4_0 are on the same NUMA node, as intended)
taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i 2 -l 3 --duration 20 -s
1480 --dest_mac F4:52:14:7A:59:80
---------------------------------------------------------------------------------------
Post List requested - CQ moderation will be the size of the post list
---------------------------------------------------------------------------------------
                    Send Post List BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RawEth               Using SRQ      : OFF
 TX depth        : 128
 Post List       : 3
 CQ Moderation   : 3
 Mtu             : 1518[B]
 Link type       : Ethernet
 Gid index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
**raw ethernet header****************************************

--------------------------------------------------------------
| Dest MAC         | Src MAC          | Packet Type          |
|------------------------------------------------------------|
| F4:52:14:7A:59:80| E6:1D:2D:11:FF:41|DEFAULT               |
|------------------------------------------------------------|

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 1480       33242748         0.00               4691.58            3.323974
---------------------------------------------------------------------------------------


Running it with the 64-byte packets Olga specified gives the following
result:

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 64         166585650        0.00               1016.67            16.657163
---------------------------------------------------------------------------------------


The results are the same with and without flow control. I have followed the
Mellanox DPDK QSG and done everything in the performance section (except
the things regarding interrupts).

So to answer Olga's questions :-)

1: Unfortunately I can't. If I try, the firmware update tool complains because
the cards came with a Dell configuration (PSID: DEL0A70000023).

2: In my final setup I need jumbo frames, but just for the sake of testing I
tried setting CONFIG_RTE_LIBRTE_MLX4_SGE_WR_N to 1 in the DPDK config (the
one-line change shown just below). This did not change anything noticeably,
neither in my initial setup nor in the one described above.
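
For reference, the change is a single line in the DPDK configuration
(config/common_linuxapp in my tree; the default value is larger in order to
support scattered jumbo frames):

CONFIG_RTE_LIBRTE_MLX4_SGE_WR_N=1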

3: In the final setup, I plan to share the NICs between multiple
independent processes. For this reason, I wanted to use SR-IOV and
whitelist a single VF for each process (example launch lines below). For the
tests above, however, I have used the PFs for simplicity.
(Side note: I discovered that multiple DPDK instances can whitelist the same
PCI address, which might eliminate the need for SR-IOV. I wonder how that
works :-))
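
To illustrate the whitelisting, each process would get its own VF and a unique
file prefix, along these lines (the application name, core masks and PCI
addresses are made-up examples, not my exact command lines):

./tx_app -c 0x003 -n 4 -w 0000:04:00.1 --file-prefix p1 -- <app args>
./tx_app -c 0x00c -n 4 -w 0000:04:00.2 --file-prefix p2 -- <app args>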

So, in conclusion: isn't the raw_ethernet_bw tool supposed to reach a higher
output bandwidth with 1480-byte packets?

I have a sysinfo dump produced with the Mellanox sysinfo-snapshot.py script. I
can mail it to anyone who has the time to look further into this.

Thank you for your help, best regards
Jesper

2015-11-01 11:05 GMT+01:00 Olga Shern <olgas at mellanox.com>:

> Hi Jesper,
>
> Several suggestions,
> 1.      Any chance you can install the latest FW from the Mellanox web site,
> or the one included in the OFED 3.1 version that you downloaded? The latest
> version is 2.35.5100.
> 2.      Please configure SGE_NUM=1 in the DPDK config file in case you don't
> need jumbo frames. This will improve performance.
> 3.      It is not clear from your description whether you are running DPDK
> in a VM. Are you using SR-IOV?
> 4.      I suggest you run the testpmd application first. The traffic
> generator can be the raw_ethernet_bw application that comes with MLNX_OFED;
> it can generate L2, IPv4 and TCP/UDP packets.
>         For example:  taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i
> 1 -l 3 --duration 10 -s 64 --dest_mac F4:52:14:7A:59:80 &
>         This will send L2 packets via mlx4_0 NIC port 1, packet size =
> 64, for 10 seconds, batch = 3 (-l).
>         You can then check the performance using the testpmd counters.
>
> Please check the Mellanox community posts, I think they can help you.
> https://community.mellanox.com/docs/DOC-1502
>
> We also have performance suggestions in our QSG:
>
> http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v2%201_1%201.pdf
>
> Best Regards,
> Olga
>
>
> Subject: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
> Date: Saturday, 31 October 2015, 09:54:04
> From: Jesper Wramberg <jesper.wramberg at gmail.com>
> To: users at dpdk.org
>
> Hi all,
>
>
>
> I am experiencing some performance issues in a somewhat custom setup with
> two Mellanox ConnectX-3 NICs. I realize these issues might be due to the
> setup, but I was hoping someone might be able to pinpoint some possible
> problems/bottlenecks.
>
>
>
>
> The server:
>
> I have a Dell PowerEdge R630 with two Mellanox ConnectX-3 NICs (one on
> each socket). I have a minimal CentOS 7.1.1503 installed with kernel
> 3.10.0-229.
> Note that this kernel is rebuilt with most things disabled to minimize
> size, etc. It does have InfiniBand enabled, however, and mlx4_core as a
> module (since nothing works otherwise). Finally, I have connected the two
> NICs from port 2 to port 2.
>
>
>
> The firmware:
>
> I have installed the latest firmware for the NICs from Dell, which is
> 2.34.5060.
>
>
>
> The drivers, modules, etc.:
>
> I have downloaded the Mellanox OFED 3.1 package for CentOS 7.1 and used
> its rebuild feature to build it against the custom kernel. I have installed
> it using the --basic option since I just want libibverbs, libmlx4, the
> kernel modules and the openibd service. mlx4_core.conf is set for Ethernet
> on all ports. Moreover, it is configured for flow steering mode -7 and a
> few VFs. I can restart the openibd service successfully and everything
> seems to be working: ibdev2netdev reports the NICs and their VFs, etc. The
> only problem I have encountered at this stage is that the links don't
> always come up unless I unplug and re-plug the cables.
>
>
>
> DPDK setup:
>
> I have built DPDK with the mlx4 PMD using the .h/.a files from the OFED
> package, keeping the default values for everything. Running the
> simple hello world example, I can see that everything is initialized
> correctly.
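>
> For reference, enabling the mlx4 PMD is a one-line change in the DPDK
> configuration (everything else in my build is left at its default):
>
>   # config/common_linuxapp
>   CONFIG_RTE_LIBRTE_MLX4_PMD=y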
>
>
>
> Test setup:
>
> To test the performance of the NICs I have the following setup: two
> processes, P1 and P2, running on NIC A, and two other processes, P3 and P4,
> running on NIC B. All processes use virtual functions on their respective
> NICs. Depending on the test, a process either transmits or receives
> data. To transmit, I use a simple DPDK program that generates 32000
> packets and transmits them over and over until it has sent 640 million
> packets (see the sketch below). To receive, I use a simple DPDK program
> that is basically the layer 2 forwarding example without re-transmission.
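>
> Roughly, the transmit loop looks like this (simplified sketch, not my
> actual code; the burst size, queue id and packet construction are
> illustrative):
>
> #include <rte_ethdev.h>
> #include <rte_mbuf.h>
>
> static void
> tx_loop(uint8_t port_id, struct rte_mempool *pool)
> {
>         struct rte_mbuf *burst[32];
>         uint64_t sent = 0;
>
>         while (sent < 640000000ULL) {
>                 uint16_t i, n = 0;
>
>                 for (i = 0; i < 32; i++) {
>                         struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
>                         if (m == NULL)
>                                 break;
>                         /* payload is pre-generated; just reserve the
>                          * 1480-byte frame here */
>                         rte_pktmbuf_append(m, 1480);
>                         burst[n++] = m;
>                 }
>
>                 /* hand the burst to the NIC and free whatever the TX
>                  * queue did not accept */
>                 uint16_t tx = rte_eth_tx_burst(port_id, 0, burst, n);
>                 for (i = tx; i < n; i++)
>                         rte_pktmbuf_free(burst[i]);
>                 sent += tx;
>         }
> }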
>
>
>
> First test:
>
> In my first test, P1 transmits data to P3 while the other processes are
> idle.
>
> Packet size: 1480 bytes
>
> Flow control: on/off, it doesn't matter, I get the same result.
>
> Result: P3 receives all packets, but it takes 192.52 seconds ~ 3.32 Mpps ~
> 4.9 GB/s
>
>
>
> Second test:
>
> In my second test, I attempt to increase the amount of data transmitted
> over NIC A. As such, P1 transmits data to P3 while P2 transmits data to P4.
>
> Packet size: 1480 bytes
>
> Flow control: on/off, it doesn't matter, I get the same result.
>
> Results: P3 and P4 receive all packets, but it takes 364.40 seconds ~ 1.75
> Mpps ~ 2.6 GB/s for a single process to get its data transmitted.
>
>
>
>
>
> Does anyone have any idea what I am doing wrong here? In the second test I
> would expect P1 to transmit at the same speed as in the first test. It
> seems there is a bottleneck somewhere, however. I have left most
> things at their default values, but have also tried tweaking queue sizes,
> number of queues, interrupts, etc., with no luck.
>
>
>
>
>
> Best Regards,
>
> Jesper
>

