[dpdk-users] Performance degradation in dpdk 2.2 when using multiple NICs on different CPU sockets

Dumitru Ceara dumitru.ceara at gmail.com
Thu May 12 16:57:47 CEST 2016


We're working on a project where we use DPDK with our own TCP
implementation on top in order to set up a high number of TCP
sessions/s between DPDK-controlled Ethernet ports.

Our reference hardware platform is:
- Super X10DRX dual socket motherboard
- 2 Intel E5-2660 v3 processors (10 cores * 2 hw threads each)
- 128GB RAM, using 16x 8G DDR4 2133MHz to fill all the memory slots
- 2 40G Intel XL710-QDA1 adapters

The CPU layout is:
$ $RTE_SDK/tools/cpu_layout.py
Core and Socket Information (as reported by '/proc/cpuinfo')

cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
sockets =  [0, 1]

         Socket 0        Socket 1
         --------        --------
Core 0  [0, 20]         [10, 30]
Core 1  [1, 21]         [11, 31]
Core 2  [2, 22]         [12, 32]
Core 3  [3, 23]         [13, 33]
Core 4  [4, 24]         [14, 34]
Core 8  [5, 25]         [15, 35]
Core 9  [6, 26]         [16, 36]
Core 10 [7, 27]         [17, 37]
Core 11 [8, 28]         [18, 38]
Core 12 [9, 29]         [19, 39]

Following the DPDK performance guidelines [1], according to section
7.2, point 3:
"Note: To get the best performance, ensure that the core and NICs are
in the same socket. In the example above 85:00.0 is on socket 1 and
should be used by cores on socket 1 for the best performance."

For testing we connected the two 40G ports back to back and decided to
install them on different sockets. We made sure that the NICs are
controlled by cores that are in the same socket:
PCI 02:00.0 (socket 0) -> cores: 2-9 (all on socket 0)
PCI 82:00.0 (socket 1) -> cores: 12-19 (all on socket 1)
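
One way to double-check a mapping like the one above is to compare the NUMA
node Linux reports for the NIC with the CPU list of that node. The following
is only a minimal sketch: the helper names are made up for illustration, it
assumes the standard sysfs files `numa_node` and `cpulist` are present, and it
expects the full PCI address including the domain (e.g. 0000:02:00.0).

```python
import os

def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-9,20-29' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def nic_numa_node(pci_addr, sysfs="/sys"):
    """NUMA node of a PCI device; the kernel reports -1 when it is unknown."""
    path = os.path.join(sysfs, "bus/pci/devices", pci_addr, "numa_node")
    with open(path) as f:
        return int(f.read().strip())

def node_cpus(node, sysfs="/sys"):
    """Set of logical CPUs that belong to the given NUMA node."""
    path = os.path.join(sysfs, "devices/system/node", f"node{node}", "cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())

def remote_cores(pci_addr, cores, sysfs="/sys"):
    """Return the cores that are NOT on the same NUMA node as the NIC."""
    node = nic_numa_node(pci_addr, sysfs)
    if node < 0:
        raise ValueError(f"{pci_addr}: NUMA node not reported by the kernel")
    return sorted(set(cores) - node_cpus(node, sysfs))
```

If the layout is as described, `remote_cores("0000:02:00.0", range(2, 10))`
should come back empty; any core it lists would be polling the NIC from the
other socket.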

With this configuration our implementation could achieve a TCP setup
rate (assuming NIC0 running the clients and NIC1 running the servers)
of ~3.2M sessions/s.
Based on our previous benchmarks on single socket servers this is
very low; we were expecting around 12M sessions/s.

(At least in theory) our implementation should scale almost linearly:
- we use RSS hashing to distribute traffic between queues and
completely split the TCP stack into independent per core-queue stacks.
- there's no locking between any of the cores handling the queues.
- there's virtually no atomic variable usage while generating the traffic.
- all the memory used for generating the TCP sessions is allocated
from the local socket of the core.
- we use per socket mbuf pools.
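
To make the design above concrete, here is a rough sketch of the idea (not
our actual code): every flow is hashed to one queue, every queue belongs to
exactly one core, and every core keeps its own session table and uses the
mbuf pool of its own socket. CRC32 is only a stand-in for the NIC's RSS hash
(real hardware uses a Toeplitz hash plus an indirection table), and the class
and pool names are hypothetical.

```python
import zlib

def queue_for_flow(src_ip, dst_ip, src_port, dst_port, n_queues):
    """Stand-in for NIC RSS: hash the 4-tuple and pick a queue.

    Assumption: CRC32 replaces the hardware hash purely for illustration.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_queues

class CoreStack:
    """Hypothetical per-core TCP state, touched by one core only (no locks)."""
    def __init__(self, core, socket):
        self.core = core
        self.socket = socket
        self.mbuf_pool = f"mbuf_pool_socket{socket}"  # one pool per socket
        self.sessions = {}  # flow -> packet count, private to this core

    def on_packet(self, flow):
        self.sessions[flow] = self.sessions.get(flow, 0) + 1

# Cores 2-9 serve the port on socket 0, as in the first test setup.
stacks = [CoreStack(core, socket=0) for core in range(2, 10)]

def dispatch(flow):
    """Deliver a packet of `flow` to the stack that owns its queue."""
    q = queue_for_flow(*flow, n_queues=len(stacks))
    stacks[q].on_packet(flow)
    return q
```

Because a given 4-tuple always hashes to the same queue, a session's state
only ever lives on one core, which is why no locking is needed between cores.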

I then found this note [1] (section 7.1.1):
"Care should be take with NUMA. If you are using 2 or more ports from
different NICs, it is best to ensure that these NICs are on the same
CPU socket. An example of how to determine this is shown further below."

We then moved both NICs to socket 1 and used the following configuration:
0000:83:00.0 (socket 1) -> cores: 12-15 (socket 1)
0000:84:00.0 (socket 1) -> cores: 16-19 (socket 1)

In this case the setup rate scaled almost linearly to ~6M sessions/s
as we originally expected.
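
The per-core numbers make the gap easier to see. This is just the arithmetic
on the figures quoted above, under the assumption that the ~12M sessions/s
expectation refers to the 8-cores-per-port setup:

```python
# (total sessions/s, cores per port) for each case mentioned in this mail.
configs = {
    "cross-socket, 8 cores per port": (3.2e6, 8),
    "same-socket, 4 cores per port": (6.0e6, 4),
    # Assumption: the 12M expectation applies to 8 cores per port.
    "expected, 8 cores per port": (12.0e6, 8),
}

def per_core_rate(total, cores):
    """Average session setup rate contributed by each core."""
    return total / cores

rates = {name: per_core_rate(t, c) for name, (t, c) in configs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate / 1e6:.2f}M sessions/s per core")
```

So each core sustains roughly 1.5M sessions/s when both NICs sit on the same
socket, but only about 0.4M sessions/s in the cross-socket setup.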

Initially I thought that the performance drop was due to the way the
driver allocates and polls the queues. However, from reading through
the i40e driver code, as far as I can see all the memory allocations
are done based on the socket-id that's passed when setting up the RX
queues (i40e_dev_rx_queue_setup), which contradicts my guess.

We'd like to eventually use both sockets at the same time, and this
performance degradation is a problem for that.
What are the alternatives to overcome this limitation?

Would running two DPDK instances (e.g., in VMs) have the same limitation?

Dumitru Ceara

[1] http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
