[dpdk-dev] [dpdk-users] imissed drop with mellanox connectx5

Gerry Wan gerryw at stanford.edu
Sat Jul 24 08:31:55 CEST 2021


My understanding of an increasing imissed counter is that it indicates your
processing logic in the lcore is not fast enough to handle the rate of
incoming packets, and is independent of the number of free mbufs. I would
guess that using the pipeline model (passing mbufs between lcores via
rings) involves some cross-core communication that causes cache misses (as
mentioned by Matan). A run-to-completion model may very well perform
better, although it probably depends on your entire workflow.
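
For illustration, a minimal run-to-completion sketch (the port/queue ids and
process_packet() are placeholders, not your code): RX, processing and the mbuf
free all stay on the same lcore, so the mempool's per-lcore cache keeps being
reused and no ring hop is needed:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void process_packet(struct rte_mbuf *m)
{
        (void)m; /* application logic goes here */
}

static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
                uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                                  bufs, BURST_SIZE);

                for (uint16_t i = 0; i < nb_rx; i++) {
                        process_packet(bufs[i]);
                        /* freed on the same lcore that allocated it */
                        rte_pktmbuf_free(bufs[i]);
                }
        }
}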

On Thu, Jul 22, 2021 at 3:34 AM Yaron Illouz <yaroni at radcom.com> wrote:

> Hi Matan
>
> We work with mbufs in all threads and lcores.
> We pass them from one thread to another through a DPDK ring before
> releasing them.
> There are drops of 10K to 100K pps, which we cannot live with.
>
> The drops show up in the imissed counter from rte_eth_stats_get, so I
> thought they happen at the port level and not at the mempool level.
> From what I see, the number of mbufs in the pool is stable (and close to
> the total/original number of mbufs in the pool), the rings are empty,
> traffic is well balanced between the threads, and all threads keep polling
> from the port and from the ring.
> The perf top profiler also doesn't show any unexpected function taking CPU.
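>
> A sketch of the counters I am checking (port_id, ring and mbuf_pool are
> placeholders, not the exact code):
>
> #include <inttypes.h>
> #include <stdio.h>
> #include <rte_ethdev.h>
> #include <rte_mempool.h>
> #include <rte_ring.h>
>
> static void dump_counters(uint16_t port_id, struct rte_ring *ring,
>                           struct rte_mempool *mbuf_pool)
> {
>         struct rte_eth_stats stats;
>
>         /* imissed counts packets the NIC dropped because the RX queue
>          * was not drained fast enough */
>         if (rte_eth_stats_get(port_id, &stats) == 0)
>                 printf("imissed=%" PRIu64 " ipackets=%" PRIu64 "\n",
>                        stats.imissed, stats.ipackets);
>
>         /* free mbufs left in the pool and backlog in the inter-thread ring */
>         printf("mbufs available=%u ring backlog=%u\n",
>                rte_mempool_avail_count(mbuf_pool), rte_ring_count(ring));
> }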
>
> So the only possible architecture would be to implement all the logic in
> the threads that read from the port, and to launch hundreds of threads in
> multi-queue mode reading from the port? I don't think this is a viable
> solution (the following link, for example, shows a sample application that
> passes packets from one core/thread to another:
> https://doc.dpdk.org/guides-16.04/sample_app_ug/qos_scheduler.html )
>
> Thank you for your answer
>
> -----Original Message-----
> From: Matan Azrad <matan at nvidia.com>
> Sent: Thursday, July 22, 2021 8:19 AM
> To: Yaron Illouz <yaroni at radcom.com>; users at dpdk.org
> Cc: dev at dpdk.org
> Subject: RE: imissed drop with mellanox connectx5
>
> Hi Yaron
>
> Freeing mbufs from a different lcore than the one that allocated them
> causes a miss in the allocating lcore's mempool cache on every mbuf
> allocation - the PMD constantly gets non-hot mbufs to work with.
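>
> For illustration only (the pool name and sizes are placeholders): the
> per-lcore cache is sized when the mempool is created, and it only helps
> when the same lcore both allocates and frees the mbufs:
>
> #include <rte_lcore.h>
> #include <rte_mbuf.h>
> #include <rte_mempool.h>
>
> static struct rte_mempool *create_pool(void)
> {
>         /* 512 mbufs are cached per lcore; if lcore A allocates and lcore B
>          * frees, B's cache fills up while A's cache drains, so most
>          * allocations fall back to the shared (cold) pool */
>         return rte_pktmbuf_pool_create("mbuf_pool", 1 << 18, 512, 0,
>                                        RTE_MBUF_DEFAULT_BUF_SIZE,
>                                        rte_socket_id());
> }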
>
> It can be one of the reasons for the earlier drops you see.
>
> Matan
>
> From: Yaron Illouz
> > Hi
> >
> > We try to read from a 100G Mellanox ConnectX-5 NIC without drops at the NIC.
> > All threads use core pinning and CPU isolation.
> > We use DPDK 19.11.
> > I tried to apply all the configurations that are in
> > https://fast.dpdk.org/doc/perf/DPDK_19_08_Mellanox_NIC_performance_report.pdf
> >
> > We see a strange behavior: one thread can receive 20 Gbps / 12 Mpps and
> > free the mbufs without drops, but when these mbufs are passed to another
> > thread that only frees them there are drops, even when working with more
> > threads.
> >
> > When running 1 thread that only reads from the port (no multi-queue) and
> > frees the mbufs in the same thread, there are no drops with traffic up to
> > 21 Gbps / 12.4 Mpps.
> > When running 6 threads that only read from the port (with multi-queue) and
> > free the mbufs in the same threads, there are no drops with traffic up to
> > 21 Gbps / 12.4 Mpps.
> >
> > When running 1 to 6 threads that only read from the port and pass the mbufs
> > to another 6 threads that only read from the ring and free the mbufs, there
> > are drops at the NIC (imissed counter) with traffic above 10 Gbps / 5.2
> > Mpps. (Here the receive threads were pinned to CPUs 1-6 and the additional
> > threads to CPUs 7-12, each thread on its own CPU.) Each receive thread
> > sends to one thread that frees the buffers, roughly as sketched below.
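> >
> > Simplified sketch of that split (port/queue ids are hard-coded placeholders,
> > not the real code): one lcore receives and enqueues, the other dequeues and
> > only frees:
> >
> > #include <rte_ethdev.h>
> > #include <rte_mbuf.h>
> > #include <rte_ring.h>
> >
> > #define BURST 512
> >
> > static int rx_lcore(void *arg)
> > {
> >         struct rte_ring *ring = arg;
> >         struct rte_mbuf *bufs[BURST];
> >
> >         for (;;) {
> >                 uint16_t n = rte_eth_rx_burst(0, 0, bufs, BURST);
> >                 unsigned int sent =
> >                         rte_ring_enqueue_burst(ring, (void **)bufs, n, NULL);
> >                 while (sent < n)
> >                         rte_pktmbuf_free(bufs[sent++]); /* ring full: drop */
> >         }
> >         return 0;
> > }
> >
> > static int free_lcore(void *arg)
> > {
> >         struct rte_ring *ring = arg;
> >         struct rte_mbuf *bufs[BURST];
> >
> >         for (;;) {
> >                 unsigned int n =
> >                         rte_ring_dequeue_burst(ring, (void **)bufs, BURST, NULL);
> >                 for (unsigned int i = 0; i < n; i++)
> >                         rte_pktmbuf_free(bufs[i]); /* freed on another lcore */
> >         }
> >         return 0;
> > }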
> >
> > Configurations:
> >
> > We use rings of size 32768 between the threads. The rings are initialized
> > as SP/SC, and writes are done in bursts of 512 with rte_ring_enqueue_burst.
> > The port is initialized with rte_eth_rx_queue_setup, nb_rx_desc=8192, and
> > the following rte_eth_rxconf:
> >         rx_conf.rx_thresh.pthresh = DPDK_NIC_RX_PTHRESH; // ring prefetch threshold
> >         rx_conf.rx_thresh.hthresh = DPDK_NIC_RX_HTHRESH; // ring host threshold
> >         rx_conf.rx_thresh.wthresh = DPDK_NIC_RX_WTHRESH; // ring writeback threshold
> >         rx_conf.rx_free_thresh = DPDK_NIC_RX_FREE_THRESH;
> > RSS: ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP
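> >
> > For illustration, roughly how that setup looks in code (the threshold
> > values shown are placeholders; the real values come from our
> > DPDK_NIC_RX_* defines):
> >
> > #include <rte_ethdev.h>
> > #include <rte_lcore.h>
> > #include <rte_ring.h>
> >
> > static int setup_rx(uint16_t port_id, struct rte_mempool *mbuf_pool,
> >                     struct rte_ring **ring)
> > {
> >         struct rte_eth_rxconf rx_conf = {
> >                 .rx_thresh = { .pthresh = 8, .hthresh = 8, .wthresh = 0 },
> >                 .rx_free_thresh = 64,
> >         };
> >
> >         /* single-producer / single-consumer ring between the two threads */
> >         *ring = rte_ring_create("rx_to_worker", 32768, rte_socket_id(),
> >                                 RING_F_SP_ENQ | RING_F_SC_DEQ);
> >         if (*ring == NULL)
> >                 return -1;
> >
> >         return rte_eth_rx_queue_setup(port_id, 0, 8192,
> >                                       rte_eth_dev_socket_id(port_id),
> >                                       &rx_conf, mbuf_pool);
> > }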
> >
> >
> > We tried to work with and without hyperthreading.
> >
> > ****************************************
> >
> > Network devices using kernel driver
> > ===================================
> > 0000:37:00.0 'MT27800 Family [ConnectX-5] 1017' if=ens2f0
> > drv=mlx5_core unused=igb_uio
> > 0000:37:00.1 'MT27800 Family [ConnectX-5] 1017' if=ens2f1
> > drv=mlx5_core unused=igb_uio
> >
> > ****************************************
> >
> > ethtool -i ens2f0
> > driver: mlx5_core
> > version: 5.3-1.0.0
> > firmware-version: 16.30.1004 (HPE0000000009)
> > expansion-rom-version:
> > bus-info: 0000:37:00.0
> > supports-statistics: yes
> > supports-test: yes
> > supports-eeprom-access: no
> > supports-register-dump: no
> > supports-priv-flags: yes
> >
> > ****************************************
> >
> > uname -a
> > Linux localhost.localdomain 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19
> > 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> >
> > ****************************************
> >
> > lscpu | grep -e Socket -e Core -e Thread
> > Thread(s) per core:    1
> > Core(s) per socket:    24
> > Socket(s):             2
> >
> > ****************************************
> > cat /sys/devices/system/node/node0/cpulist
> > 0-23
> > ****************************************
> > From /proc/cpuinfo
> >
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 85
> > model name      : Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
> > stepping        : 7
> > microcode       : 0x5003003
> > cpu MHz         : 2200.000
> >
> > ****************************************
> >
> > python /home/cpu_layout.py
> > ======================================================================
> > Core and Socket Information (as reported by '/sys/devices/system/cpu')
> > ======================================================================
> >
> > cores =  [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 25, 26, 27, 28, 29, 24]
> > sockets =  [0, 1]
> >
> >         Socket 0    Socket 1
> >         --------    --------
> > Core 0  [0]         [24]
> > Core 1  [1]         [25]
> > Core 2  [2]         [26]
> > Core 3  [3]         [27]
> > Core 4  [4]         [28]
> > Core 5  [5]         [29]
> > Core 6  [6]         [30]
> > Core 8  [7]
> > Core 9  [8]         [31]
> > Core 10 [9]         [32]
> > Core 11 [10]        [33]
> > Core 12 [11]        [34]
> > Core 13 [12]        [35]
> > Core 16 [13]        [36]
> > Core 17 [14]        [37]
> > Core 18 [15]        [38]
> > Core 19 [16]        [39]
> > Core 20 [17]        [40]
> > Core 21 [18]        [41]
> > Core 25 [19]        [43]
> > Core 26 [20]        [44]
> > Core 27 [21]        [45]
> > Core 28 [22]        [46]
> > Core 29 [23]        [47]
> > Core 24             [42]
>

