[mlx5] CX6 NIC bug, the process exits abnormally.
jiangheng (G)
jiangheng14 at huawei.com
Thu Sep 14 14:08:37 CEST 2023
Hi
During a stress test of a ConnectX-6 (CX6) NIC with DPDK, our process exits abnormally. We traced the problem to a bug in the DPDK mlx5 driver. Could you please check whether the latest firmware and driver fix this core dump?
By default, DPDK enables the vectorized Rx/Tx path (rxtx_vec) and CQE compression, and the receive ring buffer has 1024 entries. Under load, the service process receives SIGSEGV (signal 11) and exits.
Call stack information:
#2 0x0000000000e72437 in signal_captured_function (signo=11, si=0x7f6310f46eb0, ucontext=0x7f6310f46d80) at ../v1/handle_signal.c:499
#3 <signal handler called>
#4 _mm_storeu_si128 (__B=..., __P=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/include/emmintrin.h:720
#5 rxq_cq_decompress_v (elts=0x20217ff394e8, cq=0x20217f8538c0, rxq=0x20217ff36e00) at ../drivers/net/mlx5/mlx5_rxtx_vec_sse.h:159
#6 rxq_burst_v (no_cq=<synthetic pointer>, err=<synthetic pointer>, pkts_n=9, pkts=0x2004e278c9d8, rxq=0x20217ff36e00) at ../drivers/net/mlx5/mlx5_rxtx_vec.c:349
#7 mlx5_rx_burst_vec (dpdk_rxq=0x20217ff36e00, pkts=0x2004e278c9d8, pkts_n=128) at ../drivers/net/mlx5/mlx5_rxtx_vec.c:393
#8 0x0000000001086448 in rte_eth_rx_burst (nb_pkts=128, rx_pkts=0x2004e278c9d8, queue_id=7, port_id=<optimized out>) at ../include/dpdk/rte_ethdev.h:5339
Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Version:
[root@localhost ~]# ofed_info -s
MLNX_OFED_LINUX-23.04-0.5.3.3:
[root@localhost ~]# ethtool -i eth6|grep fir
firmware-version: 22.37.1014 (MT_0000000359)
dpdk version: DPDK 21.11
../drivers/net/mlx5/mlx5_rxtx_vec_sse.h:159
157: /* B.1 store rearm data to mbuf. */
158: _mm_storeu_si128((__m128i *)&elts[pos + 2]->rearm_data, rearm);
159: _mm_storeu_si128((__m128i *)&elts[pos + 3]->rearm_data, rearm);
Root cause: While processing compressed CQEs, 9 mini CQEs have to be handled, so the entries (*rxq->elts)[1021] through (*rxq->elts)[1028] are accessed. However, only indices [0, 1027] are allocated when the receive queue is initialized. The out-of-bounds entry at index 1028 holds a NULL pointer, and the store through it at line 159 causes the core dump.
[Image attachment: image001.png]
(gdb) p elts[0]
$149 = (struct rte_mbuf *) 0x2006945a8000 //first round
(gdb) p elts[1]
$150 = (struct rte_mbuf *) 0x2006945aa1c0
(gdb) p elts[2]
$151 = (struct rte_mbuf *) 0x2006945ac380
(gdb) p elts[3]
$152 = (struct rte_mbuf *) 0x20217ff36f80
(gdb) p elts[4]
$153 = (struct rte_mbuf *) 0x20217ff36f80 //Second round
(gdb) p elts[5]
$154 = (struct rte_mbuf *) 0x20217ff36f80
(gdb) p elts[6]
$155 = (struct rte_mbuf *) 0x20217ff36f80
(gdb) p elts[7]
$156 = (struct rte_mbuf *) 0x0 //coredump
(gdb) p elts - (*rxq->elts)
$157 = 1021