dpdk mlx5 driver crash in rxq_cq_decompress_v

Xiaoping Yan (NSB) xiaoping.yan at nokia-sbell.com
Fri Oct 13 05:25:11 CEST 2023


Hi Alex,

Do the values 0x22 and 0x2d identify the error type?
Unexpected CQE error syndrome 0x22
Unexpected CQE error syndrome 0x2d
I found the definitions below, but they appear to be for Windows (drivers\common\mlx5\windows\mlx5_win_defs.h), and I’m using Linux.
enum {
    MLX5_CQE_SYNDROME_LOCAL_LENGTH_ERR        = 0x01,
    MLX5_CQE_SYNDROME_LOCAL_QP_OP_ERR         = 0x02,
    MLX5_CQE_SYNDROME_LOCAL_PROT_ERR          = 0x04,
    MLX5_CQE_SYNDROME_WR_FLUSH_ERR            = 0x05,
    MLX5_CQE_SYNDROME_MW_BIND_ERR             = 0x06,
    MLX5_CQE_SYNDROME_BAD_RESP_ERR            = 0x10,
    MLX5_CQE_SYNDROME_LOCAL_ACCESS_ERR        = 0x11,
    MLX5_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR    = 0x12,
    MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR       = 0x13,
    MLX5_CQE_SYNDROME_REMOTE_OP_ERR           = 0x14,
    MLX5_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR = 0x15,
    MLX5_CQE_SYNDROME_RNR_RETRY_EXC_ERR       = 0x16,
    MLX5_CQE_SYNDROME_REMOTE_ABORTED_ERR      = 0x22,
};
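
As a rough aid for reading the log lines above, here is a minimal C sketch that maps a syndrome byte to a name using only the values from the enum quoted above. The helper cqe_syndrome_name and its table are hypothetical (not DPDK API), and since the values come from the Windows header, it is an assumption that they also apply on Linux. Note that 0x2d does not appear in that enum, so the sketch reports it as unknown.

```c
#include <stddef.h>
#include <stdint.h>

/* One row per syndrome value listed in the quoted enum. */
struct syndrome_entry {
    uint8_t value;
    const char *name;
};

/* Values copied from the enum above; assumed (not verified) to match Linux. */
static const struct syndrome_entry syndrome_table[] = {
    { 0x01, "LOCAL_LENGTH_ERR" },
    { 0x02, "LOCAL_QP_OP_ERR" },
    { 0x04, "LOCAL_PROT_ERR" },
    { 0x05, "WR_FLUSH_ERR" },
    { 0x06, "MW_BIND_ERR" },
    { 0x10, "BAD_RESP_ERR" },
    { 0x11, "LOCAL_ACCESS_ERR" },
    { 0x12, "REMOTE_INVAL_REQ_ERR" },
    { 0x13, "REMOTE_ACCESS_ERR" },
    { 0x14, "REMOTE_OP_ERR" },
    { 0x15, "TRANSPORT_RETRY_EXC_ERR" },
    { 0x16, "RNR_RETRY_EXC_ERR" },
    { 0x22, "REMOTE_ABORTED_ERR" },
};

/* Hypothetical helper: return the name for a syndrome, or "UNKNOWN". */
static const char *cqe_syndrome_name(uint8_t syndrome)
{
    size_t n = sizeof(syndrome_table) / sizeof(syndrome_table[0]);

    for (size_t i = 0; i < n; i++) {
        if (syndrome_table[i].value == syndrome)
            return syndrome_table[i].name;
    }
    return "UNKNOWN";
}
```

With this table, cqe_syndrome_name(0x22) yields "REMOTE_ABORTED_ERR", while cqe_syndrome_name(0x2d) falls through to "UNKNOWN".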

Do I understand correctly that, with the commit below (“net/mlx5: ignore non-critical syndromes for Rx queue”),
syndromes 0x22 and 0x2d will not cause an Rx queue reset?

commit aa67ed3084588e6ca12e9709a6cab021f0ffeba7
Author: Alexander Kozyrev <akozyrev at nvidia.com>
Date:   Fri Jan 27 05:22:43 2023 +0200

    net/mlx5: ignore non-critical syndromes for Rx queue

    For non-fatal syndromes like LOCAL_LENGTH_ERR, the Rx queue reset
    shouldn't be triggered. Rx queue could continue with the next packets
    without any recovery. Only three syndromes warrant Rx queue reset:
    LOCAL_QP_OP_ERR, LOCAL_PROT_ERR and WR_FLUSH_ERR.
    Do not initiate a Rx queue reset in any other cases.
    Skip all non-critical error CQEs and continue with packet processing.
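
The rule described in the commit message can be sketched as follows. This is an illustration only, not the driver's code: the function name syndrome_needs_rxq_reset is hypothetical, and the numeric values are taken from the enum quoted earlier in this mail (from the Windows header, assumed here to apply on Linux as well).

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the quoted enum; only the three named in the commit message. */
enum {
    SYNDROME_LOCAL_QP_OP_ERR = 0x02,
    SYNDROME_LOCAL_PROT_ERR  = 0x04,
    SYNDROME_WR_FLUSH_ERR    = 0x05,
};

/*
 * Hypothetical helper: true only for the three syndromes that, per the
 * commit message, warrant an Rx queue reset. Every other syndrome is
 * treated as non-critical, so the error CQE would simply be skipped.
 */
static bool syndrome_needs_rxq_reset(uint8_t syndrome)
{
    switch (syndrome) {
    case SYNDROME_LOCAL_QP_OP_ERR:
    case SYNDROME_LOCAL_PROT_ERR:
    case SYNDROME_WR_FLUSH_ERR:
        return true;
    default:
        return false;
    }
}
```

If the driver follows the commit message literally, both 0x22 and 0x2d land in the default branch of this sketch, meaning no Rx queue reset.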



Br, Xiaoping

From: Alexander Kozyrev <akozyrev at nvidia.com>
Sent: July 27, 2023 20:43
To: Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>
Cc: Matan Azrad <matan at nvidia.com>; users at dpdk.org
Subject: [External] RE: dpdk mlx5 driver crash in rxq_cq_decompress_v



CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See http://nok.it/nsb for additional information.




Hi Xiaoping, it looks like you sometimes receive a packet that exceeds the VF MTU, resulting in an error CQE and a subsequent crash.
Could you please cherry-pick “547b239a21 net/mlx5: ignore non-critical syndromes for Rx queue” to your DPDK version?
This should fix the crash by not resetting the Rx queue in this scenario. Alternatively, you can set the VF MTU to 9000 to match the PF.

Regards,
Alex

From: Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>
Sent: Tuesday, July 11, 2023 10:24 PM
To: Alexander Kozyrev <akozyrev at nvidia.com>
Cc: Matan Azrad <matan at nvidia.com>; users at dpdk.org
Subject: RE: dpdk mlx5 driver crash in rxq_cq_decompress_v

Hi Alex,

PF MTU is 9000, VF MTU is 2000 (I also tried 1500 and got the same crash).
Here is the test topology.
Traffic pattern:
downlink: GTP-U packet length 1236, throughput 3.65 Gbps
uplink: GTP-U packet length 302, throughput 0.47 Gbps

Crash seen in the l2hicu Test2 container.
The commit (547b239a21) you mentioned in the other mail thread is not included in my DPDK version.

[cid:image001.png at 01D9FDC7.0523ECE0]


Br, Xiaoping

From: Alexander Kozyrev <akozyrev at nvidia.com>
Sent: July 12, 2023 4:48
To: Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>
Cc: Matan Azrad <matan at nvidia.com>; users at dpdk.org
Subject: [External] RE: dpdk mlx5 driver crash in rxq_cq_decompress_v

Hi Xiaoping, I cannot reproduce the issue locally; all the fixes for CQE recovery are already part of 22.11.2.
Would you mind sharing more information about your setup, test case and traffic characteristics? Do you have a VF/PF MTU mismatch?

Regards,
Alex

From: Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>
Sent: Tuesday, July 4, 2023 10:57 PM
To: Alexander Kozyrev <akozyrev at nvidia.com>
Subject: RE: dpdk mlx5 driver crash in rxq_cq_decompress_v

Hi Alex,

Here is the CQE

Br, Xiaoping

From: Alexander Kozyrev <akozyrev at nvidia.com>
Sent: July 5, 2023 9:45
To: Matan Azrad <matan at nvidia.com>; Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>; users at dpdk.org; Dekel Peled <dekelp at nvidia.com>
Subject: [External] RE: dpdk mlx5 driver crash in rxq_cq_decompress_v

Hi Xiaoping, could you please forward the error CQE dump to me?
Would you mind elaborating more on your traffic pattern and test case scenario?
The following commit is supposed to ignore the MTU mismatch error between VF and PF:
547b239a21 net/mlx5: ignore non-critical syndromes for Rx queue

Regards,
Alex

From: Matan Azrad <matan at nvidia.com>
Sent: Sunday, July 2, 2023 11:35 PM
To: Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>; users at dpdk.org; Dekel Peled <dekelp at nvidia.com>; Alexander Kozyrev <akozyrev at nvidia.com>
Subject: Re: dpdk mlx5 driver crash in rxq_cq_decompress_v

+ @Alexander Kozyrev to suggest.

________________________________
From: Xiaoping Yan (NSB) <xiaoping.yan at nokia-sbell.com>
Sent: Monday, July 3, 2023 4:18:22 AM
To: users at dpdk.org; Matan Azrad <matan at nvidia.com>; dekelp at nvidia.com
Subject: RE: dpdk mlx5 driver crash in rxq_cq_decompress_v

External email: Use caution opening links or attachments



Hi,



@dekelp at nvidia.com @Matan Azrad Can you kindly suggest?

Thank you.



Br, Xiaoping



From: Xiaoping Yan (NSB)
Sent: June 27, 2023 12:11
To: users at dpdk.org; 'Matan Azrad' <matan at nvidia.com>; 'dekelp at nvidia.com' <dekelp at nvidia.com>
Subject: dpdk mlx5 driver crash in rxq_cq_decompress_v



Hi,



dpdk version in use: 21.11.2



The mlx5 driver crashes in rxq_cq_decompress_v after several minutes of a traffic test.

Stack trace:

(gdb) bt
#0  0x00007ffff58612bc in _mm_storeu_si128 (__B=..., __P=<optimized out>)
    at /usr/lib/gcc/x86_64-redhat-linux/12/include/emmintrin.h:739
#1  rxq_cq_decompress_v (rxq=rxq@entry=0x2abe5592f40, cq=cq@entry=0x2abe54fdb00, elts=elts@entry=0x2abe5594638)
    at ../dpdk-21.11/drivers/net/mlx5/mlx5_rxtx_vec_sse.h:142
#2  0x00007ffff5862c84 in rxq_burst_v (no_cq=<synthetic pointer>, err=0x7fffffffb848, pkts_n=4, pkts=<optimized out>,
    rxq=0x2abe5592f40) at ../dpdk-21.11/drivers/net/mlx5/mlx5_rxtx_vec.c:349
#3  mlx5_rx_burst_vec (dpdk_rxq=0x2abe5592f40, pkts=0x7fffffffbf80, pkts_n=32) at ../dpdk-21.11/drivers/net/mlx5/mlx5_rxtx_vec.c:393
#4  0x00005555556a0f41 in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7fffffffbf80, queue_id=0, port_id=1)
    at /usr/include/rte_ethdev.h:5721
…

Attached is the error log “Unexpected CQE error syndrome…” and dump file



I found a similar bug here: https://bugs.dpdk.org/show_bug.cgi?id=334

But the fix (88c0733535d6 extend Rx completion with error handling) should already be included, as I’m using 21.11.2.

Also, the commit below (a fix to 88c0733535d6) is already included in my DPDK version.

commit 60b254e3923d007bcadbb8d410f95ad89a2f13fa
Author: Matan Azrad <matan at nvidia.com>
Date:   Thu Aug 11 19:51:55 2022 +0300

    net/mlx5: fix Rx queue recovery mechanism



Any suggestions?

Thank you.



Br, Xiaoping


-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 172082 bytes
Desc: image001.png
URL: <http://mails.dpdk.org/archives/users/attachments/20231013/1144b8ac/attachment-0001.png>

