<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:"Calibri",sans-serif;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoPlainText">Hi Gaetan<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Sorry for very late reply, we were busy working on 21.11 integration.<o:p></o:p></p>
<p class="MsoPlainText">Although we have adopted this code internally for us but I am sharing the patch to opensource for community benefit.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">This is specific case of AZURE setup with our very customized complex environment.<o:p></o:p></p>
<p class="MsoPlainText"><o:p></o:p></p>
<p class="MsoPlainText">Let me share the logs with traceback first<o:p></o:p></p>
<table class="MsoTableGrid" border="1" cellspacing="0" cellpadding="0" style="border-collapse:collapse;border:none">
<tbody>
<tr>
<td width="875" valign="top" style="width:656.6pt;border:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt">
<p class="MsoPlainText">SECONDARY PROCESS<o:p></o:p></p>
<p class="MsoPlainText">timestamp=1633598184<o:p></o:p></p>
<p class="MsoPlainText">TCZ0.0.0 Cycle 152 (Build 1832)<o:p></o:p></p>
<p class="MsoPlainText">signal 11 (Segmentation fault), address is 0x31117bbce6c8 from 0x47d08b1<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">[bt]: ( 1) _Z18snprintf_backtraceRPciiP9siginfo_tPv (+ 0xf4) - sp = 0x7fffef3fd110, ip = 0x3acdc54<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 2) _Z13crit_err_hdlriP9siginfo_tPv (+ 0x159) - sp = 0x7fffef3fdc20, ip = 0x3acdf29<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 3) _ZN13SignalAdapter12handleSignalEiP9siginfo_tPv (+ 0x104) - sp = 0x7fffef3fdf00, ip = 0x274d4c4<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 4) _L_unlock_18 (+ 0x2c) - sp = 0x7fffef3fdf80, ip = 0x7ffff7bce630<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 5) rte_eth_dev_attach_secondary (+ 0x21) - sp = 0x7fffef3fec50, ip = 0x47d08b1<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 6) rte_eth_from_ring (+ 0x3438) - sp = 0x7fffef3fec80, ip = 0x4e49da8<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 7) _init (+ 0xa1b8) - sp = 0x7fffef3feec0, ip = 0x12e0368<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 8) local_dev_probe (+ 0xac) - sp = 0x7fffef3feef0, ip = 0x478fd2c<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 9) rte_uuid_unparse (+ 0x274) - sp = 0x7fffef3fef30, ip = 0x47a3e94<o:p></o:p></p>
<p class="MsoPlainText">[bt]: (10) rte_eal_vfio_get_vf_token (+ 0xd7) - sp = 0x7fffef3ff110, ip = 0x47b04b7<o:p></o:p></p>
<p class="MsoPlainText">[bt]: (11) eal_hugepage_info_read (+ 0x602) - sp = 0x7fffef3ff170, ip = 0x47b2cd2<o:p></o:p></p>
<p class="MsoPlainText">[bt]: (12) start_thread (+ 0xc5) - sp = 0x7fffef3ff220, ip = 0x7ffff7bc6ea5<o:p></o:p></p>
<p class="MsoPlainText">[bt]: (13) clone (+ 0x6d) - sp = 0x7fffef3ff2c0, ip = 0x7ffff004096d<o:p></o:p></p>
<p class="MsoPlainText">EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket:eal_dev_mp_request<o:p></o:p></p>
<p class="MsoPlainText">EAL: Cannot send request to primary<o:p></o:p></p>
<p class="MsoPlainText">EAL: Failed to send hotplug request to primary<o:p></o:p></p>
<p class="MsoPlainText">net_failsafe: Failed to probe devargs net_tap_vsc0<o:p></o:p></p>
<p class="MsoPlainText">EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket:eal_dev_mp_request<o:p></o:p></p>
<p class="MsoPlainText">EAL: Cannot send request to primary<o:p></o:p></p>
<p class="MsoPlainText">EAL: Failed to send hotplug request to primary<o:p></o:p></p>
<p class="MsoPlainText">net_failsafe: Failed to probe devargs net_tap_vsc1<o:p></o:p></p>
<p class="MsoPlainText">EAL: No legacy callbacks, legacy socket not created<o:p></o:p></p>
<p class="MsoPlainText">EAL: Drop mp reply: eal_dev_mp_request<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">PRIMARY PROCESS<o:p></o:p></p>
<p class="MsoPlainText">timestamp=1633598196<o:p></o:p></p>
<p class="MsoPlainText">TCZ0.0.0 Cycle 152 (Build 1832)<o:p></o:p></p>
<p class="MsoPlainText">signal 11 (Segmentation fault), address is 0x38 from 0x9d8fbe<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">[bt]: ( 1) _Z18snprintf_backtraceRPciiP9siginfo_tPv (+ 0xf4) - sp = 0x7fffecf41150, ip = 0x100dd44<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 2) _Z13crit_err_hdlriP9siginfo_tPv (+ 0x159) - sp = 0x7fffecf41c60, ip = 0x100e019<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 3) _ZN13SignalAdapter12handleSignalEiP9siginfo_tPv (+ 0x104) - sp = 0x7fffecf41f40, ip = 0xff4894<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 4) _L_unlock_18 (+ 0x2c) - sp = 0x7fffecf41fc0, ip = 0x7ffff61d9630<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 5) failsafe_eth_dev_close (+ 0x65e) - sp = 0x7fffecf42c90, ip = 0x9d8fbe<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 6) rte_eth_link_get_nowait (+ 0x6a) - sp = 0x7fffecf42cf0, ip = 0x62fa0a<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 7) _ZN11StatsThread9statsLoopEP10CustomObject (+ 0x33e) - sp = 0x7fffecf42d20, ip = 0xedea2e<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 8) _ZN11StatsThread9statsLoopEP10CustomObject (+ 0x8dc) - sp = 0x7fffecf42d90, ip = 0xedefcc<o:p></o:p></p>
<p class="MsoPlainText">[bt]: ( 9) ThreadFunction (+ 0xe6) - sp = 0x7fffecf42db0, ip = 0x7ffff6b477e6<o:p></o:p></p>
<p class="MsoPlainText">[bt]: (10) start_thread (+ 0xc5) - sp = 0x7fffecf42de0, ip = 0x7ffff61d1ea5<o:p></o:p></p>
<p class="MsoPlainText">[bt]: (11) clone (+ 0x6d) - sp = 0x7fffecf42e80, ip = 0x7ffff0a6b96d<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">DPDK 20.11.2<o:p></o:p></p>
<p class="MsoPlainText">core mask is 00000000000000000000000000004000<o:p></o:p></p>
<p class="MsoPlainText">DPDK Custom Process initialized with 2 ports<o:p></o:p></p>
<p class="MsoPlainText">the min max TxQ is maxTxQueues 16<o:p></o:p></p>
<p class="MsoPlainText">Using 1 RxQs for port 0 (# F-core=1)<o:p></o:p></p>
<p class="MsoPlainText">Using 1 RxQs for port 3 (# F-core=1)<o:p></o:p></p>
<p class="MsoPlainText">Core 14 (port=0, rxQ=0) kni_ring=(nil)<o:p></o:p></p>
<p class="MsoPlainText">Core 14 (port=3, rxQ=0) kni_ring=(nil)<o:p></o:p></p>
<p class="MsoPlainText">Core 14 txN = 0<o:p></o:p></p>
<p class="MsoPlainText">Thread for core 14 using ring from usbc of 0x31117b29bb00<o:p></o:p></p>
<p class="MsoPlainText">Ring size must be powers of 2, adjusting from 8196 to 16384<o:p></o:p></p>
<p class="MsoPlainText">Thread for core 14 using ring from MEDIA of 0x31117b27b840<o:p></o:p></p>
<p class="MsoPlainText">Encaps Memory Zone= 48044 sizeof encaps = 60<o:p></o:p></p>
<p class="MsoPlainText">Trace Memory Zone= 272<o:p></o:p></p>
<p class="MsoPlainText">Policy Memory Zone= 8196 sizeof policy = 240<o:p></o:p></p>
<p class="MsoPlainText">link status for port 0 is 1<o:p></o:p></p>
<p class="MsoPlainText">link status for port 3 is 1<o:p></o:p></p>
<p class="MsoPlainText">PORT 0 supports 16 rx queues and 16 tx queues (driver_name = net_failsafe, driver_type = 16)<o:p></o:p></p>
<p class="MsoPlainText">PORT 0 is polling for link-change, interrupts disabled<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] tap_flow_create(): Kernel refused TC filter rule creation (17): File exists<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] net_failsafe: Failed to create flow on sub_device 1<o:p></o:p></p>
<p class="MsoPlainText">add_flow(): create() fails for port 0; Reason: overlapping rules or Kernel too old for flower support<o:p></o:p></p>
<p class="MsoPlainText">Error adding broadcast flow<o:p></o:p></p>
<p class="MsoPlainText">PORT 3 supports 16 rx queues and 16 tx queues (driver_name = net_failsafe, driver_type = 16)<o:p></o:p></p>
<p class="MsoPlainText">PORT 3 is polling for link-change, interrupts disabled<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: Failed to hotplug add device on primary<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] tap_flow_create(): Kernel refused TC filter rule creation (17): File exists<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] net_failsafe: Failed to create flow on sub_device 1<o:p></o:p></p>
<p class="MsoPlainText">add_flow(): create() fails for port 3; Reason: overlapping rules or Kernel too old for flower support<o:p></o:p></p>
<p class="MsoPlainText">Error adding broadcast flow<o:p></o:p></p>
<p class="MsoPlainText">Cmd Thread is available<o:p></o:p></p>
<p class="MsoPlainText">Capture object initialized<o:p></o:p></p>
<p class="MsoPlainText">init :Stats Thread is available<o:p></o:p></p>
<p class="MsoPlainText">ifLinkUpdate: Sending OperStatus for port=0 stat=1<o:p></o:p></p>
<p class="MsoPlainText">ifLinkUpdate: Port 0 Link Change - speed 40000 Mbps - full-duplex<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket_2934_298e9db8d1:eal_dev_mp_request<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: rte_mp_request_sync failed<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: Failed to send hotplug request to secondary<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: Fail to recv reply for request /var/run/dpdk/oracusbc/mp_socket_2934_298e9db8d1:eal_dev_mp_request<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: rte_mp_request_sync failed<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] EAL: Failed to hotplug add device on primary<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] Invalid port_id=2<o:p></o:p></p>
<p class="MsoPlainText">[DPDK] net_failsafe: Operation rte_eth_stats_get failed for sub_device 1 with error -19<o:p></o:p></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">There is some race at secondary process and primary got crashed because its data-structures and partially filled.<o:p></o:p></p>
<p class="MsoPlainText">Let me know if you need GDB analysis, I can share with next reply if you are still unsatisfied. GDB analysis will be bigger.<o:p></o:p></p>
<p class="MsoPlainText">Thanks!<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Regards<o:p></o:p></p>
<p class="MsoPlainText">Vipul<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">-----Original Message-----<br>
From: Gaëtan Rivet <grive@u256.net> <br>
Sent: Monday, November 22, 2021 3:53 PM<br>
To: Vipul Ashri <vipul.ashri@oracle.com>; dev@dpdk.org<br>
Cc: stable@dpdk.org<br>
Subject: [External] : Re: [PATCH v2] net/failsafe: link_update request crashing at boot</p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">On Thu, Oct 21, 2021, at 23:42, <a href="mailto:vipul.ashri@oracle.com">
<span style="color:windowtext;text-decoration:none">vipul.ashri@oracle.com</span></a> wrote:<o:p></o:p></p>
<p class="MsoPlainText">> From: Vipul Ashri <<a href="mailto:vipul.ashri@oracle.com"><span style="color:windowtext;text-decoration:none">vipul.ashri@oracle.com</span></a>><o:p></o:p></p>
<p class="MsoPlainText">><o:p> </o:p></p>
<p class="MsoPlainText">> failsafe crashed while sending early link_update request during boot
<o:p></o:p></p>
<p class="MsoPlainText">> time initialization.<o:p></o:p></p>
<p class="MsoPlainText">> Based on debugging we found failsafe device was good but sub- devices
<o:p></o:p></p>
<p class="MsoPlainText">> were progressing towards initialization and SUBOPS macro where
<o:p></o:p></p>
<p class="MsoPlainText">> expanding macro gives [partial_dev]->dev_ops->link_update()<o:p></o:p></p>
<p class="MsoPlainText">> execution of which triggered crash because dev_ops==0. similar crash
<o:p></o:p></p>
<p class="MsoPlainText">> seen at failsafe_eth_dev_close()<o:p></o:p></p>
<p class="MsoPlainText">><o:p> </o:p></p>
<p class="MsoPlainText">> Failsafe driver need a separate check for subdevices similar to
<o:p></o:p></p>
<p class="MsoPlainText">> "RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);" which is called
<o:p></o:p></p>
<p class="MsoPlainText">> to almost every eth_dev function.<o:p></o:p></p>
<p class="MsoPlainText">><o:p> </o:p></p>
<p class="MsoPlainText">> Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")<o:p></o:p></p>
<p class="MsoPlainText">> Cc: <a href="mailto:stable@dpdk.org"><span style="color:windowtext;text-decoration:none">stable@dpdk.org</span></a><o:p></o:p></p>
<p class="MsoPlainText">> Signed-off-by: Vipul Ashri <<a href="mailto:vipul.ashri@oracle.com"><span style="color:windowtext;text-decoration:none">vipul.ashri@oracle.com</span></a>><o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Hello Vipul,<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">I'm sorry for the delay, I missed your fix on the mailing list.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">IIUC, the issue is that failsafe finished init and received an ethdev operation call, but one of its sub-device, although marked DEV_ACTIVE, has its eth_dev->dev_ops field NULL.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">It is really surprising to me, because there aren't many ways for a sub-device to become DEV_ACTIVE.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">The only two ways are<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> * by executing 'fs_dev_configure()', which will first execute<o:p></o:p></p>
<p class="MsoPlainText"> rte_eth_dev_configure() on the sub-device, and on error would<o:p></o:p></p>
<p class="MsoPlainText"> stop *without* setting DEV_ACTIVE.<o:p></o:p></p>
<p class="MsoPlainText"> rte_eth_dev_configure() will itself execute<o:p></o:p></p>
<p class="MsoPlainText"> RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV), so it would<o:p></o:p></p>
<p class="MsoPlainText"> return negative errno and fs_dev_configure() would abort.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> * by executing 'fs_dev_remove()' and the sub-device was 'DEV_STARTED'<o:p></o:p></p>
<p class="MsoPlainText"> to begin with, then it is retrograded to DEV_ACTIVE once stopped.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">So I don't understand yet how it is possible for a sub-device to become DEV_ACTIVE while its eth_dev->dev_ops are NULL. It seems more like a bug, memory corruption or just an unexpected execution pattern.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Could describe in more detail the execution?<o:p></o:p></p>
<p class="MsoPlainText">In particular, setting the EAL log-level to debug with the option:<o:p></o:p></p>
<p class="MsoPlainText">' --log-level pmd.net.failsafe:debug '<o:p></o:p></p>
<p class="MsoPlainText">for example while using testpmd or your DPDK app.<o:p></o:p></p>
<p class="MsoPlainText">It should show ethdev level accesses to the sub-devices, and error values.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Best regards,<o:p></o:p></p>
<p class="MsoPlainText">--<o:p></o:p></p>
<p class="MsoPlainText">Gaetan Rivet<o:p></o:p></p>
</div>
</body>
</html>