<div dir="ltr"><div>Hello Qiming, Beilei,</div><div><br></div><div>Could you please help us debug this issue? Anything that would help us get to the bottom of what could be going wrong during port init/cleanup would be appreciated: extra EAL/testpmd options, or even code changes (such as places where we could add extra debug messages).</div><div><br></div><div>Thanks,</div><div>Juraj<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 8, 2023 at 7:25 AM Juraj Linkeš <juraj.linkes@pantheon.tech> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello Qiming, Beilei,<br>
<br>
Another reminder - are you looking at this by any chance?<br>
<br>
In short, testpmd/l3fwd breaks the link between two servers, while<br>
VPP (which also uses DPDK) doesn't. This leads us to believe the<br>
problem is in testpmd/l3fwd or in the i40e driver in DPDK.<br>
<br>
Thanks,<br>
Juraj<br>
<br>
On Tue, Feb 21, 2023 at 12:18 PM Juraj Linkeš<br>
<juraj.linkes@pantheon.tech> wrote:<br>
><br>
> Hi Qiming,<br>
><br>
> Just a friendly reminder, would you please take a look?<br>
><br>
> Thanks,<br>
> Juraj<br>
><br>
><br>
> On Tue, Feb 7, 2023 at 3:10 AM Xing, Beilei <<a href="mailto:beilei.xing@intel.com" target="_blank">beilei.xing@intel.com</a>> wrote:<br>
> ><br>
> > Hi Qiming,<br>
> ><br>
> > Could you please help on this? Thanks.<br>
> ><br>
> > BR,<br>
> > Beilei<br>
> ><br>
> > > -----Original Message-----<br>
> > > From: Juraj Linkeš <juraj.linkes@pantheon.tech><br>
> > > Sent: Monday, February 6, 2023 4:53 PM<br>
> > > To: Singh, Aman Deep <<a href="mailto:aman.deep.singh@intel.com" target="_blank">aman.deep.singh@intel.com</a>>; Zhang, Yuying<br>
> > > <<a href="mailto:yuying.zhang@intel.com" target="_blank">yuying.zhang@intel.com</a>>; Xing, Beilei <<a href="mailto:beilei.xing@intel.com" target="_blank">beilei.xing@intel.com</a>><br>
> > > Cc: <a href="mailto:dev@dpdk.org" target="_blank">dev@dpdk.org</a>; Ruifeng Wang <<a href="mailto:Ruifeng.Wang@arm.com" target="_blank">Ruifeng.Wang@arm.com</a>>; Zhang, Lijian<br>
> > > <<a href="mailto:Lijian.Zhang@arm.com" target="_blank">Lijian.Zhang@arm.com</a>>; Honnappa Nagarahalli<br>
> > > <<a href="mailto:Honnappa.Nagarahalli@arm.com" target="_blank">Honnappa.Nagarahalli@arm.com</a>><br>
> > > Subject: Re: Testpmd/l3fwd port shutdown failure on Arm Altra systems<br>
> > ><br>
> > > Hello i40e and testpmd maintainers,<br>
> > ><br>
> > > A gentle reminder - would you please advise how to debug the issue described<br>
> > > below?<br>
> > ><br>
> > > Thanks,<br>
> > > Juraj<br>
> > ><br>
> > > On Fri, Jan 20, 2023 at 1:07 PM Juraj Linkeš <juraj.linkes@pantheon.tech><br>
> > > wrote:<br>
> > > ><br>
> > > > Adding the logfile.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > One thing that's in the logs but that I didn't explicitly mention is the DPDK version<br>
> > > we've tried this with:<br>
> > > ><br>
> > > > EAL: RTE Version: 'DPDK 22.07.0'<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We also tried earlier versions going back to 21.08, and the issue occurs there too. A<br>
> > > quick check on 22.11 showed the same failure.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Juraj<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > From: Juraj Linkeš<br>
> > > > Sent: Friday, January 20, 2023 12:56 PM<br>
> > > > To: '<a href="mailto:aman.deep.singh@intel.com" target="_blank">aman.deep.singh@intel.com</a>' <<a href="mailto:aman.deep.singh@intel.com" target="_blank">aman.deep.singh@intel.com</a>>;<br>
> > > > '<a href="mailto:yuying.zhang@intel.com" target="_blank">yuying.zhang@intel.com</a>' <<a href="mailto:yuying.zhang@intel.com" target="_blank">yuying.zhang@intel.com</a>>; Xing, Beilei<br>
> > > > <<a href="mailto:beilei.xing@intel.com" target="_blank">beilei.xing@intel.com</a>><br>
> > > > Cc: <a href="mailto:dev@dpdk.org" target="_blank">dev@dpdk.org</a>; Ruifeng Wang <<a href="mailto:Ruifeng.Wang@arm.com" target="_blank">Ruifeng.Wang@arm.com</a>>; 'Lijian Zhang'<br>
> > > > <<a href="mailto:Lijian.Zhang@arm.com" target="_blank">Lijian.Zhang@arm.com</a>>; 'Honnappa Nagarahalli'<br>
> > > > <<a href="mailto:Honnappa.Nagarahalli@arm.com" target="_blank">Honnappa.Nagarahalli@arm.com</a>><br>
> > > > Subject: Testpmd/l3fwd port shutdown failure on Arm Altra systems<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Hello i40e and testpmd maintainers,<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We're hitting an issue with DPDK testpmd on Ampere Altra servers in the FD.io<br>
> > > lab.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > A bit of background: along with performance tests of VPP (which uses DPDK),<br>
> > > we're running a small number of basic DPDK testpmd and l3fwd tests in FD.io<br>
> > > as well. This is to catch any performance differences due to VPP updating its<br>
> > > DPDK version.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We're running both l3fwd tests and testpmd tests. The Altra servers are<br>
> > > two-socket and the topology is TG -> DUT1 -> DUT2 -> TG. Traffic is sent in both<br>
> > > directions, but nothing gets forwarded (with a slight caveat, which I'll return to below).<br>
> > > There's nothing special in the tests, just forwarding traffic. The NIC we're<br>
> > > testing is xl710-QDA2.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > The same tests are passing on all other testbeds - we have various two-node<br>
> > > (1 DUT, 1 TG) and three-node (2 DUT, 1 TG) Intel and Arm testbeds with<br>
> > > various NICs (Intel 700 and 800 series; the Intel testbeds use some<br>
> > > Mellanox NICs as well). We don't have another three-node testbed with<br>
> > > the same NIC though, so it looks like the issue is specific to the combination of<br>
> > > testpmd/l3fwd and xl710-QDA2 on Altra servers.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > VPP performance tests are passing, but l3fwd and testpmd fail. This leads us<br>
> > > to believe it's a software issue, but there could be something wrong with the<br>
> > > hardware. I'll talk about testpmd from now on, but as far as we can tell, the<br>
> > > behavior is the same for testpmd and l3fwd.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Getting back to the caveat mentioned earlier, there seems to be something<br>
> > > wrong with port shutdown. When running testpmd on a testbed that hasn't<br>
> > > been used for a while it seems that all ports are up right away (we don't see<br>
> > > any "Port 0|1: link state change event") and the setup works fine (forwarding<br>
> > > works). After restarting testpmd (restarting on one server is sufficient), the<br>
> > > ports between DUT1 and DUT2 (but not between DUTs and TG) go down and<br>
> > > are not usable in DPDK, VPP or in Linux (with i40e kernel driver) for a while<br>
> > > (measured in minutes, sometimes dozens of minutes; the duration is seemingly<br>
> > > random). The ports eventually recover and can be used again, but there's<br>
> > > nothing in syslog suggesting what happened.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > What seems to be happening is that testpmd puts the ports into some faulty state.<br>
> > > This only happens on the DUT1 -> DUT2 link though (the ports between the<br>
> > > two testpmds), not on TG -> DUT1 link (the TG port is left alone).<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Some more info:<br>
> > > ><br>
> > > > We've come across the issue with this configuration:<br>
> > > ><br>
> > > > OS: Ubuntu 20.04 with kernel 5.4.0-65-generic.<br>
> > > ><br>
> > > > Old NIC firmware, never upgraded: 6.01 0x800035da 1.1747.0.<br>
> > > ><br>
> > > > Driver versions: i40e 2.17.15 and iavf 4.3.19.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > As well as with this configuration:<br>
> > > ><br>
> > > > OS: Ubuntu 22.04 with kernel 5.15.0-46-generic.<br>
> > > ><br>
> > > > Updated firmware: 8.30 0x8000a4ae 1.2926.0.<br>
> > > ><br>
> > > > Drivers: i40e 2.19.3 and iavf 4.5.3.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Unsafe noiommu mode is disabled:<br>
> > > ><br>
> > > > cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode<br>
> > > ><br>
> > > > N<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We used DPDK 22.07 in manual testing and built it on the DUTs, using a generic<br>
> > > build:<br>
> > > ><br>
> > > > meson -Dexamples=l3fwd -Dc_args=-DRTE_LIBRTE_I40E_16BYTE_RX_DESC=y<br>
> > > > -Dplatform=generic build<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We're running testpmd with this command:<br>
> > > ><br>
> > > > sudo build/app/dpdk-testpmd -v -l 1,2 -a 0004:04:00.1 -a 0004:04:00.0<br>
> > > > --in-memory -- -i --forward-mode=io --burst=64 --txq=1 --rxq=1<br>
> > > > --tx-offloads=0x0 --numa --auto-start --total-num-mbufs=32768<br>
> > > > --nb-ports=2 --portmask=0x3 --max-pkt-len=1518 --mbuf-size=16384<br>
> > > > --nb-cores=1<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > And l3fwd (with different MAC addresses on the other server):<br>
> > > ><br>
> > > > sudo /tmp/openvpp-testing/dpdk/build/examples/dpdk-l3fwd -v -l 1,2 -a<br>
> > > > 0004:04:00.0 -a 0004:04:00.1 --in-memory -- --parse-ptype<br>
> > > > --eth-dest="0,40:a6:b7:85:e7:79" --eth-dest="1,3c:fd:fe:c3:e7:a1"<br>
> > > > --config="(0, 0, 2),(1, 0, 2)" -P -L -p 0x3<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We tried adding logs with --log-level=pmd,debug and --no-lsc-interrupt, but<br>
> > > that didn't reveal anything helpful, as far as we can tell - please have a look at<br>
> > > the attached log. The faulty port is port0 (starts out as down, then we waited<br>
> > > for around 25 minutes for it to go up and then we shut down testpmd).<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > We'd like to ask for pointers on what could be the cause or how to debug<br>
> > > this issue further.<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Thanks,<br>
> > > > Juraj<br>
</blockquote></div>