[dpdk-dev] [EXT] RE: [dpdk-stable] [PATCH] lib/distributor: fix deadlock issue for aarch64

Ruifeng Wang (Arm Technology China) Ruifeng.Wang at arm.com
Thu Oct 17 15:48:52 CEST 2019


Hi Harman,

Thank you for testing this.

> -----Original Message-----
> From: Harman Kalra <hkalra at marvell.com>
> Sent: Thursday, October 17, 2019 19:42
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang at arm.com>
> Cc: David Marchand <david.marchand at redhat.com>; Aaron Conole
> <aconole at redhat.com>; David Hunt <david.hunt at intel.com>; dev
> <dev at dpdk.org>; Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>; nd
> <nd at arm.com>; dpdk stable <stable at dpdk.org>
> Subject: Re: [EXT] RE: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix
> deadlock issue for aarch64
> 
> Hi
> 
> I tested this patch, following are my observations:
> 1. With this patch, the distributor_autotest hang on the arm64 platform
> is resolved. But continuous execution of this test results in a test failure, as
> reported by Aaron.
> 2. While testing on the x86 platform, I can still observe distributor_autotest
> getting suspended (stuck) on continuous execution of the test (it took almost
> 7-8 iterations to reproduce the suspension).

Yes, this v1 patch does not fully solve the issue.
I have posted v3:
http://patches.dpdk.org/project/dpdk/list/?series=6856
With the new patch set, I didn't observe any test failures in my testing.
Could you try that?
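
For readers following the thread, the kind of cross-core synchronization
that the patch description quoted below refers to can be sketched roughly
as follows. This is only a minimal illustration using C11 atomics and
pthreads, not the code from the patch and not the distributor library's
API: struct slot, worker(), and run_handshake() are hypothetical names,
and the single-slot mailbox is an assumed simplification of the shared
cache-line data. The point is that the flag is written with release
ordering and read with acquire ordering, so a weakly ordered CPU such as
aarch64 cannot make the flag visible before the payload it guards.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical single-slot mailbox shared by one producer and one worker. */
struct slot {
    uint64_t payload;   /* plain data, published through `ready` */
    atomic_int ready;   /* 0 = slot empty, 1 = payload valid */
};

struct worker_args {
    struct slot *s;
    int rounds;
    uint64_t sum;       /* written by the worker, read after join */
};

static void *worker(void *arg)
{
    struct worker_args *w = arg;

    for (int i = 0; i < w->rounds; i++) {
        /* Acquire pairs with the producer's release store: once
         * ready == 1 is observed, the payload write is visible too. */
        while (atomic_load_explicit(&w->s->ready,
                                    memory_order_acquire) != 1)
            sched_yield();

        w->sum += w->s->payload;

        /* Release: the payload read completes before the slot is
         * handed back to the producer. */
        atomic_store_explicit(&w->s->ready, 0, memory_order_release);
    }
    return NULL;
}

/* Send the values 1..rounds through the slot; return the worker's sum. */
uint64_t run_handshake(int rounds)
{
    struct slot s = { .payload = 0 };
    struct worker_args w = { .s = &s, .rounds = rounds, .sum = 0 };
    pthread_t t;

    atomic_init(&s.ready, 0);
    if (pthread_create(&t, NULL, worker, &w) != 0)
        return 0;

    for (int i = 1; i <= rounds; i++) {
        /* Wait until the worker has released the slot. */
        while (atomic_load_explicit(&s.ready,
                                    memory_order_acquire) != 0)
            sched_yield();

        s.payload = (uint64_t)i;
        /* Publish the payload: release orders the write above
         * before the flag update. */
        atomic_store_explicit(&s.ready, 1, memory_order_release);
    }

    pthread_join(t, NULL);
    return w.sum;
}
```

If the stores and loads on the flag were plain accesses instead, either
side could observe the flag change without the matching data, which is
the kind of lost update or hang discussed in this thread.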

Thanks.
/Ruifeng
> 
> Thanks
> 
> On Wed, Oct 09, 2019 at 05:52:03AM +0000, Ruifeng Wang (Arm Technology
> China) wrote:
> > > -----Original Message-----
> > > From: David Marchand <david.marchand at redhat.com>
> > > Sent: Wednesday, October 9, 2019 03:47
> > > To: Aaron Conole <aconole at redhat.com>
> > > Cc: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang at arm.com>;
> > > David Hunt <david.hunt at intel.com>; dev <dev at dpdk.org>;
> > > hkalra at marvell.com; Gavin Hu (Arm Technology China)
> > > <Gavin.Hu at arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>; dpdk stable
> > > <stable at dpdk.org>
> > > Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix
> > > deadlock issue for aarch64
> > >
> > > On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <aconole at redhat.com> wrote:
> > > >
> > > > Ruifeng Wang <ruifeng.wang at arm.com> writes:
> > > >
> > > > > Distributor and worker threads rely on data structs in a cache
> > > > > line for synchronization. The shared data structs were not protected.
> > > > > This caused a deadlock on platforms with weaker memory ordering,
> > > > > such as aarch64.
> > > > > Fix this issue by adding memory barriers to ensure
> > > > > synchronization among cores.
> > > > >
> > > > > Bugzilla ID: 342
> > > > > Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
> > > > > Cc: stable at dpdk.org
> > > > >
> > > > > Signed-off-by: Ruifeng Wang <ruifeng.wang at arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu at arm.com>
> > > > > ---
> > > >
> > > > I see a failure in the distributor_autotest (on one of the builds):
> > > >
> > > > 64/82 DPDK:fast-tests / distributor_autotest  FAIL     0.37 s (exit status 255 or signal 127 SIGinvalid)
> > > >
> > > > --- command ---
> > > >
> > > > DPDK_TEST='distributor_autotest'
> > > > /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1
> > > > --file-prefix=distributor_autotest
> > > >
> > > > --- stdout ---
> > > >
> > > > EAL: Probing VFIO support...
> > > >
> > > > APP: HPET is not enabled, using TSC as default timer
> > > >
> > > > RTE>>distributor_autotest
> > > >
> > > > === Basic distributor sanity tests ===
> > > >
> > > > Worker 0 handled 32 packets
> > > >
> > > > Sanity test with all zero hashes done.
> > > >
> > > > Worker 0 handled 32 packets
> > > >
> > > > Sanity test with non-zero hashes done
> > > >
> > > > === testing big burst (single) ===
> > > >
> > > > Sanity test of returned packets done
> > > >
> > > > === Sanity test with mbuf alloc/free (single) ===
> > > >
> > > > Sanity test with mbuf alloc/free passed
> > > >
> > > > Too few cores to run worker shutdown test
> > > >
> > > > === Basic distributor sanity tests ===
> > > >
> > > > Worker 0 handled 32 packets
> > > >
> > > > Sanity test with all zero hashes done.
> > > >
> > > > Worker 0 handled 32 packets
> > > >
> > > > Sanity test with non-zero hashes done
> > > >
> > > > === testing big burst (burst) ===
> > > >
> > > > Sanity test of returned packets done
> > > >
> > > > === Sanity test with mbuf alloc/free (burst) ===
> > > >
> > > > Line 326: Packet count is incorrect, 1048568, expected 1048576
> > > >
> > > > Test Failed
> > > >
> > > > RTE>>
> > > >
> > > > --- stderr ---
> > > >
> > > > EAL: Detected 2 lcore(s)
> > > >
> > > > EAL: Detected 1 NUMA nodes
> > > >
> > > > EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
> > > >
> > > > EAL: Selected IOVA mode 'PA'
> > > >
> > > > EAL: No available hugepages reported in hugepages-1048576kB
> > > >
> > > > -------
> > > >
> > > > Not sure how to help debug further.  I'll re-start the job to see
> > > > if it 'clears' up - but I guess there may be a delicate
> > > > synchronization somewhere that needs to be accounted.
> > >
> > > Idem, and with the same loop I used before, it can be caught quickly.
> > >
> > > # time (log=/tmp/$$.log; while true; do echo distributor_autotest | taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f $log)
> > >
> > Thanks Aaron and David for your report. I can reproduce this issue with the
> > script.
> > Will fix it in the next version.
> >
> > > [snip]
> > >
> > > RTE>>distributor_autotest
> > > EAL: Trying to obtain current memory policy.
> > > EAL: Setting policy MPOL_PREFERRED for socket 0
> > > EAL: Restoring previous memory policy: 0
> > > EAL: request: mp_malloc_sync
> > > EAL: Heap on socket 0 was expanded by 2MB
> > > EAL: Trying to obtain current memory policy.
> > > EAL: Setting policy MPOL_PREFERRED for socket 0
> > > EAL: Restoring previous memory policy: 0
> > > EAL: alloc_pages_on_heap(): couldn't allocate physically contiguous space
> > > EAL: Trying to obtain current memory policy.
> > > EAL: Setting policy MPOL_PREFERRED for socket 0
> > > EAL: Restoring previous memory policy: 0
> > > EAL: request: mp_malloc_sync
> > > EAL: Heap on socket 0 was expanded by 8MB
> > > === Basic distributor sanity tests ===
> > > Worker 0 handled 32 packets
> > > Sanity test with all zero hashes done.
> > > Worker 0 handled 32 packets
> > > Sanity test with non-zero hashes done
> > > === testing big burst (single) ===
> > > Sanity test of returned packets done
> > >
> > > === Sanity test with mbuf alloc/free (single) ===
> > > Sanity test with mbuf alloc/free passed
> > >
> > > Too few cores to run worker shutdown test
> > > === Basic distributor sanity tests ===
> > > Worker 0 handled 32 packets
> > > Sanity test with all zero hashes done.
> > > Worker 0 handled 32 packets
> > > Sanity test with non-zero hashes done
> > > === testing big burst (burst) ===
> > > Sanity test of returned packets done
> > >
> > > === Sanity test with mbuf alloc/free (burst) ===
> > > Line 326: Packet count is incorrect, 1048568, expected 1048576
> > > Test Failed
> > > RTE>>
> > > real    0m36.668s
> > > user    1m7.293s
> > > sys    0m1.560s
> > >
> > > Could be worth running this loop on all tests? (not talking about
> > > the CI, it would be a manual effort to catch lurking issues).
> > >
> > >
> > > --
> > > David Marchand

