[dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing

Aaron Conole aconole at redhat.com
Mon Oct 7 14:38:41 CEST 2019


"Van Haaren, Harry" <harry.van.haaren at intel.com> writes:

>> -----Original Message-----
>> From: Aaron Conole [mailto:aconole at redhat.com]
>> Sent: Wednesday, September 4, 2019 8:56 PM
>> To: David Marchand <david.marchand at redhat.com>
>> Cc: Van Haaren, Harry <harry.van.haaren at intel.com>; dev at dpdk.org
>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
>> failing
>> 
>> David Marchand <david.marchand at redhat.com> writes:
>> 
>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
>> > <david.marchand at redhat.com> wrote:
>> >>
>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>> >> <harry.van.haaren at intel.com> wrote:
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: Aaron Conole [mailto:aconole at redhat.com]
>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
>> >> > > To: Van Haaren, Harry <harry.van.haaren at intel.com>
>> >> > > Cc: dev at dpdk.org
>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>> >> > >
>> >> > > Hi Harry,
>> >> >
>> >> > Hey Aaron,
>> >> >
>> >> > > I noticed as part of series_6218
>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the
>> travis
>> >> > > build had a single failure, in service_autotest but it doesn't seem
>> >> > > related to the series at all.
>> >> > >
>> >> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>> >> > >
>> >> > > Not sure if there's some kind of debugging we can add or look at to
>> help
>> >> > > diagnose failures when they occur.  Do you have time to have a look?
>> >> >
>> >> > Thanks for flagging this.
>> >> >
>> >> > I've just re-run the unit tests here multiple times to see if I can
>> >> > reproduce something strange, no luck on reproducing the issue.
>> >> >
>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>> >> > still no issues found.
>> >> >
>> >> > Building with Clang-7 and Shared libs (instead of default static)
>> >> > still no issues found.
>> >> >
>> >> > If somebody can reproduce please send an update to here and I'll
>> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
>> >>
>> >> You have to be patient, but I caught it on my laptop:
>> >>
>> >
>> > Ok, and now with the logs:
>> >
>> >
>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
>> > $log)
>> > EAL: Detected lcore 0 as core 0 on socket 0
>> > EAL: Detected lcore 1 as core 1 on socket 0
>> > EAL: Detected lcore 2 as core 2 on socket 0
> <snip verbose logs>
>> > EAL:   probe driver: 8086:15d7 net_e1000_em
>> > EAL:   Not managed by a supported kernel driver, skipped
>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
>> > APP: HPET is not enabled, using TSC as default timer
>> > RTE>>service_autotest
>> >  + ------------------------------------------------------- +
>> >  + Test Suite : service core test suite
>> >  + ------------------------------------------------------- +
>> >  + TestCase [ 0] : unregister_all succeeded
>> >  + TestCase [ 1] : service_name succeeded
>> >  + TestCase [ 2] : service_get_by_name succeeded
>> > Service dummy_service Summary
>> >   dummy_service: stats 1    calls 0    cycles 0    avg: 0
>> > Service dummy_service Summary
>> >   dummy_service: stats 0    calls 0    cycles 0    avg: 0
>> >  + TestCase [ 3] : service_dump succeeded
>> >  + TestCase [ 4] : service_attr_get succeeded
>> >  + TestCase [ 5] : service_lcore_attr_get succeeded
>> >  + TestCase [ 6] : service_probe_capability succeeded
>> >  + TestCase [ 7] : service_start_stop succeeded
>> >  + TestCase [ 8] : service_lcore_add_del skipped
>> >  + TestCase [ 9] : service_lcore_start_stop succeeded
>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
>> > core function call had no effect.
>> >  + TestCase [10] : service_lcore_en_dis_able failed
>> >  + TestCase [11] : service_mt_unsafe_poll skipped
>> >  + TestCase [12] : service_mt_safe_poll skipped
>> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
>> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>> >  + TestCase [15] : service_may_be_active succeeded
>> >  + ------------------------------------------------------- +
>> >  + Test Suite Summary
>> >  + Tests Total :       16
>> >  + Tests Skipped :      3
>> >  + Tests Executed :    16
>> >  + Tests Unsupported:   0
>> >  + Tests Passed :      12
>> >  + Tests Failed :       1
>> >  + ------------------------------------------------------- +
>> > Test Failed
>> > RTE>>EAL: request: mp_malloc_sync
>> > EAL: Heap on socket 0 was shrunk by 2MB
>> >
>> > real    2m42.884s
>> > user    5m1.902s
>> > sys    0m2.208s
>> 
>> I can confirm - takes about 1m to fail.
>
>
> Hi Aaron and David,
>
> I've been attempting to reproduce this, still no errors here.
>
> Given the nature of service-cores, and the difficulty to reproduce
> here this feels like a race-condition - one that may not exist in all
> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
>
> I'm using Meson to build, so reproducing using this instead of the command
> as provided above. There should be no difference in reproducing due to this:

The command runs far more iterations than meson does (I think).

I still see it periodically occur in the travis environment.

I did see at least one missing memory barrier (I believe).  Please
review the following code change (and if you agree I can submit it
formally):

-----
--- a/lib/librte_eal/common/eal_common_launch.c
+++ b/lib/librte_eal/common/eal_common_launch.c
@@ -21,8 +21,10 @@
 int
 rte_eal_wait_lcore(unsigned slave_id)
 {
-       if (lcore_config[slave_id].state == WAIT)
+       if (lcore_config[slave_id].state == WAIT) {
+               rte_rmb();
                return 0;
+       }
 
        while (lcore_config[slave_id].state != WAIT &&
               lcore_config[slave_id].state != FINISHED)
-----

This is because in lib/librte_eal/linux/eal/eal_thread.c:

-----
		/* when a service core returns, it should go directly to WAIT
		 * state, because the application will not lcore_wait() for it.
		 */
		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
			lcore_config[lcore_id].state = WAIT;
		else
			lcore_config[lcore_id].state = FINISHED;
-----

NOTE that the service core skips the rte_eal_wait_lcore() code from
making the FINISHED->WAIT transition.  So I think at least that read
barrier will be needed (maybe I miss the pairing, though?).

Additionally, I'm wondering if there is an additional write or sync
barrier needed to ensure that some of the transitions are properly
recorded when using lcore as a service lcore function.  The fact that
this only happens occasionally tells me that it's either a race (which
is possible... because the variable update in the test might not be
sync'd across cores or something), or some other missing
synchronization.

> $ meson test service_autotest --repeat 50
>
> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
> ...
> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
>
> OK:        50
> FAIL:       0
> SKIP:       0
> TIMEOUT:    0
>
> I'll keep it running for a few hours but I have little faith if it only
> takes 1 minute on your machines...

Please try the flat command.

> Regards, -Harry


More information about the dev mailing list