[dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing

Aaron Conole aconole at redhat.com
Mon Oct 14 16:53:48 CEST 2019


Aaron Conole <aconole at redhat.com> writes:

> "Van Haaren, Harry" <harry.van.haaren at intel.com> writes:
>
>>> -----Original Message-----
>>> From: Aaron Conole [mailto:aconole at redhat.com]
>>> Sent: Wednesday, September 4, 2019 8:56 PM
>>> To: David Marchand <david.marchand at redhat.com>
>>> Cc: Van Haaren, Harry <harry.van.haaren at intel.com>; dev at dpdk.org
>>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
>>> failing
>>> 
>>> David Marchand <david.marchand at redhat.com> writes:
>>> 
>>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
>>> > <david.marchand at redhat.com> wrote:
>>> >>
>>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>>> >> <harry.van.haaren at intel.com> wrote:
>>> >> >
>>> >> > > -----Original Message-----
>>> >> > > From: Aaron Conole [mailto:aconole at redhat.com]
>>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
>>> >> > > To: Van Haaren, Harry <harry.van.haaren at intel.com>
>>> >> > > Cc: dev at dpdk.org
>>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>>> >> > >
>>> >> > > Hi Harry,
>>> >> >
>>> >> > Hey Aaron,
>>> >> >
>>> >> > > I noticed as part of series_6218
>>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the
>>> >> > > travis build had a single failure, in service_autotest, but it
>>> >> > > doesn't seem related to the series at all.
>>> >> > >
>>> >> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>>> >> > >
>>> >> > > Not sure if there's some kind of debugging we can add or look at
>>> >> > > to help diagnose failures when they occur.  Do you have time to
>>> >> > > have a look?
>>> >> >
>>> >> > Thanks for flagging this.
>>> >> >
>>> >> > I've just re-run the unit tests here multiple times to see if I can
>>> >> > reproduce something strange, no luck on reproducing the issue.
>>> >> >
>>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>>> >> > still no issues found.
>>> >> >
>>> >> > Building with Clang-7 and Shared libs (instead of default static)
>>> >> > still no issues found.
>>> >> >
>>> >> > If somebody can reproduce please send an update to here and I'll
>>> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
>>> >>
>>> >> You have to be patient, but I caught it on my laptop:
>>> >>
>>> >
>>> > Ok, and now with the logs:
>>> >
>>> >
>>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
>>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>>> > >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
>>> > $log)
>>> > EAL: Detected lcore 0 as core 0 on socket 0
>>> > EAL: Detected lcore 1 as core 1 on socket 0
>>> > EAL: Detected lcore 2 as core 2 on socket 0
>> <snip verbose logs>
>>> > EAL:   probe driver: 8086:15d7 net_e1000_em
>>> > EAL:   Not managed by a supported kernel driver, skipped
>>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
>>> > APP: HPET is not enabled, using TSC as default timer
>>> > RTE>>service_autotest
>>> >  + ------------------------------------------------------- +
>>> >  + Test Suite : service core test suite
>>> >  + ------------------------------------------------------- +
>>> >  + TestCase [ 0] : unregister_all succeeded
>>> >  + TestCase [ 1] : service_name succeeded
>>> >  + TestCase [ 2] : service_get_by_name succeeded
>>> > Service dummy_service Summary
>>> >   dummy_service: stats 1    calls 0    cycles 0    avg: 0
>>> > Service dummy_service Summary
>>> >   dummy_service: stats 0    calls 0    cycles 0    avg: 0
>>> >  + TestCase [ 3] : service_dump succeeded
>>> >  + TestCase [ 4] : service_attr_get succeeded
>>> >  + TestCase [ 5] : service_lcore_attr_get succeeded
>>> >  + TestCase [ 6] : service_probe_capability succeeded
>>> >  + TestCase [ 7] : service_start_stop succeeded
>>> >  + TestCase [ 8] : service_lcore_add_del skipped
>>> >  + TestCase [ 9] : service_lcore_start_stop succeeded
>>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
>>> > core function call had no effect.
>>> >  + TestCase [10] : service_lcore_en_dis_able failed
>>> >  + TestCase [11] : service_mt_unsafe_poll skipped
>>> >  + TestCase [12] : service_mt_safe_poll skipped
>>> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
>>> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>>> >  + TestCase [15] : service_may_be_active succeeded
>>> >  + ------------------------------------------------------- +
>>> >  + Test Suite Summary
>>> >  + Tests Total :       16
>>> >  + Tests Skipped :      3
>>> >  + Tests Executed :    16
>>> >  + Tests Unsupported:   0
>>> >  + Tests Passed :      12
>>> >  + Tests Failed :       1
>>> >  + ------------------------------------------------------- +
>>> > Test Failed
>>> > RTE>>EAL: request: mp_malloc_sync
>>> > EAL: Heap on socket 0 was shrunk by 2MB
>>> >
>>> > real    2m42.884s
>>> > user    5m1.902s
>>> > sys    0m2.208s
>>> 
>>> I can confirm - takes about 1m to fail.
>>
>>
>> Hi Aaron and David,
>>
>> I've been attempting to reproduce this, still no errors here.
>>
>> Given the nature of service-cores, and the difficulty reproducing it
>> here, this feels like a race condition - one that may not manifest in all
>> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
>>
>> I'm using Meson to build, so I'm reproducing with that instead of the
>> command provided above. That should make no difference to reproducing it:
>
> The command runs far more iterations than meson does (I think).
>
> I still see it periodically occur in the travis environment.
>
> I did see at least one missing memory barrier (I believe).  Please
> review the following code change (and if you agree I can submit it
> formally):
>
> -----
> --- a/lib/librte_eal/common/eal_common_launch.c
> +++ b/lib/librte_eal/common/eal_common_launch.c
> @@ -21,8 +21,10 @@
>  int
>  rte_eal_wait_lcore(unsigned slave_id)
>  {
> -       if (lcore_config[slave_id].state == WAIT)
> +       if (lcore_config[slave_id].state == WAIT) {
> +               rte_rmb();
>                 return 0;
> +       }
>  
>         while (lcore_config[slave_id].state != WAIT &&
>                lcore_config[slave_id].state != FINISHED)
> -----
>
> This is because in lib/librte_eal/linux/eal/eal_thread.c:
>
> -----
> 		/* when a service core returns, it should go directly to WAIT
> 		 * state, because the application will not lcore_wait() for it.
> 		 */
> 		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> 			lcore_config[lcore_id].state = WAIT;
> 		else
> 			lcore_config[lcore_id].state = FINISHED;
> -----
>
> NOTE that a service core therefore skips the FINISHED->WAIT transition
> that rte_eal_wait_lcore() would otherwise perform.  So I think at least
> that read barrier will be needed (maybe I'm missing the pairing, though?).
>
> Additionally, I'm wondering if there is an additional write or sync
> barrier needed to ensure that some of the transitions are properly
> recorded when using lcore as a service lcore function.  The fact that
> this only happens occasionally tells me that it's either a race (which
> is possible... because the variable update in the test might not be
> sync'd across cores or something), or some other missing
> synchronization.
>
>> $ meson test service_autotest --repeat 50
>>
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
>> ...
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
>>
>> OK:        50
>> FAIL:       0
>> SKIP:       0
>> TIMEOUT:    0
>>
>> I'll keep it running for a few hours but I have little faith if it only
>> takes 1 minute on your machines...
>
> Please try the flat command.

Not sure if you've had any time to look at this.

I think there's a change we can make, but I'm not sure how it fits into
the overall service lcore design.

The proposal is to use a pthread_cond variable that blocks the thread
requesting the service function run.  The service function merely
signals the condition.  The requesting thread does a timed wait (up to
5s?) and can raise an error if the timeout is exceeded.  Otherwise, it
unblocks and can assume the test passed.  WDYT?  I think it works
better than the racy code currently in the test case.

>> Regards, -Harry

