[dpdk-dev] Random failure in service_autotest
Lukasz Wojciechowski
l.wojciechow at partner.samsung.com
Mon Jul 20 14:47:25 CEST 2020
On 20.07.2020 at 14:09, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Phil Yang <Phil.Yang at arm.com>
>> Sent: Saturday, July 18, 2020 9:34 AM
>> To: Aaron Conole <aconole at redhat.com>; Lukasz Wojciechowski
>> <l.wojciechow at partner.samsung.com>
>> Cc: David Marchand <david.marchand at redhat.com>; Van Haaren, Harry
>> <harry.van.haaren at intel.com>; Igor Romanov <igor.romanov at oktetlabs.ru>;
>> Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>; dev
>> <dev at dpdk.org>; Yigit, Ferruh <ferruh.yigit at intel.com>; nd <nd at arm.com>; nd
>> <nd at arm.com>
>> Subject: RE: [dpdk-dev] Random failure in service_autotest
>>
>>> -----Original Message-----
>>> From: Aaron Conole <aconole at redhat.com>
>>> Sent: Saturday, July 18, 2020 6:39 AM
>>> To: Lukasz Wojciechowski <l.wojciechow at partner.samsung.com>
>>> Cc: David Marchand <david.marchand at redhat.com>; Van Haaren Harry
>>> <harry.van.haaren at intel.com>; Igor Romanov
>>> <igor.romanov at oktetlabs.ru>; Honnappa Nagarahalli
>>> <Honnappa.Nagarahalli at arm.com>; Phil Yang <Phil.Yang at arm.com>; dev
>>> <dev at dpdk.org>; Ferruh Yigit <ferruh.yigit at intel.com>
>>> Subject: Re: [dpdk-dev] Random failure in service_autotest
>>>
>>> Lukasz Wojciechowski <l.wojciechow at partner.samsung.com> writes:
>>>
>>>> On 17.07.2020 at 17:19, David Marchand wrote:
>>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
>>>>> <david.marchand at redhat.com> wrote:
>>>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit at intel.com> wrote:
>>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
>>>>>>>> Hello Harry and guys who touched the service code recently :-)
>>>>>>>>
>>>>>>>> I spotted a failure for the service UT in Travis:
>>>>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>>>>>>>>
>>>>>>>> I found only a single instance of this failure and tried to reproduce
>>>>>>>> it with my usual "brute" active loop with no success so far.
>>>>>>> +1, I wasn't able to reproduce it in my environment but observed it
>>>>>>> in the Travis CI.
>>>>>>>
>>>>>>>> Any chance it could be due to recent changes?
>>>>>>>> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
>>>>>>>> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>>>>>> I can see more occurrences of the issue in the CI.
>>>>>> I just applied the patch changing the log level for test assert, in
>>>>>> the hope it will help.
>>>>> And... we just got one with logs:
>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
>>>>>
>>>>> EAL: Test assert service_lcore_attr_get line 396 failed:
>>>>> lcore_attr_get() didn't get correct loop count (zero)
>>>>>
>>>>> It looks like a race between the service core still running and the
>>>>> core resetting the loops attr.
>>>>>
>>>> Yes, it seems to be just a lack of patience in the test. It should
>>>> wait a bit for the lcore to stop before resetting the attrs.
>>>> Something like this should help:
>>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
>>>>
>>>>          rte_service_lcore_stop(slcore_id);
>>>>
>>>> +        /* wait for the service lcore to stop */
>>>> +        rte_delay_ms(200);
>>>> +
>>>>          TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>>>>                  "Valid lcore_attr_reset_all() didn't return success");
>>> Would an rte_eal_wait_lcore make sense? Overall, I really dislike
>>> sleeps because they can hide racy synchronization points.
>>
>> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPPED.
>> However, it will not terminate the service_run() procedure (there is a spinlock
>> in the service_run() MT_UNSAFE path).
>> So 'cs->loops' might increase after rte_service_lcore_attr_reset_all() is
>> called, which leads to this failure.
>> I think that if we move the loop counter update before the service_run()
>> procedure, it can avoid this conflict.
>>
>> I cannot reproduce this issue on my testbed, so I am not sure whether
>> something like this can help or not.
>> Please check the code below.
>>
>> diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
>> index 6a0e0ff..7b703dd 100644
>> --- a/lib/librte_eal/common/rte_service.c
>> +++ b/lib/librte_eal/common/rte_service.c
>> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
>>          while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
>>                          RUNSTATE_RUNNING) {
>>                  const uint64_t service_mask = cs->service_mask;
>> +                cs->loops++;
>>
>>                  for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>>                          if (!service_valid(i))
>> @@ -471,8 +472,6 @@
>>                          /* return value ignored as no change to code flow */
>>                          service_run(i, cs, service_mask, service_get(i), 1);
>>                  }
>> -
>> -                cs->loops++;
>>          }
>>
>>          return 0;
> Changing how the implementation counts loops is one option - but changing the
> implementation as a workaround for a unit test seems like the wrong way around.
I agree. We should fix the test, not the service implementation. Of
course it would be nice to do so without inserting sleeps, as they are
only a workaround for true synchronization.
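Aaron's rte_eal_wait_lcore() idea could give us that. A minimal sketch
of how the test could use it (untested here, and assuming
rte_eal_wait_lcore() joins a service lcore once rte_service_lcore_stop()
flips the runstate so the runner returns):

        rte_service_lcore_stop(slcore_id);

        /* Join the lcore instead of sleeping: this blocks until the
         * launched service runner function returns, so no further
         * cs->loops++ can race with the attribute reset below. */
        rte_eal_wait_lcore(slcore_id);

        TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
                        "Valid lcore_attr_reset_all() didn't return success");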
>
> Honnappa's suggestion in the other reply seems to not work here, as the "service_may_be_active"
> won't get updated if the service core is stopped after running its final iteration?
>
> I'd prefer to see the implementation actually improve, or workaround in the test code itself.
> For improving the implementation, I can suggest this patch: http://patches.dpdk.org/patch/74493/
>
>
> That leaves us with 3 options:
> 1) Add a sleep, allowing the service core to finish its delay() as part of the service it's running, and give it some grace time to finish
I'm not a fan of sleeps, but the test code is already full of them and
this would be the easiest fix.
> 2) Relax the check in the unit test, to allow for value <= (expected + 1), which would ignore the last cs->loops++ call.
> (Another way to think about option 2 is to allow for the non-determinism of the threading in the test.)
I don't like this one.
> 3) Use the linked patch to implement/add/fix a wait time in the stop() API
The patch adds new functionality to a public API function. I don't know
if it's a good step. And I don't like the solution either: defining a
1000 * 1 ms = 1 s limit for the service to finish is also bad, as it
bakes a static number into the code.
My proposal is to keep the thread_active flag and add a new function for
checking it. This way the final user (in our case the test app) would be
able to check that flag in a loop by itself. The final user who delegated
the service to run on an lcore knows what type of service it is and how
long it should wait for the service to stop.
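Roughly like this in the test (a sketch only; the accessor name
rte_service_lcore_may_be_active() and the TEST_TIMEOUT_MS limit are
hypothetical, standing in for whatever would expose the thread_active
flag):

        /* Stop the lcore, then poll the proposed flag accessor; the
         * caller picks the timeout instead of EAL hard-coding one. */
        rte_service_lcore_stop(slcore_id);

        uint64_t waited_ms = 0;
        while (rte_service_lcore_may_be_active(slcore_id) == 1) {
                if (waited_ms++ >= TEST_TIMEOUT_MS)
                        return TEST_FAILED;
                rte_delay_ms(1);
        }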
Or we can provide a synchronous function, using condition variables, to
know when the service has stopped, but that would also require changing
the behavior of the API.
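Something along these lines inside the library (an illustrative sketch
with made-up names, not the current rte_service implementation):

        #include <pthread.h>
        #include <stdbool.h>

        struct lcore_sync {
                pthread_mutex_t lock;
                pthread_cond_t stopped_cond;
                bool thread_active;
        };

        /* The service runner signals when it leaves its loop... */
        static void
        runner_exit_notify(struct lcore_sync *s)
        {
                pthread_mutex_lock(&s->lock);
                s->thread_active = false;
                pthread_cond_signal(&s->stopped_cond);
                pthread_mutex_unlock(&s->lock);
        }

        /* ...and a blocking stop waits for that signal. */
        static void
        service_lcore_stop_sync(struct lcore_sync *s)
        {
                pthread_mutex_lock(&s->lock);
                while (s->thread_active)
                        pthread_cond_wait(&s->stopped_cond, &s->lock);
                pthread_mutex_unlock(&s->lock);
        }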
>
> Strong opinions either way?
>
> Regards, -Harry
--
Lukasz Wojciechowski
Principal Software Engineer
Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow at partner.samsung.com