[dpdk-dev] Random failure in service_autotest
Aaron Conole
aconole at redhat.com
Sat Jul 18 00:38:43 CEST 2020
Lukasz Wojciechowski <l.wojciechow at partner.samsung.com> writes:
> On 17.07.2020 at 17:19, David Marchand wrote:
>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
>> <david.marchand at redhat.com> wrote:
>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit at intel.com> wrote:
>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
>>>>> Hello Harry and guys who touched the service code recently :-)
>>>>>
>>>>> I spotted a failure for the service UT in Travis:
>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>>>>>
>>>>> I found only a single instance of this failure and tried to reproduce
>>>>> it with my usual "brute" active loop with no success so far.
>>>> +1, I wasn't able to reproduce it in my environment but observed it in the
>>>> Travis CI.
>>>>
>>>>> Any chance it could be due to recent changes?
>>>>> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
>>>>> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>>> I can see more occurrences of the issue in the CI.
>>> I just applied the patch changing the log level for test assert, in
>>> the hope it will help.
>> And... we just got one with logs:
>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
>>
>> EAL: Test assert service_lcore_attr_get line 396 failed:
>> lcore_attr_get() didn't get correct loop count (zero)
>>
>> It looks like a race between the service core still running and the
>> core resetting the loops attr.
>>
> Yes, it seems the test is just not patient enough. It should wait a
> bit for the lcore to stop before resetting the attrs.
> Something like this should help:
> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
>
> rte_service_lcore_stop(slcore_id);
>
> + /* wait for the service lcore to stop */
> + rte_delay_ms(200);
> +
> TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> "Valid lcore_attr_reset_all() didn't return success");
Would an rte_eal_wait_lcore make sense? Overall, I really dislike
sleeps because they can hide racy synchronization points.
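
To make the concern concrete, here is a minimal sketch of the alternative to
a fixed sleep: poll an explicit "worker has stopped" flag, with a timeout,
before resetting the counter. It deliberately uses plain pthreads and C11
atomics rather than DPDK, so everything in it (struct worker, worker_main,
wait_until_stopped, stop_then_reset) is a hypothetical stand-in and not the
service-core API. Whether rte_eal_wait_lcore() is the right primitive for a
service lcore is left open here, as it is in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a service lcore; not a DPDK structure. */
struct worker {
    atomic_bool run_requested;   /* cleared by the control thread to request a stop */
    atomic_bool active;          /* cleared by the worker once its loop has exited */
    atomic_uint_fast64_t loops;  /* loop counter, analogous to the "loops" attr */
};

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Worker: count loops until asked to stop, then publish that it is done. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;

    while (atomic_load(&w->run_requested))
        atomic_fetch_add(&w->loops, 1);
    atomic_store(&w->active, false);
    return NULL;
}

/*
 * Poll until the worker reports it has stopped, checking roughly once per
 * millisecond. Returns 0 once stopped, -1 if it is still active after
 * timeout_ms, so a stuck worker is reported instead of silently ignored.
 */
static int wait_until_stopped(struct worker *w, int timeout_ms)
{
    for (int waited = 0; waited < timeout_ms; waited++) {
        if (!atomic_load(&w->active))
            return 0;
        sleep_ms(1);
    }
    return atomic_load(&w->active) ? -1 : 0;
}

/*
 * Stop the worker, wait for it to really stop, then reset the counter.
 * Returns the counter value seen after the reset (0 when nothing raced),
 * or -1 if the thread could not be started or did not stop in time.
 */
static int64_t stop_then_reset(void)
{
    struct worker w;
    pthread_t tid;
    int stopped;
    uint64_t after;

    atomic_init(&w.run_requested, true);
    atomic_init(&w.active, true);
    atomic_init(&w.loops, 0);
    if (pthread_create(&tid, NULL, worker_main, &w) != 0)
        return -1;

    sleep_ms(10);                            /* let the worker run for a while */
    atomic_store(&w.run_requested, false);   /* request the stop */
    stopped = wait_until_stopped(&w, 1000);  /* synchronize instead of sleeping blindly */
    if (stopped == 0)
        atomic_store(&w.loops, 0);           /* reset only once the worker is idle */

    pthread_join(tid, NULL);                 /* worker exits because run_requested is false */
    after = atomic_load(&w.loops);
    return stopped == 0 ? (int64_t)after : -1;
}
```

Calling stop_then_reset() from a test returns 0 when the reset was not
overwritten by a late increment. Compared with a fixed 200 ms delay, the wait
ends as soon as the worker is idle, and a worker that never stops produces an
explicit failure instead of an intermittent wrong loop count.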