rte_service unit test failing randomly
Mattias Rönnblom
mattias.ronnblom at ericsson.com
Wed Oct 5 23:33:56 CEST 2022
On 2022-10-05 22:52, Thomas Monjalon wrote:
> 05/10/2022 22:33, Mattias Rönnblom:
>> On 2022-10-05 21:14, David Marchand wrote:
>>> Hello,
>>>
>>> The service_autotest unit test has been failing randomly.
>>> This is not something new.
>>> We have been fixing this unit test and the service code, here and there.
>>> For some time we were "fine": the failures were rare.
>>>
>>> But recently (for the last two weeks at least), it started failing more
>>> frequently in UNH lab.
>>>
>>> The symptoms are linked to places where the unit test code is "waiting
>>> for some time":
>>>
>>> - service_lcore_attr_get:
>>> + TestCase [ 5] : service_lcore_attr_get failed
>>> EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
>>> not stopped after waiting.
>>>
>>>
>>> - service_may_be_active:
>>> + TestCase [15] : service_may_be_active failed
>>> ...
>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>> not stopped after 100ms
>>>
>>> Ideas?
>>>
>>>
>>> Thanks.
>>
>> Do you run the test suite in a controlled environment? I.e., one where
>> you can trust that the lcore threads aren't interrupted for long periods
>> of time.
>>
>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> CPU with other threads.
>
> You mean the tests cannot be interrupted?

I just took a very quick look, but it seems like the main thread can be,
but the worker lcore thread cannot be interrupted for anything close to
100 ms, or you risk a test failure.

> Then it looks very fragile.

Tests like this are by their very nature racy. If a test thread sends a
request to another thread, there is no way for it to decide when a
non-response should result in a test failure, unless the scheduling
latency of the receiving thread has an upper bound.

If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
a lot of matches. I bet there are more like the service core one, but
they allow for longer interruptions.

That said, 100 ms sounds very short. I don't see why this can't be a
lot longer.

...and that said, I would argue you still need a reasonably controlled
environment for the autotests. If you have a server that is arbitrarily
overloaded, maybe also with high memory pressure (and associated
instruction page faults and god-knows-what), the real-world worst-case
interruptions could be very long indeed. Seconds. Designing inherently
racy tests for that kind of environment will make them have very long
run times.
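
To illustrate the timeout trade-off, here is a rough sketch of a
poll-until-deadline wait. It is not the code in app/test and uses no
DPDK API: plain pthreads stand in for lcores, and every name in it
(wait_until, struct worker, stop_worker_with_timeout) is hypothetical.
The point is only that a generous upper bound costs nothing in the
common case, because the wait returns as soon as the condition holds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in milliseconds (unaffected by wall-clock jumps). */
static uint64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: poll pred(arg) about once per millisecond until
 * it returns true or timeout_ms has elapsed. Returns the final result.
 */
static bool wait_until(bool (*pred)(void *), void *arg, uint64_t timeout_ms)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	uint64_t deadline = now_ms() + timeout_ms;

	assert(pred != NULL);
	for (;;) {
		if (pred(arg))
			return true;
		if (now_ms() >= deadline)
			return pred(arg); /* one last check at the deadline */
		nanosleep(&pause, NULL);
	}
}

/* Stand-in for a worker lcore; not a DPDK structure. */
struct worker {
	atomic_bool stop_requested;
	atomic_bool stopped;
};

static void *worker_main(void *arg)
{
	struct worker *w = arg;

	while (!atomic_load(&w->stop_requested))
		sched_yield();
	atomic_store(&w->stopped, true);
	return NULL;
}

static bool worker_stopped(void *arg)
{
	struct worker *w = arg;

	return atomic_load(&w->stopped);
}

/* Start a worker, ask it to stop, wait. 0 on success, -1 on timeout. */
static int stop_worker_with_timeout(uint64_t timeout_ms)
{
	struct worker w;
	pthread_t tid;
	bool ok;

	atomic_init(&w.stop_requested, false);
	atomic_init(&w.stopped, false);
	if (pthread_create(&tid, NULL, worker_main, &w) != 0)
		return -1;

	atomic_store(&w.stop_requested, true);
	ok = wait_until(worker_stopped, &w, timeout_ms);
	pthread_join(tid, NULL);
	return ok ? 0 : -1;
}
```

With this shape, stop_worker_with_timeout(5000) normally returns within
a millisecond or so, and only a thread starved for the full five
seconds turns into a failure. The bound is still arbitrary, so the test
remains racy on a sufficiently overloaded machine.
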
> Could you please help make it more robust?

I can send a patch, if Harry can't.