rte_service unit test failing randomly
    Morten Brørup 
    mb at smartsharesystems.com
       
    Thu Oct  6 08:53:32 CEST 2022
    
    
  
> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
> Sent: Wednesday, 5 October 2022 23.34
> 
> On 2022-10-05 22:52, Thomas Monjalon wrote:
> > 05/10/2022 22:33, Mattias Rönnblom:
> >> On 2022-10-05 21:14, David Marchand wrote:
> >>> Hello,
> >>>
> >>> The service_autotest unit test has been failing randomly.
> >>> This is not something new.
[...]
> >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
> >>> not stopped after 100ms
> >>>
> >>> Ideas?
> >>>
> >>>
> >>> Thanks.
> >>
> >> Do you run the test suite in a controlled environment? I.e., one where
> >> you can trust that the lcore threads aren't interrupted for long periods
> >> of time.
> >>
> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
> >> CPU with other threads.
> >
> > You mean the tests cannot be interrupted?
> 
> I just took a very quick look, but it seems like the main thread can,
> but the worker lcore thread cannot be interrupted for anything close to
> 100 ms, or you risk a test failure.
> 
> > Then it looks very fragile.
> 
> Tests like this are by their very nature racy. If a test thread sends a
> request to another thread, there is no way for it to decide when a
> non-response should result in a test failure, unless the scheduling
> latency of the receiving thread has an upper bound.
> 
> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
> a lot of matches. I bet there are more like the service core one, but they
> allow for longer interruptions.
> 
> That said, 100 ms sounds very short. I don't see why this can't be a
> lot longer.
> 
> ...and that said, I would argue you still need a reasonably controlled
> environment for the autotests. If you have a server that is arbitrarily
> overloaded, maybe also with high memory pressure (and associated
> instruction page faults and god-knows-what), the real-world worst-case
> interruptions could be very long indeed. Seconds. Designing inherently
> racy tests for that kind of environment will make them have very long
> run times.

Forgive me if I am sidetracking a bit here. The issue under discussion seems to be about some threads waiting for other threads, and my question is not directly related to that.

I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?

For reference, the maximum packet rate at 40 Gbit/s is 59.52 M pkt/s (with minimum-size 64-byte Ethernet frames). If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if the ingress queue is not serviced while receiving at the maximum packet rate.
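
The numbers above can be reproduced with a few lines of arithmetic. This is only a sketch: the function and macro names are made up for illustration, and it assumes minimum-size 64-byte frames plus the usual 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap).

```c
#include <stdint.h>

/* Per-frame overhead on the wire: 7 B preamble + 1 B SFD + 12 B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u
/* Minimum Ethernet frame size, including the 4 B FCS. */
#define ETH_MIN_FRAME_BYTES 64u

/* Hypothetical helper: packets per second at a given link speed and frame size. */
static double max_packet_rate_pps(double link_bps, unsigned frame_bytes)
{
    return link_bps / (8.0 * (frame_bytes + ETH_WIRE_OVERHEAD_BYTES));
}

/* Hypothetical helper: time in microseconds until nb_desc Rx descriptors
 * are used up if nobody services the queue while packets arrive at pps. */
static double rx_ring_fill_time_us(unsigned nb_desc, double pps)
{
    return nb_desc / pps * 1e6;
}
```

Calling max_packet_rate_pps(40e9, ETH_MIN_FRAME_BYTES) gives about 59.52 M pkt/s, and rx_ring_fill_time_us(4096, ...) at that rate gives about 68.8 us, which matches the "ca. 70 us" figure.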

I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.

I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread. The main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
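
To make the idea concrete, here is a minimal sketch of the kind of busy-polling loop such a monitor could be built around. It is not the code posted in [1]: the struct and function names are hypothetical, and it uses plain clock_gettime() instead of the TSC so that it stands alone. The loop reads the clock back to back and treats any gap above a threshold as an interruption. Unlike the plan described above, it runs for a fixed duration itself rather than being stopped by the main thread; in a DPDK application I would assume it gets launched on the isolated worker lcore with rte_eal_remote_launch(), with the main thread collecting the result after rte_eal_wait_lcore().

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Result of one monitoring run (hypothetical structure, for illustration). */
struct noise_stats {
    uint64_t iterations;    /* number of back-to-back clock reads */
    uint64_t interruptions; /* gaps longer than the threshold */
    uint64_t max_gap_ns;    /* longest gap seen between two reads */
};

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Spin for duration_ns, recording how often and for how long the thread
 * appears to have been taken off the CPU. */
static struct noise_stats noise_monitor_run(uint64_t duration_ns,
                                            uint64_t threshold_ns)
{
    struct noise_stats st = {0, 0, 0};
    uint64_t start = now_ns();
    uint64_t prev = start;

    for (;;) {
        uint64_t cur = now_ns();
        uint64_t gap = cur - prev;

        st.iterations++;
        if (gap > st.max_gap_ns)
            st.max_gap_ns = gap;
        if (gap > threshold_ns)
            st.interruptions++;
        if (cur - start >= duration_ns)
            break;
        prev = cur;
    }
    return st;
}
```

A 10 minute run with a 10 us threshold would be noise_monitor_run(600ULL * 1000000000ULL, 10000ULL); the main thread would then print the three counters and exit.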

[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/