[dpdk-dev] Random failure in service_autotest

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Sat Jul 18 00:43:46 CEST 2020


<snip>

> Subject: Re: [dpdk-dev] Random failure in service_autotest
> 
> Lukasz Wojciechowski <l.wojciechow at partner.samsung.com> writes:
> 
> > > On 17.07.2020 at 17:19, David Marchand wrote:
> >> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >> <david.marchand at redhat.com> wrote:
> >>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit at intel.com> wrote:
> >>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>
> >>>>> I spotted a failure for the service UT in Travis:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>
> >>>>> I found only a single instance of this failure and tried to
> >>>>> reproduce it with my usual "brute" active loop with no success so far.
> >>>> +1, I wasn't able to reproduce it in my environment but observed it
> >>>> in the Travis CI.
> >>>>
> >>>>> Any chance it could be due to recent changes?
> >>>>> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> >>>>> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>> I can see more occurrences of the issue in the CI.
> >>> I just applied the patch changing the log level for test assert, in
> >>> the hope it will help.
> >> And... we just got one with logs:
> >> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>
> >> EAL: Test assert service_lcore_attr_get line 396 failed:
> >> lcore_attr_get() didn't get correct loop count (zero)
> >>
> >> It looks like a race between the service core still running and the
> >> core resetting the loops attr.
> >>
> > Yes, it seems to be just lack of patience of the test. It should wait
> > a bit for lcore to stop before resetting attrs.
> > Something like this should help:
> > @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >
> >          rte_service_lcore_stop(slcore_id);
> >
> > +       /* wait for the service lcore to stop */
> > +       rte_delay_ms(200);
> > +
> >          TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> >                            "Valid lcore_attr_reset_all() didn't return success");
> 
> Would an rte_eal_wait_lcore make sense?  Overall, I really dislike sleeps
> because they can hide racy synchronization points.
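I think something like the patch below might be better: it polls
rte_service_may_be_active() for up to 200ms and moves on as soon as the
service has stopped, instead of always sleeping for a fixed time.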

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..f0bedbe5e 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -384,6 +384,16 @@ service_lcore_attr_get(void)

        rte_service_lcore_stop(slcore_id);

+       /* give the service 200ms to stop running */
+       for (i = 0; i < 200; i++) {
+               if (!rte_service_may_be_active(sid))
+                       break;
+               rte_delay_ms(SERVICE_DELAY);
+       }
+
+       TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
+                         "Error: Service not stopped after 200ms");
+
        TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
                          "Valid lcore_attr_reset_all() didn't return success");
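
For anyone who wants to try the idea outside the DPDK tree, here is a rough
standalone sketch of the same bounded-polling pattern. It is only an
illustration, not the test code: wait_until_inactive(), struct fake_service
and fake_still_active() are hypothetical names made up for this example, and
the fake predicate merely stands in for a check such as
rte_service_may_be_active().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: returns true while the worker is still active. */
typedef bool (*still_active_fn)(void *ctx);

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};
	nanosleep(&ts, NULL);
}

/*
 * Poll still_active() up to max_polls times, sleeping delay_ms between polls.
 * Returns the number of polls that still saw the worker active before it was
 * seen inactive, or -1 if it was still active when the budget ran out.
 */
static int wait_until_inactive(still_active_fn still_active, void *ctx,
			       int max_polls, long delay_ms)
{
	assert(still_active != NULL);

	for (int i = 0; i < max_polls; i++) {
		if (!still_active(ctx))
			return i;
		sleep_ms(delay_ms);
	}
	/* Final check, mirroring the assert placed after the loop in the patch. */
	return still_active(ctx) ? -1 : max_polls;
}

/* Hypothetical fake: reports "active" for a fixed number of polls. */
struct fake_service {
	int polls_until_stopped;
};

static bool fake_still_active(void *ctx)
{
	struct fake_service *s = ctx;

	if (s->polls_until_stopped > 0) {
		s->polls_until_stopped--;
		return true;
	}
	return false;
}
```

With a fake that stops after 3 polls, wait_until_inactive(fake_still_active,
&svc, 200, 1) returns 3 without using up the rest of the budget; with a fake
that never stops within the budget it returns -1, which is the case the test
assert is meant to catch. Compared with a fixed delay, the wait ends as soon
as the worker is seen inactive, and a worker that fails to stop is reported
instead of being hidden by the sleep.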



