[dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
Van Haaren, Harry
harry.van.haaren at intel.com
Mon Oct 14 18:48:57 CEST 2019
> -----Original Message-----
> From: Aaron Conole [mailto:aconole at redhat.com]
> Sent: Monday, October 14, 2019 3:54 PM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: David Marchand <david.marchand at redhat.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest failing
>
> Aaron Conole <aconole at redhat.com> writes:
>
> > "Van Haaren, Harry" <harry.van.haaren at intel.com> writes:
> >
> >>> -----Original Message-----
> >>> From: Aaron Conole [mailto:aconole at redhat.com]
> >>> Sent: Wednesday, September 4, 2019 8:56 PM
> >>> To: David Marchand <david.marchand at redhat.com>
> >>> Cc: Van Haaren, Harry <harry.van.haaren at intel.com>; dev at dpdk.org
> >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest
> >>> failing
<snip lots of backlog>
> >>> > real 2m42.884s
> >>> > user 5m1.902s
> >>> > sys 0m2.208s
> >>>
> >>> I can confirm - takes about 1m to fail.
> >>
> >>
> >> Hi Aaron and David,
> >>
> >> I've been attempting to reproduce this, still no errors here.
> >>
> >> Given the nature of service-cores, and the difficulty to reproduce
> >> here this feels like a race-condition - one that may not exist in all
> >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
> >>
> >> I'm using Meson to build, so reproducing using this instead of the
> command
> >> as provided above. There should be no difference in reproducing due to
> this:
> >
> > The command runs far more iterations than meson does (I think).
> >
> > I still see it periodically occur in the travis environment.
> >
> > I did see at least one missing memory barrier (I believe). Please
> > review the following code change (and if you agree I can submit it
> > formally):
> >
> > -----
> > --- a/lib/librte_eal/common/eal_common_launch.c
> > +++ b/lib/librte_eal/common/eal_common_launch.c
> > @@ -21,8 +21,10 @@
> > int
> > rte_eal_wait_lcore(unsigned slave_id)
> > {
> > - if (lcore_config[slave_id].state == WAIT)
> > + if (lcore_config[slave_id].state == WAIT) {
> > + rte_rmb();
> > return 0;
> > + }
> >
> > while (lcore_config[slave_id].state != WAIT &&
> > lcore_config[slave_id].state != FINISHED)
> > -----
> >
> > This is because in lib/librte_eal/linux/eal/eal_thread.c:
> >
> > -----
> > /* when a service core returns, it should go directly to WAIT
> > * state, because the application will not lcore_wait() for it.
> > */
> > if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> > lcore_config[lcore_id].state = WAIT;
> > else
> > lcore_config[lcore_id].state = FINISHED;
> > -----
> >
> > NOTE that the service core skips the rte_eal_wait_lcore() code from
> > making the FINISHED->WAIT transition. So I think at least that read
> > barrier will be needed (maybe I miss the pairing, though?).
> >
> > Additionally, I'm wondering if there is an additional write or sync
> > barrier needed to ensure that some of the transitions are properly
> > recorded when using lcore as a service lcore function. The fact that
> > this only happens occasionally tells me that it's either a race (which
> > is possible... because the variable update in the test might not be
> > sync'd across cores or something), or some other missing
> > synchronization.
> >
> >> $ meson test service_autotest --repeat 50
> >>
> >> 1/1 DPDK:fast-tests / service_autotest OK 3.86 s
> >> 1/1 DPDK:fast-tests / service_autotest OK 3.87 s
> >> ...
> >> 1/1 DPDK:fast-tests / service_autotest OK 3.84 s
> >>
> >> OK: 50
> >> FAIL: 0
> >> SKIP: 0
> >> TIMEOUT: 0
> >>
> >> I'll keep it running for a few hours but I have little faith if it only
> >> takes 1 minute on your machines...
> >
> > Please try the flat command.
>
> Not sure if you've had any time to look at this.
Apologies for the delay in response - I've run the existing tests a few thousand times during the week, with one reproduction. That's not enough for confidence in a debug/fix for me.
> I think there's a change we can make, but not sure about how it fits in
> the overall service lcore design.
This suggestion only changes the test code, correct?
> The proposal is to use a pthread_cond variable which blocks the thread
> requesting the service function to run. The service function merely
> sets the condition. The requesting thread does a timed wait (up to 5s?)
> and if the timeout is exceeded can throw an error. Otherwise, it will
> unblock and can assume that the test passes. WDYT? I think it works
> better than the racy code in the test case for now.
The idea above is right, but I think that's what the test is
approximating anyway? The main thread does an "mp_wait_lcore()" until
the service core has returned, essentially a blocking call.
The test fails if the flag is not == 1 (as that indicates failure in launching
an application function on a previously-used-as-service-core lthread).
I think your RMB suggestion is likely to be correct, but I'd like to dig into it a bit more.
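
On the pairing question: a read barrier on the waiting side only helps if the
writing side also orders its earlier stores before publishing the new state.
Below is a minimal sketch of that pairing using C11 atomics (release store,
acquire load) rather than rte_rmb()/rte_wmb(). The demo_* names are
hypothetical and the struct is only loosely modelled on lcore_config; it says
nothing about what the EAL code does today.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum demo_state { DEMO_RUNNING, DEMO_WAIT };

/* Hypothetical stand-in for a per-lcore config entry. */
struct demo_lcore {
    int ret;           /* plain data written before the state change */
    _Atomic int state; /* published last by the worker */
};

/* Worker side: write the result, then publish the state with release. */
static void *
demo_worker(void *arg)
{
    struct demo_lcore *lc = arg;

    lc->ret = 42;
    atomic_store_explicit(&lc->state, DEMO_WAIT, memory_order_release);
    return NULL;
}

/*
 * Waiter side: the acquire load pairs with the release store above, so
 * once DEMO_WAIT is observed the earlier write to ret is visible too.
 */
static int
demo_wait(struct demo_lcore *lc)
{
    while (atomic_load_explicit(&lc->state, memory_order_acquire) != DEMO_WAIT)
        ; /* busy-wait, acceptable for a short demo only */
    return lc->ret;
}

/* Launch the worker on a plain pthread and collect its result. */
static int
demo_run(void)
{
    struct demo_lcore lc;
    pthread_t t;
    int rc, ret;

    lc.ret = 0;
    atomic_init(&lc.state, DEMO_RUNNING);

    rc = pthread_create(&t, NULL, demo_worker, &lc);
    assert(rc == 0);
    (void)rc;

    ret = demo_wait(&lc);
    pthread_join(t, NULL);
    return ret;
}
```

With only the acquire (read-barrier) half in place, the waiter could in
principle still see the new state before the worker's earlier writes on a
weakly ordered CPU, which is why I want to check both sides.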
Thanks for the ping on this thread.