[dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Wed Feb 3 17:24:33 CET 2021


<snip>

> 
> Hi Feifei,
> 
> >
> > Hi, Honnappa, Konstantin and Stephen
> >
> > Thanks very much for your attention of this patch. Based on your
> > opinion, Ruifeng and I discuss about this and make a summary:
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > ___
> > 						main thread
> 				worker thread
> > rte_eal_remote_launch:
> > 	[ Honnappa focus ]
> >
> > 	To ensure f can load correct arg,
> >
> > 	arg store should before f
> > 						lcore_config[worker_id].f = f;
> > 						lcore_config[worker_id].arg =
> arg;
> > 						wmb()? or store-relase on f?
> >
> >
> 	eal_thread_loop:
> > 						pipeline_communication
> 	---------------------->	pipeline_communication
> >
> 				if (lcore_config[lcore_id].f ==
> > NULL)
> >
> 				rte_panic("NULL function
> > pointer\n");
> >
> >
> 				fct_arg =
> > lcore_config[lcore_id].arg;
> >
> 				ret =
> > lcore_config[lcore_id].f(fct_arg);
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > __
> >
> > test_ring_stress:				wmb()?
> > 	[ Konstantin focus ]
> >
> 	test_worker:
> > 	Main thread can use wrk_cmd to
> > 						Wrk_cmd =WRK_CMD_RUN;
> 	---------------------->	Wrk_cmd == WRK_CMD_RUN;
> > 	control multiple threads to start running
> >
> 				wmb()?
> > 	at the same time as much as possible
> >
> 				ring_dequeue;
> >
> 				ring_enqueue;
> > 						Wrk_cmd =WRK_CMD_STOP;
> 	---------------------->	Wrk_cmd == WRK_CMD_STOP;
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > ___
> >
> > rte_eal_wait_lcore:
> 				wmb()
> > 	[ Honnappa focus ]
> > 				lcore_config[lcore_id].state == FINISHED
> 	<---------------------	lcore_config[lcore_id].state =
> > FINISHED	Load-acquire and store-release
> >
> > 	are used on the variable “state”
> > 						rmb();
> >
> __________________________________________________________
> ____________
> > ___________________________________________
> > ___
> >
> > From the picture above,
> >
> > 1.First, for the underlying function rte_eal_remote_launch, Honnappa
> > focuses on that, pipeline_communication cannot ensure ‘arg’ parameters
> > is loaded correctly by the worker thread.
> > This is because in weak memory order framework, maybe the main thread
> > and worker thread firstly finish pipeline communication, and then the
> > worker thread receive signal and execute the function ‘ f ’. However,
> > it maybe load a wrong value of ‘arg’ due to that the main thread
> > stores ‘arg’ after pipeline communication. So wmb or store_release is
> necessary for ‘arg’.
> >
> > 2.Second, for the upper-layer test_ring_stress, Konstantin foucese on
> > that, Whether the main thread can use ‘wrk_cmd’ to control multiple
> > threads to run at the same time as much as possible.
> > Because rte_eal_remote_launch only can communicates with one worker
> > thread at the same time. This means some worker thread maybe start
> > working very early but other worker threads maybe need to wait a long
> > time to start working if  ‘wrk_cmd' is stored 'RUN' flag before
> rte_remote_launch.
> > At last, for unit test, this may cause that the test results are not stable.
> >
> > 3.Third, for rte_eal_wait_lcore, Honnappa focuses on that the ‘state’ as a
> synchronous bariable,
> > we should add load-acquire and store-release on it. However, there
> > have been rmb and wmb after and before ‘state’, So I’m not sure whether
> we should replace them.
> >
> > In summary, I think Honnappa and Konstantin have different concerns.
> > For Honnappa, we can add wmb or store-release to ensure the ‘arg’ can
> > be loaded correctly in rte_eal_remote_launch.
> > For Konstantin, we can add wmb and rmb to ensure the main thread can
> > control the worker Threads to run at the same time, and then make the
> > test results more accurate in the ring_stress_test.
> 
> Agree with both.
Thanks Feifei, understood. I am just trying to take a step back and see what kind of ordering guarantees rte_eal_remote_launch should provide so that we do not have to deal with adding additional barriers in the applications. For ex: if we can avoid the barriers around 'wrk_cmd' (kind of use cases) it will benefit all the applications.

> 
> >
> >
> > Best Regards
> > Feifei
> >
> > > -----邮件原件-----
> > > 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> > > 发送时间: 2021年1月30日 9:24
> > > 收件人: Stephen Hemminger <stephen at networkplumber.org>
> > > 抄送: Ananyev, Konstantin <konstantin.ananyev at intel.com>; Feifei
> Wang
> > > <Feifei.Wang2 at arm.com>; dev at dpdk.org; nd <nd at arm.com>; Ruifeng
> Wang
> > > <Ruifeng.Wang at arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>
> > > 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary
> > > barriers for ring stress test
> > >
> > > <snip>
> > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > Hi Feifei,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > The variable "wrk_cmd" is a signal to control threads
> > > > > > > > > from running and stopping. When worker lcores load
> > > > > > > > > "wrk_cmd ==
> > > > > > WRK_CMD_RUN",
> > > > > > > > > they
> > > > > > > > start
> > > > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > > > WRK_CMD_STOP",
> > > > > > > > they
> > > > > > > > > stop.
> > > > > > > > >
> > > > > > > > > For the wmb in test_mt1, no storing operations must keep
> > > > > > > > > the order after storing "wrk_cmd". Thus the wmb is
> unnecessary.
> > > > > > > >
> > > > > > > > I think there is a bug in my original code, we should do
> > > > > > > > smp_wmb()
> > > > > > > > *before* setting wrk_cmd, not after:
> > > > > > > >
> > > > > > > >         /* launch on all workers */
> > > > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > > > >                 arg[lc].rng = r;
> > > > > > > >                 arg[lc].stats = init_stat;
> > > > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         /* signal worker to start test */
> > > > > > > > +      rte_smp_wmb();
> > > > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > > > -       rte_smp_wmb();
> > > > > > > >
> > > > > > > >         usleep(run_time * US_PER_S);
> > > > > > > >
> > > > > > > >
> > > > > > > > I still think we'd better have some synchronisation here.
> > > > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > > > We probably can safely assume no reordering from the
> > > > > > > > compiler here, as we have function calls straight before
> > > > > > > > and after 'wrk_cmd =
> > > > > > WRK_CMD_RUN;'
> > > > > > > > But for consistency and easier maintenance, I still think
> > > > > > > > it is better to have something here, after all it is not
> > > > > > > > performance critical
> > > > pass.
> > > > > > > Agree that this is not performance critical.
> > > > > > >
> > > > > > > This is more about correctness (as usually people refer to
> > > > > > > code to understand the concepts). You can refer to video [1].
> > > > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > > > i.e. all the memory operations before the pthread_create are
> > > > > > > visible to the new
> > > > thread.
> > > > > > > The
> > > > > > > rte_smp_rmb() barrier in the thread function is not required
> > > > > > > as it reads the
> > > > > > data that was set before the thread was launched.
> > > > > >
> > > > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > > > All it does -  updates global variable (lcore_config) and
> > > > > > writes/reads to/from the pipe.
> > > > > >
> > > > > Thanks for the reminder ☹
> > > > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to
> > > > > provide
> > > > behavior similar to pthread_launch and pthread_join respectively.
> > > > >
> > > > > There is use of rte_smp_*mb in those functions as well. Those
> > > > > need to be fixed
> > > > first and then look at these.
> > > >
> > > > Looks like you want __atomic_thread_fence() here.
> > > >
> > > In the rte_eal_remote_launch case, all the memory operations before
> > > the API call need to be visible to the worker. If this is the only
> > > requirement, we can use the function pointer as the guard variable
> > > and use store-release. In the eal_thread_loop function we could do
> > > load-acquire on the function pointer.
> > >
> > > I do not think that there is a requirement to ensure that the memory
> > > operations after the API call do not happen before the worker thread
> > > starts running the function (As there is no guarantee on when the
> > > worker thread will run. If the main thread needs to know if the
> > > worker thread is running explicit hand-shaking needs to happen).
> > >
> > > The rte_eal_wait_lcore API needs to ensure that the memory
> > > operations in the worker are visible to the main. rte_eal_wait_lcore
> > > and eal_thread_loop are synchronizing using
> > > lcore_config[worker_id].state. I need to understand what else
> > > 'state' is used for. If there are no issues, we can do a store-release on
> 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.
> > >
> > > So, we do not have to use the __atomic_thread_fence.
> > >



More information about the dev mailing list