[dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test

Ananyev, Konstantin konstantin.ananyev at intel.com
Mon Feb 1 14:50:53 CET 2021
Previous message (by thread): [dpdk-dev] 回复: [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
Next message (by thread): [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Feifei,

> 
> Hi, Honnappa, Konstantin and Stephen
> 
> Thanks very much for your attention of this patch. Based on your opinion, Ruifeng and I discuss about this and make a summary：
> _________________________________________________________________________________________________________________
> ___
> 						main thread						worker thread
> rte_eal_remote_launch:
> 	[ Honnappa focus ]
> 
> 	To ensure f can load correct arg,
> 
> 	arg store should before f
> 						lcore_config[worker_id].f = f;
> 						lcore_config[worker_id].arg = arg;
> 						wmb()? or store-relase on f?
> 
> 										eal_thread_loop:
> 						pipeline_communication	---------------------->	pipeline_communication
> 													if (lcore_config[lcore_id].f ==
> NULL)
> 													rte_panic("NULL function
> pointer\n");
> 
> 													fct_arg =
> lcore_config[lcore_id].arg;
> 													ret =
> lcore_config[lcore_id].f(fct_arg);
> _________________________________________________________________________________________________________________
> __
> 
> test_ring_stress:				wmb()?
> 	[ Konstantin focus ]
> 										test_worker:
> 	Main thread can use wrk_cmd to
> 						Wrk_cmd =WRK_CMD_RUN;	---------------------->	Wrk_cmd == WRK_CMD_RUN;
> 	control multiple threads to start running
> 													wmb()?
> 	at the same time as much as possible
> 													ring_dequeue;
> 													ring_enqueue;
> 						Wrk_cmd =WRK_CMD_STOP;	---------------------->	Wrk_cmd == WRK_CMD_STOP;
> _________________________________________________________________________________________________________________
> ___
> 
> rte_eal_wait_lcore:											wmb()
> 	[ Honnappa focus ]
> 				lcore_config[lcore_id].state == FINISHED	<---------------------	lcore_config[lcore_id].state =
> FINISHED	Load-acquire and store-release
> 
> 	are used on the variable “state”
> 						rmb();
> _________________________________________________________________________________________________________________
> ___
> 
> From the picture above,
> 
> 1.First, for the underlying function rte_eal_remote_launch, Honnappa focuses on that,
> pipeline_communication cannot ensure ‘arg’ parameters is loaded correctly by
> the worker thread.
> This is because in weak memory order framework, maybe the main thread and worker
> thread firstly finish pipeline communication, and then the worker thread receive signal
> and execute the function ‘ f ’. However, it maybe load a wrong value of ‘arg’ due to that
> the main thread stores ‘arg’ after pipeline communication. So wmb or store_release is
> necessary for ‘arg’.
> 
> 2.Second, for the upper-layer test_ring_stress, Konstantin foucese on that,
> Whether the main thread can use ‘wrk_cmd’ to control multiple threads to run at the
> same time as much as possible.
> Because rte_eal_remote_launch only can communicates with one worker thread
> at the same time. This means some worker thread maybe start working very early but other
> worker threads maybe need to wait a long time to start working if  ‘wrk_cmd' is stored 'RUN' flag
> before rte_remote_launch.
> At last, for unit test, this may cause that the test results are not stable.
> 
> 3.Third, for rte_eal_wait_lcore, Honnappa focuses on that the ‘state’ as a   synchronous bariable,
> we should add load-acquire and store-release on it. However, there have been rmb and wmb
> after and before ‘state’, So I’m not sure whether we should replace them.
> 
> In summary, I think Honnappa and Konstantin have different concerns.
> For Honnappa, we can add wmb or store-release to ensure the ‘arg’ can be loaded correctly
> in rte_eal_remote_launch.
> For Konstantin, we can add wmb and rmb to ensure the main thread can control the worker
> Threads to run at the same time, and then make the test results more accurate in the
> ring_stress_test.

Agree with both.

> 
> 
> Best Regards
> Feifei
> 
> > -----邮件原件-----
> > 发件人: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> > 发送时间: 2021年1月30日 9:24
> > 收件人: Stephen Hemminger <stephen at networkplumber.org>
> > 抄送: Ananyev, Konstantin <konstantin.ananyev at intel.com>; Feifei Wang
> > <Feifei.Wang2 at arm.com>; dev at dpdk.org; nd <nd at arm.com>; Ruifeng
> > Wang <Ruifeng.Wang at arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>
> > 主题: RE: [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers
> > for ring stress test
> >
> > <snip>
> >
> > > >
> > > > >
> > > > > > >
> > > > > > > Hi Feifei,
> > > > > > >
> > > > > > > >
> > > > > > > > The variable "wrk_cmd" is a signal to control threads from
> > > > > > > > running and stopping. When worker lcores load "wrk_cmd ==
> > > > > WRK_CMD_RUN",
> > > > > > > > they
> > > > > > > start
> > > > > > > > running and when worker lcores load "wrk_cmd ==
> > > > > > > > WRK_CMD_STOP",
> > > > > > > they
> > > > > > > > stop.
> > > > > > > >
> > > > > > > > For the wmb in test_mt1, no storing operations must keep the
> > > > > > > > order after storing "wrk_cmd". Thus the wmb is unnecessary.
> > > > > > >
> > > > > > > I think there is a bug in my original code, we should do
> > > > > > > smp_wmb()
> > > > > > > *before* setting wrk_cmd, not after:
> > > > > > >
> > > > > > >         /* launch on all workers */
> > > > > > >         RTE_LCORE_FOREACH_WORKER(lc) {
> > > > > > >                 arg[lc].rng = r;
> > > > > > >                 arg[lc].stats = init_stat;
> > > > > > >                 rte_eal_remote_launch(test, &arg[lc], lc);
> > > > > > >         }
> > > > > > >
> > > > > > >         /* signal worker to start test */
> > > > > > > +      rte_smp_wmb();
> > > > > > >         wrk_cmd = WRK_CMD_RUN;
> > > > > > > -       rte_smp_wmb();
> > > > > > >
> > > > > > >         usleep(run_time * US_PER_S);
> > > > > > >
> > > > > > >
> > > > > > > I still think we'd better have some synchronisation here.
> > > > > > > Otherwise what would prevent compiler and/or cpu to update
> > > > > > > wrk_cmd out of order (before _init_ phase is completed)?
> > > > > > > We probably can safely assume no reordering from the compiler
> > > > > > > here, as we have function calls straight before and after
> > > > > > > 'wrk_cmd =
> > > > > WRK_CMD_RUN;'
> > > > > > > But for consistency and easier maintenance, I still think it
> > > > > > > is better to have something here, after all it is not
> > > > > > > performance critical
> > > pass.
> > > > > > Agree that this is not performance critical.
> > > > > >
> > > > > > This is more about correctness (as usually people refer to code
> > > > > > to understand the concepts). You can refer to video [1].
> > > > > > Essentially, the pthread_create has 'happens-before' behavior.
> > > > > > i.e. all the memory operations before the pthread_create are
> > > > > > visible to the new
> > > thread.
> > > > > > The
> > > > > > rte_smp_rmb() barrier in the thread function is not required as
> > > > > > it reads the
> > > > > data that was set before the thread was launched.
> > > > >
> > > > > rte_eal_remote_launch() doesn't call pthread_create().
> > > > > All it does -  updates global variable (lcore_config) and
> > > > > writes/reads to/from the pipe.
> > > > >
> > > > Thanks for the reminder ☹
> > > > I think rte_eal_remote_launch and rte_eal_wait_lcore need to provide
> > > behavior similar to pthread_launch and pthread_join respectively.
> > > >
> > > > There is use of rte_smp_*mb in those functions as well. Those need
> > > > to be fixed
> > > first and then look at these.
> > >
> > > Looks like you want __atomic_thread_fence() here.
> > >
> > In the rte_eal_remote_launch case, all the memory operations before the
> > API call need to be visible to the worker. If this is the only requirement, we
> > can use the function pointer as the guard variable and use store-release. In
> > the eal_thread_loop function we could do load-acquire on the function
> > pointer.
> >
> > I do not think that there is a requirement to ensure that the memory
> > operations after the API call do not happen before the worker thread starts
> > running the function (As there is no guarantee on when the worker thread
> > will run. If the main thread needs to know if the worker thread is running
> > explicit hand-shaking needs to happen).
> >
> > The rte_eal_wait_lcore API needs to ensure that the memory operations in
> > the worker are visible to the main. rte_eal_wait_lcore and eal_thread_loop
> > are synchronizing using lcore_config[worker_id].state. I need to understand
> > what else 'state' is used for. If there are no issues, we can do a store-release
> > on 'state' in eal_thread_loop and a load-acquire in rte_eal_wait_lcore.
> >
> > So, we do not have to use the __atomic_thread_fence.
> >
Previous message (by thread): [dpdk-dev] 回复: [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
Next message (by thread): [dpdk-dev] [PATCH v1 1/2] app/test: remove unnecessary barriers for ring stress test
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list