[dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:

Neil Horman nhorman at tuxdriver.com
Fri Sep 26 13:46:30 CEST 2014


On Thu, Sep 25, 2014 at 11:24:30PM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > Sent: Thursday, September 25, 2014 6:24 PM
> > To: Ananyev, Konstantin
> > Cc: Jastrzebski, MichalX K; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
> > 
> > On Thu, Sep 25, 2014 at 04:03:48PM +0000, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Neil Horman
> > > > Sent: Thursday, September 25, 2014 4:08 PM
> > > > To: Jastrzebski, MichalX K
> > > > Cc: dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
> > > >
> > > > On Thu, Sep 25, 2014 at 01:56:08PM +0100, Michal Jastrzebski wrote:
> > > > >     Change alarm cancel function to thread-safe.
> > > > >     It eliminates a race between threads using rte_alarm_cancel and
> > > > >     rte_alarm_set.
> > > > >
> > > > > Signed-off-by: Pawel Wodkowski <pawelx.wodkowski at intel.com>
> > > > > Reviewed-by: Michal Jastrzebski <michalx.k.jastrzebski at intel.com>
> > > > >
> > > > > ---
> > > > >  lib/librte_eal/common/include/rte_alarm.h |    3 +-
> > > > >  lib/librte_eal/linuxapp/eal/eal_alarm.c   |   68 ++++++++++++++++++-----------
> > > > >  2 files changed, 45 insertions(+), 26 deletions(-)
> > > > >
> > > >
> > > > > diff --git a/lib/librte_eal/common/include/rte_alarm.h b/lib/librte_eal/common/include/rte_alarm.h
> > > > > index d451522..e7cbaef 100644
> > > > > --- a/lib/librte_eal/common/include/rte_alarm.h
> > > > > +++ b/lib/librte_eal/common/include/rte_alarm.h
> > > > > @@ -76,7 +76,8 @@ typedef void (*rte_eal_alarm_callback)(void *arg);
> > > > >  int rte_eal_alarm_set(uint64_t us, rte_eal_alarm_callback cb, void *cb_arg);
> > > > >
> > > > >  /**
> > > > > - * Function to cancel an alarm callback which has been registered before.
> > > > > + * Function to cancel an alarm callback which has been registered before. If
> > > > > + * used outside alarm callback it wait for all callbacks to finish its execution.
> > > > >   *
> > > > >   * @param cb_fn
> > > > >   *  alarm callback
> > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_alarm.c b/lib/librte_eal/linuxapp/eal/eal_alarm.c
> > > > > index 480f0cb..ea8dfb4 100644
> > > > > --- a/lib/librte_eal/linuxapp/eal/eal_alarm.c
> > > > > +++ b/lib/librte_eal/linuxapp/eal/eal_alarm.c
> > > > > @@ -69,7 +69,8 @@ struct alarm_entry {
> > > > >  	struct timeval time;
> > > > >  	rte_eal_alarm_callback cb_fn;
> > > > >  	void *cb_arg;
> > > > > -	volatile int executing;
> > > > > +	volatile uint8_t executing;
> > > > > +	volatile pthread_t executing_id;
> > > > >  };
> > > > >
> > > > >  static LIST_HEAD(alarm_list, alarm_entry) alarm_list = LIST_HEAD_INITIALIZER();
> > > > > @@ -108,11 +109,13 @@ eal_alarm_callback(struct rte_intr_handle *hdl __rte_unused,
> > > > >  			(ap->time.tv_sec < now.tv_sec || (ap->time.tv_sec == now.tv_sec &&
> > > > >  						ap->time.tv_usec <= now.tv_usec))){
> > > > >  		ap->executing = 1;
> > > > > +		ap->executing_id = pthread_self();
> > > > How exactly does this work?  From my read all alarm callbacks are handled by the
> > > > thread created in rte_eal_intr_init (which runs forever in
> > > > eal_intr_thread_main()).
> > >
> > > In current implementation - yes.
> > >
> > >  So every assignment to the above executing_id value
> > > > will be from that thread.  As such, anytime rte_eal_alarm_cancel is called from
> > > > within a callback we are guaranteed that:
> > > > a) the ap->executing flag is set to 1
> > > > b) the ap->executing_id value should equal whatever is returned from
> > > > pthread_self()
> > >
> > > Yes
> > >
> > > >
> > > > That will cause the executing counter local to the cancel function to get
> > > > incremented, meaning we will deadlock withing that do { ... } while (executing
> > > > != 0) loop, no?
> > >
> > > No, as for the case when cancel is called from callback:
> > > pthread_equal(ap->executing_id, pthread_self())
> > > would return non-zero value (which means threads ids are equal), so executing will not be incremented.
> > >
> > Ah, pthread_equal is one of the backwards functions that returns zero for
> > inequality.  Maybe then rewrite that as:
> > if (!pthread_equal(...)
> > 
> > So its clear that we're looking for inequality there to increment?
> > 
> > > >
> > > > >  		rte_spinlock_unlock(&alarm_list_lk);
> > > > >
> > > > >  		ap->cb_fn(ap->cb_arg);
> > > > >
> > > > >  		rte_spinlock_lock(&alarm_list_lk);
> > > > > +
> > > > >  		LIST_REMOVE(ap, next);
> > > > >  		rte_free(ap);
> > > > >  	}
> > > > > @@ -145,7 +148,7 @@ rte_eal_alarm_set(uint64_t us, rte_eal_alarm_callback cb_fn, void *cb_arg)
> > > > >  	if (us < 1 || us > (UINT64_MAX - US_PER_S) || cb_fn == NULL)
> > > > >  		return -EINVAL;
> > > > >
> > > > > -	new_alarm = rte_malloc(NULL, sizeof(*new_alarm), 0);
> > > > > +	new_alarm = rte_zmalloc(NULL, sizeof(*new_alarm), 0);
> > > > >  	if (new_alarm == NULL)
> > > > >  		return -ENOMEM;
> > > > >
> > > > > @@ -156,7 +159,6 @@ rte_eal_alarm_set(uint64_t us, rte_eal_alarm_callback cb_fn, void *cb_arg)
> > > > >  	new_alarm->cb_arg = cb_arg;
> > > > >  	new_alarm->time.tv_usec = (now.tv_usec + us) % US_PER_S;
> > > > >  	new_alarm->time.tv_sec = now.tv_sec + ((now.tv_usec + us) / US_PER_S);
> > > > > -	new_alarm->executing = 0;
> > > > >
> > > > This removes the only place where ->executing is cleared again.  If there is
> > > > only one change to this bits state (which is the case after this patch), it
> > > > seems that you can just use the executing bit as the test in the alarm_cancel
> > > > function, and remove all the pthread_self mess.
> > >
> > > I believe we do need executing_id here.
> > > It allows us to distinguish are we executing cancel from a callback or not.
> > >
> > Given what you said above, I agree, at least in the current implementation.  It
> > still seems like theres a simpler solution that doesn't require all the
> > comparative gymnastics.
> > 
> > What if, instead of testing if you're the callback thread, we turn the executing
> > field of alarm_entry into a bitfield, where bit 0 represents the former
> > "executing" state, and bit 1 is defined as a "cancelled" bit.  Then
> > rte_eal_alarm_cancel becomes a search that, when an alarm is found simply or's
> > in the cancelled bit to the executing bit field.  When the callback thread runs,
> > it skips executing any alarm that is marked as cancelled, but frees all alarm
> > entries that are executed or cancelled.  That gives us a single point at which
> > frees of alarm entires happen?  Something like the patch below (completely
> > untested)?
> 
> So basically cancel() just set ALARM_CANCELLED and leaves actual alarm deletion to the callback()?
That was the thought, yes.

> I think it is doable - but I don't see any real advantage with that approach.
> Yes, code will become a bit simpler, as  we'll have one point when we remove alarm from the list.
Yes, that would be the advantage, that the code would be much simpler.

> But from other side, imagine such simple test-case:
> 
> for (i = 0; i < 0x100000; i++) {
>    rte_eal_alarm_set(ONE_MIN, cb_func, (void *)i);
>    rte_eal_alarm_cancel(cb_func, (void *)i);
> } 
> 
> We'll endup with 1M of cancelled, but still not removed entries in the alarm_list.
> With current implementation that means - few MBs of wasted memory,
Thats correct, and the tradeoff to choose between.  Do you want simpler code
that is easier to maintain, or do you want a high speed cancel and set
operation.  I'm not aware of all the use cases, but I have a hard time seeing
a use case in which the in-flight alarm list grows unboundedly large, which in
my mind mitigates the risk of deferred removal, but I'm perfectly willing to
believe that there are use cases which I'm not aware of.

> plus very slow set() and cancel(), as they'll  have to traverse all entries in the list.  
> And all that - for empty from user perspective alarm_list 
> So I still prefer Michal's way.
> After all, it doesn't look that complicated to me. 
Except that the need for Michals fix arose from the fact that we have two free
locations that might both get called depending on the situation.  Thats what I'm
trying to address here, the complexity itself, rather than the fix (which I
agree is perfectly valid).

> BTW, any particular reason you are so negative about pthread_self()?
> 
Nothing specifically against it (save for its inverted return code sense, which
made it difficult for me to parse when reviewing).  Its more the complexity
itself in the alarm cancel and callback routine that I was looking at.  Given
that the origional bug happened because an cancel in a callback might produce a
double free, I wanted to fix it by simpifying the code, not adding conditions
which make it more complex.

You know, looking at it, something else just occured to me.  I think this could
all be fixed without the cancel flag or the pthread_self assignments.  What if
we simply removed the alarm from the list before we called the callback in
rte_eal_alarm_callback()?  That way any cancel operation called from within the
callback would fail, as it wouldn't appear on the list, and the callback
operation would be the only freeing entity?  That would let you still have a
fast set and cancel, and avoid the race.  Thoughts?  Untested sample patch below


> > 
> > It also seems like the alarm api as a whole could use some improvement.  The
> > way its written right now, theres no way to refer to a specific alarm (i.e.
> > cancelation relies on the specification of a function and data pointer, which
> > may refer to multiple timers).  Shouldn't rte_eal_alarm_set return an opaque
> > handle to a unique timer instance that can be store by a caller and used to
> > specfically cancel that timer?  Thats how both the bsd and linux timer
> > subsystems model timers.
> 
> Yeh,  alarm API looks a bit unusual. 
> Though, I suppose that's subject for another patch/discussion :)
> 
Yes, agreed :)




diff --git a/lib/librte_eal/linuxapp/eal/eal_alarm.c b/lib/librte_eal/linuxapp/eal/eal_alarm.c
index 480f0cb..d586ff4 100644
--- a/lib/librte_eal/linuxapp/eal/eal_alarm.c
+++ b/lib/librte_eal/linuxapp/eal/eal_alarm.c
@@ -107,13 +107,13 @@ eal_alarm_callback(struct rte_intr_handle *hdl __rte_unused,
 			gettimeofday(&now, NULL) == 0 &&
 			(ap->time.tv_sec < now.tv_sec || (ap->time.tv_sec == now.tv_sec &&
 						ap->time.tv_usec <= now.tv_usec))){
+		LIST_REMOVE(ap, next);
 		ap->executing = 1;
 		rte_spinlock_unlock(&alarm_list_lk);
 
 		ap->cb_fn(ap->cb_arg);
 
 		rte_spinlock_lock(&alarm_list_lk);
-		LIST_REMOVE(ap, next);
 		rte_free(ap);
 	}
 


More information about the dev mailing list