[dpdk-dev] [PATCH v2 2/5] distributor: new packet distributor library

Richardson, Bruce bruce.richardson at intel.com
Mon Jun 2 23:40:04 CEST 2014



> -----Original Message-----
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Thursday, May 29, 2014 6:48 AM
> To: Richardson, Bruce
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 2/5] distributor: new packet distributor library
> 
> > +
> > +/* flush the distributor, so that there are no outstanding packets in flight or
> > + * queued up. */
> It's not clear to me that this is a distributor-only function.  You modified the
> comments to indicate that lcores can't perform double duty as both a worker and
> a distributor, which is fine, but it implies that there is a clear distinction
> between functions that are 'worker' functions and 'distributor' functions.
> While it's for the most part clear-ish (workers call rte_distributor_get_pkt and
> rte_distributor_return_pkt, distributors call rte_distributor_create/process),
> this one is in a grey area.  The analogy I'm thinking of here is kernel
> workqueues.  There's a specific workqueue thread that processes the workqueue,
> but any process can sync or flush the workqueue, leading me to think this
> function could be called by a worker lcore.

I can update the comments here further, but I was hoping that things as they stand are clear enough. In the header and C files, I have the functions explicitly split into distributor and worker function sets, with a big block of text in the header at the start of each section explaining the threading use of the functions that follow.
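
Roughly, the intended usage looks like the sketch below: rte_distributor_create/process on the distributor lcore, rte_distributor_get_pkt on the workers. This is illustrative only; handle_packet() and BURST_SIZE are placeholders, and the exact signatures may still change as the patch evolves.

#include <rte_distributor.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32  /* illustrative burst size */

/* application-specific work; hypothetical placeholder for this sketch */
static void
handle_packet(struct rte_mbuf *pkt)
{
	(void)pkt;
}

/* distributor lcore: receives packets and hands them out to the workers */
static int
distributor_thread(struct rte_distributor *d)
{
	struct rte_mbuf *bufs[BURST_SIZE];

	for (;;) {
		const uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
		/* packets with the same tag stay on the same worker, in order */
		rte_distributor_process(d, bufs, nb_rx);
	}
	return 0;
}

/* worker lcore: worker_id is this worker's index, 0 .. num_workers-1 */
static int
worker_thread(struct rte_distributor *d, unsigned worker_id)
{
	struct rte_mbuf *pkt = NULL;

	for (;;) {
		/* hand back the previous packet and block until a new one arrives */
		pkt = rte_distributor_get_pkt(d, worker_id, pkt);
		handle_packet(pkt);
	}
	return 0;
}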

> 
> > +int
> > +rte_distributor_flush(struct rte_distributor *d)
> > +{
> > +	unsigned wkr, total_outstanding = 0;
> > +	unsigned flushed = 0;
> > +	unsigned ret_start = d->returns.start,
> > +			ret_count = d->returns.count;
> > +
> > +	for (wkr = 0; wkr < d->num_workers; wkr++)
> > +		total_outstanding += d->backlog[wkr].count +
> > +				!!(d->in_flight_tags[wkr]);
> > +
> > +	wkr = 0;
> > +	while (flushed < total_outstanding) {
> > +
> > +		if (d->in_flight_tags[wkr] != 0 || d->backlog[wkr].count) {
> > +			const int64_t data = d->bufs[wkr].bufptr64;
> > +			uintptr_t oldbuf = 0;
> > +
> > +			if (data & RTE_DISTRIB_GET_BUF) {
> > +				flushed += (d->in_flight_tags[wkr] != 0);
> > +				if (d->backlog[wkr].count) {
> > +					d->bufs[wkr].bufptr64 =
> > +						backlog_pop(&d->backlog[wkr]);
> > +					/* we need to mark something as being
> > +					 * in-flight, but it doesn't matter what
> > +					 * as we never check it except
> > +					 * to check for non-zero.
> > +					 */
> > +					d->in_flight_tags[wkr] = 1;
> > +				} else {
> > +					d->bufs[wkr].bufptr64 =
> > +						RTE_DISTRIB_GET_BUF;
> > +					d->in_flight_tags[wkr] = 0;
> > +				}
> > +				oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > +			} else if (data & RTE_DISTRIB_RETURN_BUF) {
> > +				if (d->backlog[wkr].count == 0 ||
> > +						move_backlog(d, wkr) == 0) {
> > +					/* only if we move backlog,
> > +					 * process this packet */
> > +					d->bufs[wkr].bufptr64 = 0;
> > +					oldbuf = data >>
> > +						RTE_DISTRIB_FLAG_BITS;
> > +					flushed++;
> > +					d->in_flight_tags[wkr] = 0;
> > +				}
> > +			}
> > +
> > +			store_return(oldbuf, d, &ret_start, &ret_count);
> > +		}
> > +
> I know the comments for move_backlog say you use that function here rather
> than what you do in distributor_process because you're tracking the flush
> count here.  That said, if you instead recomputed the total_outstanding count
> on each loop iteration and tested it for 0, I think you could just reduce the
> flush operation to a looping call to rte_distributor_process.  It would save
> you having to maintain the flush code and the move_backlog code separately,
> which would be a nice savings.

Yes, agreed, I should have spotted that myself. I'll look to rework this as soon as I can.
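
I'm thinking of something along these lines: recompute the outstanding count each iteration and let rte_distributor_process() do the draining. Sketch only; total_outstanding() here is just an illustrative helper summing each worker's backlog plus any in-flight packet, not a name from the patch as posted.

static inline unsigned
total_outstanding(const struct rte_distributor *d)
{
	unsigned wkr, total = 0;

	for (wkr = 0; wkr < d->num_workers; wkr++)
		total += d->backlog[wkr].count + !!(d->in_flight_tags[wkr]);
	return total;
}

int
rte_distributor_flush(struct rte_distributor *d)
{
	const unsigned flushed = total_outstanding(d);

	/* passing no new packets lets process() simply push out the backlog */
	while (total_outstanding(d) > 0)
		rte_distributor_process(d, NULL, 0);

	return flushed;
}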

> 
> > +		if (++wkr == d->num_workers)
> > +			wkr = 0;
> Nit: wkr = ++wkr % d->num_workers avoids the additional branch in your loop
> 
a) The branch should be easily predictable, as the number of workers doesn't change. So long as the branch doesn't mispredict, there should be little or no performance penalty to having it.
b) The compare-plus-update can also be done branchlessly using a "cmov" instruction if we want branch-free code.
c) The modulus operator is very slow and takes far more cycles than a branch would. If we could limit the number of workers to a power of two, then an & operation would work nicely, but that is too big a restriction to impose.
So, in short, I think I'll keep this snippet as-is. :-)
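
For reference, the three variants being compared look roughly like the helpers below (illustrative only, not part of the patch):

/* a)/b): predictable branch; compilers can often turn this form into a cmov */
static inline unsigned
next_worker_branch(unsigned wkr, unsigned num_workers)
{
	if (++wkr == num_workers)
		wkr = 0;
	return wkr;
}

/* modulus: correct for any worker count, but costs a hardware divide */
static inline unsigned
next_worker_mod(unsigned wkr, unsigned num_workers)
{
	return (wkr + 1) % num_workers;
}

/* mask: cheapest, but only valid when num_workers is a power of two */
static inline unsigned
next_worker_mask(unsigned wkr, unsigned num_workers)
{
	return (wkr + 1) & (num_workers - 1);
}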

