[dpdk-dev] [PATCH v2 2/5] distributor: new packet distributor library
Richardson, Bruce
bruce.richardson at intel.com
Mon Jun 2 23:40:04 CEST 2014
> -----Original Message-----
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Thursday, May 29, 2014 6:48 AM
> To: Richardson, Bruce
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 2/5] distributor: new packet distributor library
>
> > +
> > +/* flush the distributor, so that there are no outstanding packets in flight or
> > + * queued up. */
> It's not clear to me that this is a distributor-only function. You modified the
> comments to indicate that lcores can't perform double duty as both a worker and
> a distributor, which is fine, but it implies that there is a clear distinction
> between functions that are 'worker' functions and 'distributor' functions.
> While it's for the most part clear-ish (workers call rte_distributor_get_pkt and
> rte_distributor_return_pkt, distributors call rte_distributor_create/process),
> this one is in a grey area. The analogy I'm thinking of here is kernel workqueues.
> There's a specific workqueue thread that processes the workqueue, but any process
> can sync or flush the workqueue, leading me to think this function can be called
> by a worker lcore.
I can update the comments here further, but I was hoping the way things are right now was clear enough. In the header and C files, I have the functions explicitly split up into distributor and worker function sets, with a big block of text in the header at the start of each section explaining the threading use of the functions that follow.
>
> > +int
> > +rte_distributor_flush(struct rte_distributor *d)
> > +{
> > + unsigned wkr, total_outstanding = 0;
> > + unsigned flushed = 0;
> > + unsigned ret_start = d->returns.start,
> > + ret_count = d->returns.count;
> > +
> > + for (wkr = 0; wkr < d->num_workers; wkr++)
> > + total_outstanding += d->backlog[wkr].count +
> > + !!(d->in_flight_tags[wkr]);
> > +
> > + wkr = 0;
> > + while (flushed < total_outstanding) {
> > +
> > + if (d->in_flight_tags[wkr] != 0 || d->backlog[wkr].count) {
> > + const int64_t data = d->bufs[wkr].bufptr64;
> > + uintptr_t oldbuf = 0;
> > +
> > + if (data & RTE_DISTRIB_GET_BUF) {
> > + flushed += (d->in_flight_tags[wkr] != 0);
> > + if (d->backlog[wkr].count) {
> > + d->bufs[wkr].bufptr64 =
> > + backlog_pop(&d->backlog[wkr]);
> > + /* we need to mark something as being
> > + * in-flight, but it doesn't matter what
> > + * as we never check it except
> > + * to check for non-zero.
> > + */
> > + d->in_flight_tags[wkr] = 1;
> > + } else {
> > + d->bufs[wkr].bufptr64 =
> > + RTE_DISTRIB_GET_BUF;
> > + d->in_flight_tags[wkr] = 0;
> > + }
> > + oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > + } else if (data & RTE_DISTRIB_RETURN_BUF) {
> > + if (d->backlog[wkr].count == 0 ||
> > + move_backlog(d, wkr) == 0) {
> > + /* only if we move backlog,
> > + * process this packet */
> > + d->bufs[wkr].bufptr64 = 0;
> > + oldbuf = data >> RTE_DISTRIB_FLAG_BITS;
> > + flushed++;
> > + d->in_flight_tags[wkr] = 0;
> > + }
> > + }
> > +
> > + store_return(oldbuf, d, &ret_start, &ret_count);
> > + }
> > +
> I know the comments for move_backlog say you use that function here rather than
> what you do in distributor_process because you're tracking the flush count here.
> That said, if you instead recomputed the total_outstanding count on each loop
> iteration, and tested it for 0, I think you could just reduce the flush
> operation to a looping call to rte_distributor_process. It would save you
> having to maintain the flush code and the move_backlog code separately, which
> would be a nice savings.
Yes, agreed, I should have spotted that myself. I'll look to rework this as soon as I can.
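To make the suggestion concrete, here is a minimal sketch of what "flush as a loop around process" could look like. It uses a toy stand-in for the distributor rather than the real struct rte_distributor: the toy_* names, the fixed worker limit, and the assumption that every worker hands back its packet on each pass are all made up for illustration, and the real rework may well look different.

```c
#include <assert.h>

#define TOY_MAX_WORKERS 8

/* Hypothetical stand-in for struct rte_distributor: per-worker counts only. */
struct toy_distributor {
	unsigned num_workers;
	unsigned backlog_count[TOY_MAX_WORKERS]; /* packets queued per worker */
	unsigned in_flight[TOY_MAX_WORKERS];     /* non-zero if worker holds a packet */
};

/* Recompute the outstanding count: backlog plus one per busy worker. */
static unsigned
toy_total_outstanding(const struct toy_distributor *d)
{
	unsigned wkr, total = 0;

	assert(d->num_workers <= TOY_MAX_WORKERS);
	for (wkr = 0; wkr < d->num_workers; wkr++)
		total += d->backlog_count[wkr] + !!d->in_flight[wkr];
	return total;
}

/* One processing pass with no new input. Toy assumption: each busy worker
 * returns its packet, then takes the next one from its backlog if any. */
static void
toy_process(struct toy_distributor *d)
{
	unsigned wkr;

	for (wkr = 0; wkr < d->num_workers; wkr++) {
		d->in_flight[wkr] = 0;
		if (d->backlog_count[wkr] > 0) {
			d->backlog_count[wkr]--;
			d->in_flight[wkr] = 1;
		}
	}
}

/* Flush is just "process until nothing is outstanding"; the return value
 * is the number of packets that were outstanding on entry. */
static unsigned
toy_flush(struct toy_distributor *d)
{
	const unsigned flushed = toy_total_outstanding(d);

	while (toy_total_outstanding(d) > 0)
		toy_process(d);
	return flushed;
}
```

The point of the shape is that the flush path no longer needs its own backlog-moving logic; the cost is recomputing the outstanding count once per pass, which is a short loop over the workers.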
>
> > + if (++wkr == d->num_workers)
> > + wkr = 0;
> Nit: wkr = ++wkr % d->num_workers avoids the additional branch in your loop
>
a) The branch should be easily predictable, as the number of workers doesn't change. So long as the branch doesn't mispredict, there should be little or no perf penalty to having it.
b) The compare plus update can also be done branchless using a "cmov" instruction if we want branch-free code.
c) The modulus operator is very slow and takes far more cycles than a branch would. If we could limit the number of workers to a power of two, then an & operation would work nicely, but that is too big a restriction to have.
So, in short, I think I'll keep this snippet as-is. :-)
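For reference, the three ways of advancing the worker index discussed above can be put side by side as below. This is only an illustrative sketch: the next_wkr_* helper names are hypothetical and not part of the patch, and each helper assumes wkr < num_workers on entry. The modulus variant is written as (wkr + 1) % num_workers because assigning ++wkr % num_workers back to wkr would modify wkr twice in one expression.

```c
#include <assert.h>

/* Variant 1: compare and reset, as in the patch. */
static unsigned
next_wkr_branch(unsigned wkr, unsigned num_workers)
{
	assert(num_workers > 0);
	if (++wkr == num_workers)
		wkr = 0;
	return wkr;
}

/* Variant 2: modulus. Works for any worker count, but needs a division
 * by a value that is only known at run time. */
static unsigned
next_wkr_mod(unsigned wkr, unsigned num_workers)
{
	assert(num_workers > 0);
	return (wkr + 1) % num_workers;
}

/* Variant 3: mask. Only correct when num_workers is a power of two. */
static unsigned
next_wkr_mask(unsigned wkr, unsigned num_workers)
{
	assert(num_workers > 0 && (num_workers & (num_workers - 1)) == 0);
	return (wkr + 1) & (num_workers - 1);
}
```

All three give the same sequence of indices where they are valid; the difference is only in what the compiler has to emit for each and in the power-of-two restriction on the mask version.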