[dpdk-dev] mbuf changes

Ananyev, Konstantin konstantin.ananyev at intel.com
Tue Oct 25 12:02:47 CEST 2016


Hi everyone,

> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, October 25, 2016 9:54 AM
> To: Morten Brørup <mb at smartsharesystems.com>
> Cc: Wiles, Keith <keith.wiles at intel.com>; dev at dpdk.org; Olivier Matz <olivier.matz at 6wind.com>
> Subject: Re: [dpdk-dev] mbuf changes
> 
> On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Brørup wrote:
> > Comments inline.
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Monday, October 24, 2016 6:26 PM
> > > To: Wiles, Keith
> > > Cc: Morten Brørup; dev at dpdk.org; Olivier Matz
> > > Subject: Re: [dpdk-dev] mbuf changes
> > >
> > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > >
> > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb at smartsharesystems.com>
> > > wrote:
> > > > >
> > > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > > >
> > > > >
> > > > >
> > > > > Continuing the Userspace discussion about Olivier Matz’s proposed mbuf
> > > changes...
> > >
> > > Thanks for keeping the discussion going!
> > > > >
> > > > >
> > > > >
> > > > > 1.
> > > > >
> > > > > Stephen Hemminger had a noteworthy general comment about keeping
> > > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > > generated by the NIC’s RX handler belongs in the first cache line, and
> > > metadata required by the NIC’s TX handler belongs in the second cache line.
> > > This also means that touching the second cache line on ingress should be
> > > avoided if possible; and Bruce Richardson mentioned that for this reason m-
> > > >next was zeroed on free().
> > > > >
> > > Thinking about it, I suspect there are more fields we can reset on free
> > > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > > too could be the nb_segs field and possibly others.
> > Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() for this purpose.
> >
> Possibly. We just want to make sure that we don't reset too much either!
> :-)
> 
> >
> > > > > 2.
> > > > >
> > > > > There seemed to be consensus that the size of m->refcnt should match
> > > the size of m->port because a packet could be duplicated on all physical
> > > ports for L3 multicast and L2 flooding.
> > > > >
> > > > > Furthermore, although a single physical machine (i.e. a single server)
> > > with 255 physical ports probably doesn’t exist, it might contain more than
> > > 255 virtual machines with a virtual port each, so it makes sense extending
> > > these mbuf fields from 8 to 16 bits.
> > > >
> > > > I thought we also talked about removing the m->port from the mbuf as it
> > > is not really needed.
> > > >
> > > Yes, this was mentioned, and also the option of moving the port value to
> > > the second cacheline, but it appears that NXP are using the port value
> > > in their NIC drivers for passing in metadata, so we'd need their
> > > agreement on any move (or removal).
> > >
> > If a single driver instance services multiple ports, so the ports are not polled individually, the m->port member will be required to identify
> the port. In that case, it might also used elsewhere in the ingress path, and thus appropriate having in the first cache line. 

Ok, but these days most of devices have multiple rx queues.
So identify the RX source properly you need not only port, but port+queue (at least 3 bytes).
But I suppose better to wait for NXP input here. 

>So yes, this needs
> further discussion with NXP.
> >
> >
> > > > > 3.
> > > > >
> > > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > > the second cache line, which then generated questions from the audience
> > > about the real life purpose of m->port, and if m->port could be removed
> > > from the mbuf structure.
> > > > >
> > > > >
> > > > >
> > > > > 4.
> > > > >
> > > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > > first allocation. This is based on the assumption that other mbuf fields
> > > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > > than setting it to 1.
> > > > >
> > > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > > not required to modify m->refcnt when allocating and freeing the mbuf, thus
> > > saving one write operation on both alloc() and free(). However, this
> > > assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> > >
> > > I don't think it really matters what sentinal value is used for the
> > > refcnt because it can't be blindly assigned on free like other fields.
> > > However, I think 0 as first reference value becomes more awkward
> > > than 1, because we need to deal with underflow. Consider the situation
> > > where we have two references to the mbuf, so refcnt is 1, and both are
> > > freed at the same time. Since the refcnt is not-zero, then both cores
> > > will do an atomic decrement simultaneously giving a refcnt of -1. We can
> > > then set this back to zero before freeing, however, I'd still prefer to
> > > have refcnt be an accurate value so that it always stays positive, and
> > > we can still set it to "one" on free to avoid having to set on alloc.
> > >
> > Good example. It explains very clearly why my suggested optimization is tricky. My optimization is targeting the most likely scenario
> where m->refcnt is only "one", but at the expense of some added complexity for the unlikely scenarios. (I only suggested 0 as "one"
> because I assumed the value could be zeroed faster than set to 1, so forget about that, if you like.)
> >
> > I still argue that if we keep m->refcnt at "one" value while in the free pool, a write operation can be avoided at both free() and alloc().
> Here is a conceptual example (using m->refcnt==1 for "one"):
> >
> Yes, I agree with you on this.
> 
> > void __rte_mbuf_raw_free(struct rte_mbuf *m)
> > {
> > 	RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
> > 	rte_mempool_put(m->pool, m);
> > }
> >
> > void __rte_pktmbuf_free_seg_inner(struct rte_mbuf *m)
> > {
> > 	/* if this is an indirect mbuf, it is detached. */
> > 	if (RTE_MBUF_INDIRECT(m))
> > 		rte_pktmbuf_detach(m);
> > 	m->next = NULL; /* Preload to avoid setting in alloc(). */
> > 	__rte_mbuf_raw_free(m);
> > }
> >
> > void rte_pktmbuf_free_seg(struct rte_mbuf *m)
> > {
> > 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> > 		/* We have the last reference, so we return it to the free pool. */
> > 		__rte_pktmbuf_free_seg_inner(m);
> > 	} else {
> > 		/* More references, so decrement the refcnt. */
> > 		if (unlikely(rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 			/* Someone raced to decrement the refcnt before us,
> > 			   so now we have the last reference,
> > 			   so we return it to the free pool. */
> > 		    rte_mbuf_refcnt_set(m, 1); /* Preload to avoid in alloc(). */
> > 			__rte_pktmbuf_free_seg_inner(m);
> > 		}
> > 	}
> > }
> >
> >
> > > Also, if we set refcnt on free rather than alloc, it does set itself up
> > > as a good candidate for moving to the second cacheline. Fast-path
> > > processing does not normally update the value.
> > >
> > Agreed. But also consider that TX handling should be optimized too. And this includes free() processing and preloading values for alloc();
> this should only be done in free() instead of alloc() if it provides a total performance improvement. We should not improve RX handling
> performance if the TX handling performance decreases much more.
> 
> Agreed.
> >
> >
> > > > > 5.
> > > > >
> > > > > And here’s something new to think about:
> > > > >
> > > > > m->next already reveals if there are more segments to a packet. Which
> > > purpose does m->nb_segs serve that is not already covered by m->next?
> > >
> > > It is duplicate info, but nb_segs can be used to check the validity of
> > > the next pointer without having to read the second mbuf cacheline.
> > >
> > > Whether it's worth having is something I'm happy enough to discuss,
> > > though.
> > If anybody needs to check for additional segments (m->next != NULL) without hitting the second cache line, a single bit in the first cache
> line will suffice. (If this is a common case, they probably need to read the m->next value anyway, and then it would optimally belong in the
> first cache line. I am just mentioning this. It's not my preference.)
> >
> >
> > > One other point I'll mention is that we need to have a discussion on
> > > how/where to add in a timestamp value into the mbuf. Personally, I think
> > > it can be in a union with the sequence number value, but I also suspect
> > > that 32-bits of a timestamp is not going to be enough for many.
> > >
> > > Thoughts?
> > Speaking for myself, a high resolution RX timestamp is "nice to have", so I like it, but I would hate to sacrifice the sequence number for it.
> Generally, I would assume that the sequence number used for reordering is more important than a timestamp, except for applications
> dealing with timing measurements or high precision packet capturing.
> >
> > Conversion of timestamps from one unit to another can be relatively expensive, so I recommend keeping timestamps in whatever
> resolution the NIC hardware provides, leaving the conversion to e.g. nanoseconds up to the applications that use timestamps. A lot of
> protocols use opaque timestamps, e.g. RTP, so this would not be exceptional. Since the timestamp belongs in the first cache line, I suggest
> we limit it to 32 bit. This can be the least significant 32 bit of a 64 bit NIC hardware timestamp, leaving it up to the application to keep track
> of the most significant 32 bit by some other means. If it's counting nanoseconds, 32 bit can hold 4 seconds.
> >
> 
> I worry that 32-bits is not enough. If we do use a 32-bit value, I also
> think that in many cases we can't just take the low 32-bits. For a 2GHz
> IA CPU the rdtsc() value would wrap around in just 2 seconds if
> truncated to 32 bits, for instance. In a case like that, we may be
> better to take 32-bits from somewhere in the middle - trading accuracy
> for better wrap-around time.

I also think that 32bit for timestamps would not be enough.
Though as we agree that first cache line should be for fields filled by fast-path RX,
I suppose timestamp (and seqn) should be in the second line and shouldn't be filled by PMD itself.
About what should be stored here (tsc value/system clock in ns/something else) -
I think it is better left to particular SW that would fill/use that field to decide.
For me personally tsc seems like the most appropriate.
Konstantin



More information about the dev mailing list