[dpdk-dev] mbuf changes

Bruce Richardson bruce.richardson at intel.com
Tue Oct 25 14:20:09 CEST 2016


On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> Comments inline.
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, October 25, 2016 1:14 PM
> > To: Adrien Mazarguil
> > Cc: Morten Brørup; Wiles, Keith; dev at dpdk.org; Olivier Matz; Oleg
> > Kuporosov
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > Comments inline.
> > > >
> > > > Med venlig hilsen / kind regards
> > > > - Morten Brørup
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > > > Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > To: Bruce Richardson
> > > > > Cc: Wiles, Keith; Morten Brørup; dev at dpdk.org; Olivier Matz; Oleg
> > > > > Kuporosov
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > > On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > > [...]
> > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb at smartsharesystems.com> wrote:
> > > > > [...]
> > > > > > > > 5.
> > > > > > > >
> > > > > > > > And here’s something new to think about:
> > > > > > > >
> > > > > > > > m->next already reveals if there are more segments to a
> > > > > > > > packet. Which purpose does m->nb_segs serve that is not
> > > > > > > > already covered by m->next?
> > > > > >
> > > > > > It is duplicate info, but nb_segs can be used to check the
> > > > > > validity of the next pointer without having to read the second
> > > > > > mbuf cacheline.
> > > > > >
> > > > > > Whether it's worth having is something I'm happy enough to
> > > > > > discuss, though.
> > > > >
> > > > > Although slower in some cases than a full blown "next packet"
> > > > > pointer, nb_segs can also be conveniently abused to link several
> > > > > packets and their segments in the same list without wasting space.
> > > >
> > > > I don’t understand that; can you please elaborate? Are you abusing
> > > > m->nb_segs as an index into an array in your application? If that is
> > > > the case, and it is endorsed by the community, we should get rid of
> > > > m->nb_segs and add a member for application specific use instead.
> > >
> > > Well, that's just an idea, I'm not aware of any application using
> > > this, however the ability to link several packets with segments seems
> > > useful to me (e.g. buffering packets). Here's a diagram:
> > >
> > >  .-----------.   .-----------.   .-----------.   .-----------.   .------
> > >  | pkt 0     |   | seg 1     |   | seg 2     |   | pkt 1     |   | pkt 2
> > >  |      next --->|      next --->|      next --->|      next --->| ...
> > >  | nb_segs 3 |   | nb_segs 1 |   | nb_segs 1 |   | nb_segs 1 |   |
> > >  `-----------'   `-----------'   `-----------'   `-----------'   `------
> 
> I see. It makes it possible to refer to a burst of packets (with segments or not) by a single mbuf reference, as an alternative to the current design pattern of using an array and length (struct rte_mbuf **mbufs, unsigned count).
> 
> This would require implementation in the PMDs etc.
> 
> And even in this case, m->nb_segs does not need to be an integer: it could be replaced by a single bit indicating whether the segment is a continuation of a packet or the beginning (alternatively, the end) of one, i.e. the bit would be set on either the first or the last segment of each packet.
> 
> It is an almost equivalent alternative to the fundamental design pattern of using an array of mbufs with a count, which is widely implemented in DPDK. And m->next still lives in the second cache line, so I don't see any gain from this.
> 
> I still don't get how m->nb_segs can be abused without m->next.
> 
> 
> > > > > > One other point I'll mention is that we need to have a
> > > > > > discussion on how/where to add in a timestamp value into the
> > > > > > mbuf. Personally, I think it can be in a union with the sequence
> > > > > > number value, but I also suspect that 32 bits of a timestamp is
> > > > > > not going to be enough for many.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > If we consider that timestamp representation should use nanosecond
> > > > > granularity, a 32-bit value may likely wrap around too quickly to
> > > > > be useful. We can also assume that applications requesting
> > > > > timestamps may care more about latency than throughput; Oleg found
> > > > > that using the second cache line for this purpose had a noticeable
> > > > > impact [1].
> > > > >
> > > > >  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > >
> > > > I agree with Oleg about the latency vs. throughput importance for
> > > > such applications.
> > > >
> > > > If you need high resolution timestamps, consider them to be generated
> > > > by the NIC RX driver, possibly by the hardware itself
> > > > (http://w3new.napatech.com/features/time-precision/hardware-time-stamp),
> > > > so the timestamp belongs in the first cache line. And I am proposing
> > > > that it should have the highest possible accuracy, which makes the
> > > > value hardware dependent.
> > > >
> > > > Furthermore, I am arguing that we leave it up to the application to
> > > > keep track of the slowly moving bits (i.e. counting whole seconds,
> > > > hours and calendar date) out of band, so we don't use precious space
> > > > in the mbuf. The application doesn't need the NIC RX driver's fast
> > > > path to capture which date (or even which second) a packet was
> > > > received. Yes, it adds complexity to the application, but we can't
> > > > set aside 64 bits for a generic timestamp. Or as a weird tradeoff:
> > > > put the fast moving 32 bits in the first cache line and the slow
> > > > moving 32 bits in the second cache line, as a placeholder for the
> > > > application to fill out if needed. Yes, it means that the application
> > > > needs to check the time and update its variable holding the slow
> > > > moving time once every second or so; but that should be doable
> > > > without significant effort.
> > >
> > > That's a good point, however without a 64 bit value, elapsed time
> > > between two arbitrary mbufs cannot be measured reliably due to not
> > > enough context; one way or another the low resolution value is also
> > > needed.
> > >
> > > Obviously latency-sensitive applications are unlikely to perform
> > > lengthy buffering and require this, but I'm not sure about all the
> > > possible use-cases. Considering many NICs expose 64-bit timestamps,
> > > I suggest we do not truncate them.
> > >
> > > I'm not a fan of the weird tradeoff either; PMDs will be tempted to
> > > fill the extra 32 bits whenever they can and negate the performance
> > > improvement of the first cache line.
> > 
> > I would tend to agree, and I don't really see any convenient way to
> > avoid putting in a 64-bit field for the timestamp in cache-line 0. If
> > we are ok with having this overlap/partially overlap with sequence
> > number, it will use up an extra 4B of storage in that cacheline.
> 
> I agree about the lack of convenience! And Adrien certainly has a point about PMD temptations.
> 
> However, I still don't think that a NIC's ability to date-stamp a packet is sufficient reason to put a date-stamp in cache line 0 of the mbuf. Storing only the fast moving 32 bits in cache line 0 seems like a good compromise to me.
> 
> Maybe you can find just one more byte, so it can hold 17 minutes with nanosecond resolution. (I'm joking!)
> 
> Please don't sacrifice the sequence number for the seconds/hours/days part of a timestamp. Maybe it could be configurable to use a 32-bit or 64-bit timestamp.
> 
Do you see both timestamps and sequence numbers being used together? I
would have thought that apps would use either one or the other. However,
your suggestion is workable in any case: allow the sequence number to
overlap just the high 32 bits of the timestamp, rather than the low.

/Bruce

