[dpdk-dev] mbuf changes
    Morten Brørup 
    mb at smartsharesystems.com
       
    Fri Oct 28 22:27:37 CEST 2016
    
    
  
> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Richardson, Bruce
> Sent: Friday, October 28, 2016 7:01 PM
> To: Adrien Mazarguil; Morten Brørup
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] mbuf changes
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Adrien Mazarguil
> > Sent: Friday, October 28, 2016 5:50 PM
> > To: Morten Brørup <mb at smartsharesystems.com>
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> > On Fri, Oct 28, 2016 at 04:11:45PM +0200, Morten Brørup wrote:
> > > Comments at the end.
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Pattan,
> > > > Reshma
> > > > Sent: Friday, October 28, 2016 3:35 PM
> > > > To: Olivier Matz
> > > > Cc: dev at dpdk.org; Morten Brørup
> > > > Subject: Re: [dpdk-dev] mbuf changes
> > > >
> > > > Hi Olivier,
> > > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier
> > > > > Matz
> > > > > Sent: Tuesday, October 25, 2016 1:49 PM
> > > > > To: Richardson, Bruce <bruce.richardson at intel.com>; Morten
> > > > > Brørup <mb at smartsharesystems.com>
> > > > > Cc: Adrien Mazarguil <adrien.mazarguil at 6wind.com>; Wiles, Keith
> > > > > <keith.wiles at intel.com>; dev at dpdk.org; Oleg Kuporosov
> > > > > <olegk at mellanox.com>
> > > > > Subject: Re: [dpdk-dev] mbuf changes
> > > > >
> > > > >
> > > > >
> > > > > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > > > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > > > > >> Comments at the end.
> > > > > >>
> > > > > >> Med venlig hilsen / kind regards
> > > > > >> - Morten Brørup
> > > > > >>
> > > > > >>> -----Original Message-----
> > > > > >>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> > > > > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > > > > >>> To: Morten Brørup
> > > > > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier
> > > > > >>> Matz; Oleg Kuporosov
> > > > > >>> Subject: Re: [dpdk-dev] mbuf changes
> > > > > >>>
> > > > > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > > > > >>>> Comments inline.
> > > > > >>>>
> > > > > >>>>> -----Original Message-----
> > > > > >>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > > >>>>> Richardson
> > > > > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > > > > >>>>> To: Adrien Mazarguil
> > > > > >>>>> Cc: Morten Brørup; Wiles, Keith; dev at dpdk.org; Olivier
> > > > > >>>>> Matz; Oleg Kuporosov
> > > > > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > > > >>>>>
> > > > > > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil wrote:
> > > > > > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > > > > >>>>>>> Comments inline.
> > > > > >>>>>>>
> > > > > >>>>>>> Med venlig hilsen / kind regards
> > > > > >>>>>>> - Morten Brørup
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> -----Original Message-----
> > > > > >>>>>>>> From: Adrien Mazarguil
> > > > > >>>>>>>> [mailto:adrien.mazarguil at 6wind.com]
> > > > > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > > > > >>>>>>>> To: Bruce Richardson
> > > > > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev at dpdk.org; Olivier
> > > > > >>>>>>>> Matz; Oleg Kuporosov
> > > > > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > > > > >>>>>>>>
> > > > > > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson wrote:
> > > > > > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > > > > > >>>>>>>> [...]
> > > > > > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup <mb at smartsharesystems.com> wrote:
> > > > > >>>>>>>> [...]
> > > > > >>>>
> > > > > > >>>>>>>>> One other point I'll mention is that we need to have a
> > > > > > >>>>>>>>> discussion on how/where to add in a timestamp value
> > > > > > >>>>>>>>> into the mbuf. Personally, I think it can be in a union
> > > > > > >>>>>>>>> with the sequence number value, but I also suspect that
> > > > > > >>>>>>>>> 32-bits of a timestamp is not going to be enough for
> > > > > > >>>>>>>>> many.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thoughts?
> > > > > >>>>>>>>
> > > > > > >>>>>>>> If we consider that timestamp representation should use
> > > > > > >>>>>>>> nanosecond granularity, a 32-bit value may likely wrap
> > > > > > >>>>>>>> around too quickly to be useful. We can also assume that
> > > > > > >>>>>>>> applications requesting timestamps may care more about
> > > > > > >>>>>>>> latency than throughput; Oleg found that using the second
> > > > > > >>>>>>>> cache line for this purpose had a noticeable impact [1].
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > > > > >>>>>>>
> > > > > > >>>>>>> I agree with Oleg about the latency vs. throughput
> > > > > > >>>>>>> importance for such applications.
> > > > > > >>>>>>>
> > > > > > >>>>>>> If you need high resolution timestamps, consider them to
> > > > > > >>>>>>> be generated by the NIC RX driver, possibly by the
> > > > > > >>>>>>> hardware itself
> > > > > > >>>>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-stamp),
> > > > > > >>>>>>> so the timestamp belongs in the first cache line. And I am
> > > > > > >>>>>>> proposing that it should have the highest possible
> > > > > > >>>>>>> accuracy, which makes the value hardware dependent.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > > > > > >>>>>>> application to keep track of the slowly moving bits (i.e.
> > > > > > >>>>>>> counting whole seconds, hours and calendar date) out of
> > > > > > >>>>>>> band, so we don't use precious space in the mbuf. The
> > > > > > >>>>>>> application doesn't need the NIC RX driver's fast path to
> > > > > > >>>>>>> capture which date (or even which second) a packet was
> > > > > > >>>>>>> received. Yes, it adds complexity to the application, but
> > > > > > >>>>>>> we can't set aside 64 bit for a generic timestamp. Or as a
> > > > > > >>>>>>> weird tradeoff: Put the fast moving 32 bit in the first
> > > > > > >>>>>>> cache line and the slow moving 32 bit in the second cache
> > > > > > >>>>>>> line, as a placeholder for the application to fill out if
> > > > > > >>>>>>> needed.
> > > > > > >>>>>>> Yes, it means that the application needs to check the time
> > > > > > >>>>>>> and update its variable holding the slow moving time once
> > > > > > >>>>>>> every second or so; but that should be doable without
> > > > > > >>>>>>> significant effort.
> > > > > >>>>>>
> > > > > > >>>>>> That's a good point, however without a 64 bit value,
> > > > > > >>>>>> elapsed time between two arbitrary mbufs cannot be
> > > > > > >>>>>> measured reliably due to not enough context, one way or
> > > > > > >>>>>> another the low resolution value is also needed.
> > > > > > >>>>>>
> > > > > > >>>>>> Obviously latency-sensitive applications are unlikely to
> > > > > > >>>>>> perform lengthy buffering and require this but I'm not
> > > > > > >>>>>> sure about all the possible use-cases. Considering many
> > > > > > >>>>>> NICs expose 64 bit timestamps, I suggest we do not
> > > > > > >>>>>> truncate them.
> > > > > >>>>>>
> > > > > >>>>>> I'm not a fan of the weird tradeoff either, PMDs will be
> > > > > >>>>>> tempted to fill the extra 32 bits whenever they can and
> > > > > >>>>>> negate the performance improvement of the first cache line.
> > > > > >>>>>
> > > > > > >>>>> I would tend to agree, and I don't really see any
> > > > > > >>>>> convenient way to avoid putting in a 64-bit field for the
> > > > > > >>>>> timestamp in cache-line 0.
> > > > > > >>>>> If we are ok with having this overlap/partially overlap
> > > > > > >>>>> with sequence number, it will use up an extra 4B of
> > > > > > >>>>> storage in that cacheline.
> > > > > >>>>
> > > > > > >>>> I agree about the lack of convenience! And Adrien certainly
> > > > > > >>>> has a point about PMD temptations.
> > > > > > >>>>
> > > > > > >>>> However, I still don't think that a NIC's ability to
> > > > > > >>>> date-stamp a packet is sufficient reason to put a
> > > > > > >>>> date-stamp in cache line 0 of the mbuf. Storing only the
> > > > > > >>>> fast moving 32 bit in cache line 0 seems like a good
> > > > > > >>>> compromise to me.
> > > > > > >>>>
> > > > > > >>>> Maybe you can find just one more byte, so it can hold 17
> > > > > > >>>> minutes with nanosecond resolution. (I'm joking!)
> > > > > > >>>>
> > > > > > >>>> Please don't sacrifice the sequence number for the
> > > > > > >>>> seconds/hours/days part of a timestamp. Maybe it could be
> > > > > > >>>> configurable to use a 32 bit or 64 bit timestamp.
> > > > > >>>>
> > > > > > >>> Do you see both timestamp and sequence numbers being used
> > > > > > >>> together? I would have thought that apps would either use
> > > > > > >>> one or the other?
> > > > > > >>> However, your suggestion is workable in any case, to allow
> > > > > > >>> the sequence number to overlap just the high 32 bits of the
> > > > > > >>> timestamp, rather than the low.
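
To make the overlap concrete, here is a minimal sketch of such a layout. The struct and its field names are hypothetical and are not the real struct rte_mbuf; the sketch only assumes that the low 32 bits of the timestamp are always valid, while the high 32 bits share storage with a 32-bit sequence number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout for illustration; not the actual rte_mbuf. */
struct ts_seqn_field {
    uint32_t ts_lo;      /* fast moving low 32 bits, always valid */
    union {
        uint32_t ts_hi;  /* high 32 bits when 64-bit timestamps are in use */
        uint32_t seqn;   /* sequence number when they are not */
    } u;
};

/* The overlap keeps the combined field at 8 bytes. */
static_assert(sizeof(struct ts_seqn_field) == 8, "field must stay 8 bytes");

/* Read the full 64-bit timestamp; only meaningful in 64-bit mode. */
static inline uint64_t ts_get64(const struct ts_seqn_field *f)
{
    return ((uint64_t)f->u.ts_hi << 32) | f->ts_lo;
}

/* Store a full 64-bit timestamp; this overwrites the sequence number. */
static inline void ts_set64(struct ts_seqn_field *f, uint64_t ts)
{
    f->ts_lo = (uint32_t)ts;
    f->u.ts_hi = (uint32_t)(ts >> 32);
}
```

With this arrangement an application that wants both a sequence number and a timestamp still gets the low 32 bits of the timestamp untouched, which is the point of overlapping the high half rather than the low half.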
> > > > > >>
> > > > > > >> In our case, I can foresee sequence numbers used for packet
> > > > > > >> processing and timestamps for timing analysis (and possibly
> > > > > > >> for packet capturing, when being used). For timing analysis,
> > > > > > >> we don’t need long durations, e.g. 4 seconds with 32 bit
> > > > > > >> nanosecond resolution suffices. And for packet capturing we
> > > > > > >> are perfectly capable of adding the slowly moving 32 bit of
> > > > > > >> the timestamp to our output data stream without fetching it
> > > > > > >> from the mbuf.
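
As a minimal sketch of why 32 bits suffice for short intervals: unsigned subtraction is modulo 2^32, so the difference between two 32-bit timestamps stays correct across a wraparound, provided the true interval is shorter than 2^32 ticks (about 4.29 seconds at 1 ns per tick). The function name is illustrative only.

```c
#include <stdint.h>

/* Elapsed ticks between two 32-bit timestamps. Correct across one
 * wraparound as long as the real interval is below 2^32 ticks. */
static inline uint32_t ts32_elapsed(uint32_t earlier, uint32_t later)
{
    return later - earlier; /* unsigned arithmetic wraps modulo 2^32 */
}
```

Intervals longer than the wrap period alias to shorter ones, which is exactly the ambiguity the 64-bit proponents in this thread are concerned about.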
> > > > > >>
> > > > >
> > > > > > We should keep in mind that today we have the seqn field but it
> > > > > > is not used by any PMD. In case it is implemented, would it be a
> > > > > > per-queue sequence number?
> > > > > > Is it useful from an application view?
> > > > > >
> > > > > > This field is only used by the librte_reorder library, and in my
> > > > > > opinion, we should consider moving it in the second cache line
> > > > > > since it is not filled by the PMD.
> > > > >
> > > > >
> > > > > > > For the 32-bit timestamp case, it might be useful to have a
> > > > > > > right-shift value passed in to the ethdev driver. If we assume
> > > > > > > a NIC with nanosecond resolution, (or TSC value with resolution
> > > > > > > of that order of magnitude), then the app can choose to have
> > > > > > > 1 ns resolution with 4 second wraparound, or alternatively 4ns
> > > > > > > resolution with 16 second wraparound, or even microsecond
> > > > > > > resolution with wrap around of over an hour.
> > > > > > > The cost is obviously just a shift op in the driver code per
> > > > > > > packet - hopefully with multiple packets done at a time using
> > > > > > > vector operations.
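
The trade-off between resolution and wraparound can be sketched as follows, assuming a 64-bit nanosecond source value. The helper names are hypothetical and not part of any ethdev API. A shift of 0 gives 1 ns ticks and a wrap period of about 4.3 s, a shift of 2 gives 4 ns ticks and about 17.2 s, and a shift of 10 gives 1024 ns ticks (roughly a microsecond) and about 73 minutes.

```c
#include <assert.h>
#include <stdint.h>

/* Drop the `shift` low-order bits of a nanosecond timestamp and keep
 * the next 32 bits. This is the per-packet cost: one shift. */
static inline uint32_t ts_shift32(uint64_t ts_ns, unsigned int shift)
{
    assert(shift < 32);
    return (uint32_t)(ts_ns >> shift);
}

/* Duration of one tick of the shifted timestamp, in nanoseconds. */
static inline uint64_t ts_resolution_ns(unsigned int shift)
{
    assert(shift < 32);
    return UINT64_C(1) << shift;
}

/* Time until the 32-bit shifted timestamp wraps, in nanoseconds. */
static inline uint64_t ts_wrap_ns(unsigned int shift)
{
    assert(shift < 32);
    return UINT64_C(1) << (32 + shift);
}
```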
> > > > >
> > > > >
> > > > > > About the timestamp, we can manage to find 64 bits in the first
> > > > > > cache line, without sacrificing any field we have today. The
> > > > > > question is more for the fields we may want to add later.
> > > > > >
> > > > > > To answer to the question of the size of the timestamp, the
> > > > > > first question is to know what is the precision required for the
> > > > > > applications using it?
> > > >
> > > > As part of the 17.02 latency calculation feature, the requirement
> > > > is to report latency in nanoseconds. So, +1 on keeping timestamp
> > > > > as 64 bits.
> > > > Since the feature is planned for 17.02, can we finalize on the
> > > > timestamp position in the mbuf struct?
> > > > Based on the decision, I am planning to make the change and send
> > > > ABI notice if needed.
> > > >
> > > > Reshma
> > > >
> > > > >
> > > > > I don't quite like the idea of splitting the timestamp in the 2
> > > > > cache lines, I think it would not be easy to use.
> > > > >
> > > > >
> > > > > Olivier
> > >
> > > Nanosecond precision in latency calculations does not require more
> > > than 32 bit, unless we must also be able to measure more than 4
> > > seconds of latency. And please consider how many packets we would
> > > need to hold in memory to keep track of 4 seconds of traffic on a
> > > 100 Gbit/s link (if we are measuring network latency): With an
> > > average packet size of 300 B that would be about 40 million packets
> > > per second, i.e. roughly 160 million packets in total.
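
A rough check of this kind of estimate, as a sketch that ignores Ethernet preamble and inter-frame gap (the function name is illustrative). Under those assumptions a 100 Gbit/s link carries about 41.7 million 300-byte packets per second, which comes to about 167 million packets over a 4 second window.

```c
#include <stdint.h>

/* Number of average-sized packets a link carries in `seconds`,
 * ignoring per-frame overhead such as preamble and inter-frame gap. */
static inline uint64_t packets_in_window(uint64_t link_bps,
                                         uint64_t avg_pkt_bytes,
                                         uint64_t seconds)
{
    uint64_t bytes_per_sec = link_bps / 8;
    return bytes_per_sec * seconds / avg_pkt_bytes;
}
```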
> >
> > Consider another, perhaps more realistic use-case: keeping a few
> > relevant packets from a stream for later analysis, which could occur
> > outside of the normal TX processing path for performance reasons. In
> > this case 4 seconds worth of context may not be enough to avoid losing
> > information.
Agreed. And I would say that this exception path can add the higher 32 bit itself; it is not required by the PMD or even in the mbuf.
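
A minimal sketch of how such an exception path could add the high bits itself, assuming the application has a 64-bit reference time taken from the same clock within half a wrap period (2^31 ticks, about 2.1 seconds at 1 ns per tick) of the packet's timestamp. The function name and the way the reference is obtained are assumptions, not an existing DPDK API.

```c
#include <stdint.h>

/* Reconstruct a 64-bit timestamp from its low 32 bits by choosing the
 * 64-bit value closest to `ref64` whose low half equals `ts32`.
 * Valid only if the packet time is within 2^31 ticks of the reference. */
static inline uint64_t ts32_to_64(uint64_t ref64, uint32_t ts32)
{
    uint32_t diff = ts32 - (uint32_t)ref64; /* modulo 2^32 */

    if (diff < UINT32_C(0x80000000))
        return ref64 + diff;                /* packet is after the reference */

    /* Packet is before the reference: step back by 2^32 - diff ticks. */
    return ref64 - (uint64_t)(uint32_t)(0u - diff);
}
```

The reference only needs refreshing about once per second, so the cost stays out of the per-packet fast path.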
> >
> > > If I recall correctly, the 17.02 feature is about measuring
> > > application latency, not network latency (as my calculation above is
> > > about). I don't consider application latency measuring a common
> > > feature for most applications, except for application
> > > debugging/profiling purposes (not network debugging purposes). So do
> > > we really need both nanosecond precision and the ability to measure
> > > 4+ seconds of latency?
> >
> > Existing HW can already provide 64-bit precision, why truncate part of
> > it?
> > The only valid reason would be that there is not enough room in the
> > mbuf header, which is not the case. Looks like today there is even
> > enough room in the first 64 bytes.
> >
> > As suggested by Bruce, we can even union (part of) this field with
> > another if necessary, as long as we make sure both cannot be requested
> > at the same time (determining which field should be sacrificed is
> > going to be tricky though).
It would make some applications easier to develop if the mbuf contained a 64 bit timestamp; but I don't think these are very common applications. It's a tradeoff, and I'm advocating for conserving the very precious space in the mbuf, at the cost of requiring some extra work in these uncommon applications.
> >
> > > Furthermore, if the timestamp isn't set by hardware, why not just
> > > take the entire array of packets pulled out from the NIC and set
> > > their timestamp to the same value, instead of setting them one by
> > > one in the PMD - that would also consolidate the expense of
> > > obtaining the time from some clock source.
> >
> > Like any offload, some PMD processing is expected to convert the
> > possibly HW-specific value to a common mbuf format, however if
> > hardware does not report any kind of information relevant to timestamp
> > individual packets, I do not think the PMD should expose that feature
> > and implement it in software. It would be much worse, certainly not
> > what the application wanted as it could have done the same on its own.
Agree. And I would go so far as to say that this should be a guiding principle for DPDK. If the NIC HW doesn't support something, the PMD shouldn't try to emulate it in software; instead, DPDK could provide various libraries to offer these features in software, like the packet type library mentioned by Olivier at the recent DPDK Userspace conference.
> >
> > What matters is to know when NIC receives a packet, not when the
> > application fetches it. You should assume there is a significant delay
> > between these two events, hence the need for it to be offloaded.
Agree.
> >
> 
> I would actually see both use cases being used. In some cases, NICs will
> have hardware offload capability for timestamping which can be used, while
> in many cases, timestamping on retrieval by the core will suffice for
> latency measurements within an app. I don't think we can discount either
> possibility. Thankfully, AFAIK which model is used doesn't really affect
> whether we need 32 or 64 bits for the data.
Agreed about 32/64 bit choice not being affected by this.
But take note of Adrien's comment about the potential delay between the packet arriving at the NIC and the application getting round to calling the PMD to fetch it. Software-based timestamps may give a false impression of the application latency. Assume that the application polls the PMD and bulk fetches all the packets, then proceeds to process them, and then starts over. If a packet arrives just after the first batch was fetched, a software timestamp in the PMD will record the time when the application starts fetching the next batch of packets, ignoring all the time the packet spent waiting to be fetched while the application processed the first batch of packets.
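
A simplified model of this effect, assuming for illustration that the application polls at a fixed period (real loops vary with batch size). The function names are hypothetical. The "hidden wait" is the part of the latency that a timestamp taken at poll time cannot see.

```c
#include <assert.h>
#include <stdint.h>

/* Time of the first poll at or after `arrival_ns`, with polls at
 * 0, period_ns, 2 * period_ns, ... */
static inline uint64_t next_poll_ns(uint64_t arrival_ns, uint64_t period_ns)
{
    assert(period_ns > 0);
    return (arrival_ns + period_ns - 1) / period_ns * period_ns;
}

/* Time the packet waited in the NIC queue before a software timestamp
 * could be applied; this wait is missing from the measured latency. */
static inline uint64_t hidden_wait_ns(uint64_t arrival_ns, uint64_t period_ns)
{
    return next_poll_ns(arrival_ns, period_ns) - arrival_ns;
}
```

For example, with a 100 microsecond loop, a packet arriving 1 ns after a poll waits 99999 ns before it is stamped, so nearly a full loop iteration goes unmeasured.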
> 
> /Bruce
-Morten
    
    