[dpdk-dev] [RFC v2] doc compression API for DPDK

Trahe, Fiona fiona.trahe at intel.com
Wed Feb 21 15:35:27 CET 2018


Hi Ahmed, Shally,


> -----Original Message-----
> From: Ahmed Mansour [mailto:ahmed.mansour at nxp.com]
> Sent: Tuesday, February 20, 2018 7:56 PM
> To: Verma, Shally <Shally.Verma at cavium.com>; Trahe, Fiona <fiona.trahe at intel.com>; dev at dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya at cavium.com>; Gupta, Ashish
> <Ashish.Gupta at cavium.com>; Sahu, Sunila <Sunila.Sahu at cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch at intel.com>; Challa, Mahipal <Mahipal.Challa at cavium.com>; Jain, Deepak K
> <deepak.k.jain at intel.com>; Hemant Agrawal <hemant.agrawal at nxp.com>; Roy Pledge
> <roy.pledge at nxp.com>; Youri Querry <youri.querry_1 at nxp.com>
> Subject: Re: [RFC v2] doc compression API for DPDK
> 
> /// snip ///
> >>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
> >>>>>>>>>>>>>>> ---------------------------------------------------------------
> >>>>>>>>>>>>>>> It is always an ideal expectation from the application that it should
> >>>>>>>>>>>>>>> gather all related chunks of source data into one mbuf-chain and enqueue
> >>>>>>>>>>>>>>> it for stateless processing.
> >>>>>>>>>>>>>>> However, if it needs to break it into several enqueue_burst() calls, then
> >>>>>>>>>>>>>>> an expected call flow would be something like:
> >>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD. The user will call dequeue
> >>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op needs to be dequeued. However
> >>>>>>>>>>>>> this illustration is specifically in the context of stateful op
> >>>>>>>>>>>>> processing, to reflect that if a stream is broken into chunks, then each
> >>>>>>>>>>>>> chunk should be submitted as one op at a time with type = STATEFUL, and
> >>>>>>>>>>>>> needs to be dequeued before the next chunk is enqueued.
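> >>>>>>>>>>>>> A minimal sketch of that per-chunk flow (illustration only; the
> >>>>>>>>>>>>> burst API, flush flag and status names below are as proposed in
> >>>>>>>>>>>>> this RFC and may change):
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   for (i = 0; i < nb_chunks; i++) {
> >>>>>>>>>>>>>           /* point op->m_src/m_dst at chunk i's buffers here */
> >>>>>>>>>>>>>           op->flush_flag = (i == nb_chunks - 1) ?
> >>>>>>>>>>>>>                   RTE_COMP_FLUSH_FULL : RTE_COMP_FLUSH_NONE;
> >>>>>>>>>>>>>           /* exactly one op of this stream in flight at a time */
> >>>>>>>>>>>>>           while (rte_compressdev_enqueue_burst(dev, qp, &op, 1) == 0)
> >>>>>>>>>>>>>                   ; /* retry until the qp accepts it */
> >>>>>>>>>>>>>           while (rte_compressdev_dequeue_burst(dev, qp, &op, 1) == 0)
> >>>>>>>>>>>>>                   ; /* dequeue before enqueuing the next chunk */
> >>>>>>>>>>>>>           if (op->status != RTE_COMP_OP_STATUS_SUCCESS)
> >>>>>>>>>>>>>                   break; /* e.g. OUT_OF_SPACE handling */
> >>>>>>>>>>>>>   }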
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>>>>>>>>> dequeue_burst(op) // should dequeue before we enqueue next
> >>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
> >>>>>>>>>>>>>> [Ahmed] Why not allow multiple work items in flight? I understand that
> >>>>>>>>>>>>>> occasionally there will be an OUT_OF_SPACE exception. Can we just
> >>>>>>>>>>>>>> distinguish the response in exception cases?
> >>>>>>>>>>>>> [Shally] Multiple ops are allowed in flight, however the condition is that
> >>>>>>>>>>>>> each op in such a case is independent of the others, i.e. they belong to
> >>>>>>>>>>>>> different streams altogether.
> >>>>>>>>>>>>> Earlier (as part of the RFC v1 doc) we did consider the proposal to
> >>>>>>>>>>>>> process all related chunks of data in a single burst by passing them as an
> >>>>>>>>>>>>> ops array, but later found that not-so-useful for PMD handling for various
> >>>>>>>>>>>>> reasons. You may please refer to the RFC v1 doc review comments for the same.
> >>>>>>>>>>>> [Fiona] Agree with Shally. In summary, only one op can be processed at a
> >>>>>>>>>>>> time, since each needs the state of the previous. To allow more than 1 op
> >>>>>>>>>>>> to be in flight at a time would force PMDs to implement internal queueing
> >>>>>>>>>>>> and exception handling for the OUT_OF_SPACE conditions you mention.
> >>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
> >>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> >>>>>>>>>> complex but doable.
> >>>>>>>>> [Fiona] In my opinion this is not doable, or would be very inefficient.
> >>>>>>>>> There may be many streams.
> >>>>>>>>> The PMD would have to have an internal queue per stream so
> >>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> >>>>>>>>> And this may ripple back through all subsequent ops in the stream as each
> >>>>>>>>> source len is increased and its dst buffer is not big enough.
> >>>>>>>> [Ahmed] Regarding multi-op OUT_OF_SPACE handling.
> >>>>>>>> The caller would still need to adjust the src length/output buffer as you
> >>>>>>>> say. The PMD cannot handle OUT_OF_SPACE internally.
> >>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
> >>>>>>>> until it gets explicit confirmation from the caller to continue working
> >>>>>>>> on this stream. Any ops received by the PMD should be returned to the
> >>>>>>>> caller with status STREAM_PAUSED, since the caller did not explicitly
> >>>>>>>> acknowledge that it has solved the OUT_OF_SPACE issue.
> >>>>>>>> These semantics can be enabled by adding a new function to the API,
> >>>>>>>> perhaps stream_resume().
> >>>>>>>> This allows the caller to indicate that it acknowledges that it has seen
> >>>>>>>> the issue, and this op should be used to resolve the issue.
> >>>>>>>> Implementations that do not support this mode of use can push back
> >>>>>>>> immediately after one op is in flight. Implementations that support this
> >>>>>>>> use mode can allow many ops from the same session.
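> >>>>>>>> As a sketch of the semantics I have in mind (stream_resume() and the
> >>>>>>>> STREAM_PAUSED status are hypothetical, not in the current RFC; the
> >>>>>>>> fix-up/resubmit helpers are app-defined):
> >>>>>>>>
> >>>>>>>>   nb = rte_compressdev_dequeue_burst(dev, qp, ops, burst_sz);
> >>>>>>>>   for (i = 0; i < nb; i++) {
> >>>>>>>>           switch (ops[i]->status) {
> >>>>>>>>           case RTE_COMP_OP_STATUS_OUT_OF_SPACE_RECOVERABLE:
> >>>>>>>>                   app_fix_dst_buffer(ops[i]);     /* grow/replace dst */
> >>>>>>>>                   rte_comp_stream_resume(stream); /* explicit ack */
> >>>>>>>>                   app_resubmit(ops[i]);
> >>>>>>>>                   break;
> >>>>>>>>           case RTE_COMP_OP_STATUS_STREAM_PAUSED:
> >>>>>>>>                   /* rejected while paused; resubmit after the resume */
> >>>>>>>>                   app_resubmit(ops[i]);
> >>>>>>>>                   break;
> >>>>>>>>           default:
> >>>>>>>>                   break;
> >>>>>>>>           }
> >>>>>>>>   }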
> >>>>>>>>
> >>>>>>> [Shally] Is it still in the context of having a single burst where all ops
> >>>>>>> belong to one stream? If yes, I would still say it would add overhead to
> >>>>>>> PMDs, especially if it is expected to work closer to HW (which I think is
> >>>>>>> the case with a DPDK PMD).
> >>>>>>> Though your approach is doable, why can't all of this be in a layer above
> >>>>>>> the PMD? i.e. a layer above the PMD can either pass one op at a time with
> >>>>>>> burst size = 1, OR can make a chained mbuf of input and output and pass
> >>>>>>> that as one op.
> >>>>>>> Is it just to ease the application's chained-mbuf burden, or do you see
> >>>>>>> any performance/use-case impacting aspect also?
> >>>>>>>
> >>>>>>> If it is in a context where each op belongs to a different stream in a
> >>>>>>> burst, then why do we need stream_pause and resume? The expectation on the
> >>>>>>> app is to pass a larger output buffer, with consumed + 1, from the next
> >>>>>>> call onwards, as it has already seen OUT_OF_SPACE.
> >>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
> >>>>> implementation rejects all ops that belong to a stream that has entered
> >>>>> "RECOVERABLE" state for one reason or another. The caller must
> >>>>> acknowledge explicitly that it has received news of the problem before
> >>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
> >>>>> that implementing this functionality in the software layer above the PMD
> >>>>> is a bad idea since the latency reductions are lost.
> >>>> [Shally] Just reiterating, I rather meant the other way around, i.e. I see it
> >>>> easier to put all such complexity in a layer above the PMD.
> >>>>
> >>>>> This setup is useful in latency sensitive applications where the latency
> >>>>> of buffering multiple ops into one op is significant. We found latency
> >>>>> makes a significant difference in search applications where the PMD
> >>>>> competes with software decompression.
> >>> [Fiona] I see, so when all goes well, you get best-case latency, but when
> >>> out-of-space occurs latency will probably be worse.
> >> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
> >> rare occurrence. Recovering from it should take similar time to
> >> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
> >> both sync and async use. The caller can fix up the op and send it back
> >> to the PMD to continue work just as would be done in sync. Nonetheless,
> >> the added complexity is not justifiable if out-of-space is very common
> >> since the recoverable state will be the limiting factor that forces
> >> synchronicity.
> >>>>>> [Fiona] I still have concerns with this and would not want to support it in our PMD.
> >>>>>> To make sure I understand, you want to send a burst of ops, with several from the same stream.
> >>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
> >>>>>> subsequent ops in that stream.
> >>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
> >>>>>> Or somehow drop them? How?
> >>>>>> While still processing ops from other streams.
> >>>>> [Ahmed] This is exactly correct. It should return them with
> >>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
> >>>>>> As we want to offload each op to hardware with as little CPU processing as
> >>>>>> possible, we would not want to open up each op to see which stream it's
> >>>>>> attached to and make decisions to do per-stream storage, or drop it, or
> >>>>>> bypass hw and dequeue without processing.
> >>>>> [Ahmed] I think I might have missed your point here, but I will try to
> >>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
> >>>>> to the PMD and the PMD should reject until stream_continue() is called.
> >>>>> The next op to be sent by the user will have a special marker in it to
> >>>>> inform the PMD to continue working on this stream. Alternatively the
> >>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
> >>>>> the stream and its state, but like you say this adds additional CPU
> >>>>> overhead during the enqueue.
> >>>>> I am curious: in a simple synchronous use case, how do we prevent users
> >>>>> from putting multiple ops in flight that belong to a single stream? Do
> >>>>> we just currently say it is undefined behavior? Otherwise we would have
> >>>>> to check the stream and incur the CPU overhead.
> >>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on data path in
> >>> DPDK model we expect good behaviour and don't have to error check for things like this.
> >> [Ahmed] This makes sense. We also assume good behavior.
> >>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
> >>> build and send those messages. If we found an op from a stream which already
> >>> had one in flight, we'd have to hold that back, store it in a sw stream-specific holding queue,
> >>> only send 19 to hw. We cannot send multiple ops from same stream to
> >>> the hw as it fans them out and does them in parallel.
> >>> Once the enqueue_burst() returns, there is no processing
> >>> context which would spot that the first has completed
> >>> and send the next op to the hw. On a dequeue_burst() we would spot this,
> >>> in that context could process the next op in the stream.
> >>> On out of space, instead of processing the next op we would have to transfer
> >>> all unprocessed ops from the stream to the dequeue result.
> >>> Some parts of this are doable, but seems likely to add a lot more latency,
> >>> we'd need to add extra threads and timers to move ops from the sw
> >>> queue to the hw q to get any benefit, and these constructs would add
> >>> context switching and CPU cycles. So we prefer to push this responsibility
> >>> to above the API, which can achieve similar results.
> >> [Ahmed] I see what you mean. Our workflow is almost exactly the same
> >> with our hardware, but the fanning out is done by the hardware based on
> >> the stream and ops that belong to the same stream are never allowed to
> >> go out of order. Otherwise the data would be corrupted. Likewise the
> >> hardware is responsible for checking the state of the stream and
> >> returning frames as NOT_PROCESSED to the software.
> >>>>>> Maybe we could add a capability if this behaviour is important for you?
> >>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
> >>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
> >>>>>> to be in flight at any time.
> >>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
> >>>>> not have to add extra checking for unsupported cases.
> >>>> [Shally] @ahmed, if I summarise your use-case, is this how you want the PMD to support it?
> >>>> - a burst *carries only one stream* and all ops are then assumed to belong to that stream?
> >>>> (please note, here a burst is not carrying more than one stream)
> >> [Ahmed] No. In this use case the caller sets up an op and enqueues a
> >> single op. Then before the response comes back from the PMD the caller
> >> enqueues a second op on the same stream.
> >>>> -PMD will submit one op at a time to HW?
> >> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
> >> HW. I used DPDK to mean the software implementation that talks to the
> >> hardware.
> >> The software will submit all ops immediately. The hardware has to figure
> >> out what to do with the ops depending on what stream they belong to.
> >>>> -if processed successfully, push it back to the completion queue with status = SUCCESS. If it
> >>>> failed or ran into OUT_OF_SPACE, then push it to the completion queue with status = FAILURE/
> >>>> OUT_OF_SPACE_RECOVERABLE and the rest with status = NOT_PROCESSED, and return with enqueue
> >>>> count = total # of ops submitted originally with the burst?
> >> [Ahmed] This is exactly what I had in mind. All ops will be submitted to
> >> the HW. The HW will put all of them on the completion queue with the
> >> correct status, exactly as you say.
> >>>> -app assumes all have been enqueued, so it goes and dequeues all ops
> >>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, the app resubmits a burst of ops with a call to
> >>>> the stream_continue/resume API, starting from the op which encountered OUT_OF_SPACE and the
> >>>> others as NOT_PROCESSED, with updated input and output buffers?
> >> [Ahmed] Correct, this is what we do today in our proprietary API.
> >>>> -repeat until *all* are dequeued with status = SUCCESS, or *any* with status = FAILURE? If at any
> >>>> time a failure is seen, does the app start the whole processing all over again, or just drop this burst?!
> >> [Ahmed] The app has the choice on how to proceed. If the issue is
> >> recoverable then the application can continue this stream from where it
> >> stopped. If the failure is unrecoverable then the application should
> >> first fix the problem and start from the beginning of the stream.
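> >> For illustration, that recoverable vs. unrecoverable choice could look like
> >> this (a sketch; the op field names are as proposed in this RFC, the helpers
> >> are app-defined):
> >>
> >>   if (op->status == RTE_COMP_OP_STATUS_OUT_OF_SPACE_RECOVERABLE) {
> >>           /* keep the stream state; skip what was already consumed */
> >>           op->src.offset += op->consumed;
> >>           op->src.length -= op->consumed;
> >>           op->m_dst = app_alloc_bigger_dst(); /* larger output buffer */
> >>           app_resubmit(op);
> >>   } else if (op->status != RTE_COMP_OP_STATUS_SUCCESS) {
> >>           /* unrecoverable: fix the problem, restart the whole stream */
> >>           app_restart_stream(stream);
> >>   }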
> >>>> If all of the above is true, then I think we should add another API such as
> >>>> rte_comp_enque_single_stream(), which will be functional under Feature Flag =
> >>>> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS, or is a better name
> >>>> SUPPORT_ENQUEUE_SINGLE_STREAM?!
> >> [Ahmed] The main advantage in async use is lost if we force all related
> >> ops to be in the same burst. If we do that, then we might as well merge
> >> all the ops into one op. That would reduce the overhead.
> >> The use mode I am proposing is only useful in cases where the data
> >> becomes available after the first enqueue occurred. I want to allow the
> >> caller to enqueue the second set of data as soon as it is available
> >> regardless of whether or not the HW has already started working on the
> >> first op inflight.
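> >> Concretely, the use mode I'm proposing is (sketch):
> >>
> >>   /* chunk 1 is available now */
> >>   rte_compressdev_enqueue_burst(dev, qp, &op1, 1);
> >>   /* chunk 2 arrives later, while op1 may still be in flight */
> >>   rte_compressdev_enqueue_burst(dev, qp, &op2, 1);
> >>   /* dequeue both whenever convenient; HW keeps per-stream order */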
> > [Shally] @ahmed, Ok.. seems I missed a point here. So, confirm the following for me:
> >
> > As per the current description in the doc, the expected stateful usage is:
> > enqueue (op1) --> dequeue(op1) --> enqueue(op2)
> >
> > but you're suggesting to allow an option to change it to
> >
> > enqueue(op1) -->enqueue(op2)
> >
> > i.e. multiple ops from the same stream can be put in flight via subsequent enqueue_burst() calls,
> > without waiting to dequeue previous ones, as the PMD supports it. So, no change to the current
> > definition of a burst. It will still carry multiple streams, with each op belonging to a different stream?!
> [Ahmed] Correct. I guess a user could put two ops on the same burst that
> belong to the same stream. In that case it would be more efficient to
> merge the ops using scatter gather. Nonetheless, I would not add checks
> in my implementation to limit that use. The hardware does not perceive a
> difference between ops that came on one burst and ops that came on two
> different bursts. To the hardware they are all ops. What matters is
> which stream each op belongs to.
> > If yes, then it seems your HW can be set up for multiple streams, so it is efficient for your case to
> > support it in the DPDK PMD layer; but our hw doesn't by default, and needs SW to back it. Given
> > that, I also suggest enabling it under some feature flag.
> >
> > However, it looks like an add-on, and if it doesn't change the current definition of a burst and the
> > minimum expectation set on stateful processing described in this document, then IMO you can
> > propose this feature as an incremental patch on the baseline version, in the absence of which,
> > applications will exercise stateful processing as described here (enq->deq->enq). Thoughts?
> [Ahmed] Makes sense. I was worried that there might be fundamental
> limitations to this mode of use in the API design. That is why I wanted
> to share this use mode with you guys and see if it can be accommodated
> using an incremental patch in the future.
> >>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one
> >>> stream, or how this makes a difference, as there can be many enqueue_burst() calls done before a
> >>> dequeue_burst().
> >>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all
> >>> the ops had been processed? This would turn it into a synchronous call, which isn't the intent.
> >> [Ahmed] Agreed, a blocking or even a buffering software layer that
> >> babysits the hardware does not fundamentally change the parameters of the
> >> system as a whole. It just moves workflow management complexity down
> >> into the DPDK software layer. Rather there are real latency and
> >> throughput advantages (because of caching) that I want to expose.
> >>
[Fiona] Ok, so I think we've agreed that this can be an option, as long as it is not required of
PMDs and is enabled under an explicit capability - named something like
ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS.
@Ahmed, we'll leave it up to you to define the details.
What's necessary is API text to describe the expected behaviour on any error conditions,
the pause/resume API, whether an API is expected to clean up if resume doesn't happen
and whether there's any time limit on this, etc.
But I wouldn't expect any changes to existing burst APIs, and all PMDs and applications
must be able to handle the default behaviour, i.e. with this capability disabled.
Specifically, even if a PMD has this capability, if an application ignores it and only sends
one op at a time, and the PMD returns OUT_OF_SPACE_RECOVERABLE, the stream should
not be in a paused state and the PMD should not wait for a resume() to handle the
next op sent for that stream.
Does that make sense?
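
For example, from the application side (a sketch; the capability and API names
used here are the ones under discussion above and are not final):

  struct rte_compressdev_info info;

  rte_compressdev_info_get(dev_id, &info);
  if (info.feature_flags & ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS) {
          /* may pipeline several ops of one stream, but must then
           * handle the pause/resume semantics on errors */
  } else {
          /* default: strictly enqueue(op) -> dequeue(op) -> enqueue(next)
           * per stream; no stream is ever left in a paused state */
  }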

> >> /// snip ///
> >
> >


