[dpdk-dev] [RFC v2] doc compression API for DPDK
    Ahmed Mansour 
    ahmed.mansour at nxp.com
       
    Wed Feb 21 20:35:35 CET 2018
    
    
  
On 2/21/2018 9:35 AM, Trahe, Fiona wrote:
> Hi Ahmed, Shally,
>
>
>> -----Original Message-----
>> From: Ahmed Mansour [mailto:ahmed.mansour at nxp.com]
>> Sent: Tuesday, February 20, 2018 7:56 PM
>> To: Verma, Shally <Shally.Verma at cavium.com>; Trahe, Fiona <fiona.trahe at intel.com>; dev at dpdk.org
>> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya at cavium.com>; Gupta, Ashish
>> <Ashish.Gupta at cavium.com>; Sahu, Sunila <Sunila.Sahu at cavium.com>; De Lara Guarch, Pablo
>> <pablo.de.lara.guarch at intel.com>; Challa, Mahipal <Mahipal.Challa at cavium.com>; Jain, Deepak K
>> <deepak.k.jain at intel.com>; Hemant Agrawal <hemant.agrawal at nxp.com>; Roy Pledge
>> <roy.pledge at nxp.com>; Youri Querry <youri.querry_1 at nxp.com>
>> Subject: Re: [RFC v2] doc compression API for DPDK
>>
>> /// snip ///
>>>>>>>>>>>>>>>>> D.2.1 Stateful operation state maintenance
>>>>>>>>>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>>>>>>>>> It is always an ideal expectation from application that it should parse
>>>>>>>>>>>>>>>> through all related chunk of source data making its mbuf-chain and
>>>>>>>>>>>>>> enqueue
>>>>>>>>>>>>>>>> it for stateless processing.
>>>>>>>>>>>>>>>>> However, if it need to break it into several enqueue_burst() calls, then
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>> expected call flow would be something like:
>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>> [Ahmed] The work is now in flight to the PMD.The user will call dequeue
>>>>>>>>>>>>>>>> burst in a loop until all ops are received. Is this correct?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>> [Shally] Yes. Ideally every submitted op need to be dequeued. However
>>>>>>>>>>>>>> this illustration is specifically in
>>>>>>>>>>>>>>> context of stateful op processing to reflect if a stream is broken into
>>>>>>>>>>>>>> chunks, then each chunk should be
>>>>>>>>>>>>>>> submitted as one op at-a-time with type = STATEFUL and need to be
>>>>>>>>>>>>>> dequeued first before next chunk is
>>>>>>>>>>>>>>> enqueued.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> enqueue_burst( |op.no_flush |)
>>>>>>>>>>>>>>>>> deque_burst(op) // should dequeue before we enqueue next
>>>>>>>>>>>>>>>>> enqueue_burst( |op.full_flush |)
>>>>>>>>>>>>>>>> [Ahmed] Why now allow multiple work items in flight? I understand that
>>>>>>>>>>>>>>>> occasionaly there will be OUT_OF_SPACE exception. Can we just
>>>>>>>>>>>>>> distinguish
>>>>>>>>>>>>>>>> the response in exception cases?
>>>>>>>>>>>>>>> [Shally] Multiples ops are allowed in flight, however condition is each op in
>>>>>>>>>>>>>> such case is independent of
>>>>>>>>>>>>>>> each other i.e. belong to different streams altogether.
>>>>>>>>>>>>>>> Earlier (as part of RFC v1 doc) we did consider the proposal to process all
>>>>>>>>>>>>>> related chunks of data in single
>>>>>>>>>>>>>>> burst by passing them as ops array but later found that as not-so-useful for
>>>>>>>>>>>>>> PMD handling for various
>>>>>>>>>>>>>>> reasons. You may please refer to RFC v1 doc review comments for same.
>>>>>>>>>>>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
>>>>>>>>>>>>>> time, since each needs the
>>>>>>>>>>>>>> state of the previous, to allow more than 1 op to be in-flight at a time would
>>>>>>>>>>>>>> force PMDs to implement internal queueing and exception handling for
>>>>>>>>>>>>>> OUT_OF_SPACE conditions you mention.
>>>>>>>>>>>> [Ahmed] But we are putting the ops on qps which would make them
>>>>>>>>>>>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
>>>>>>>>>>>> complex but doable.
>>>>>>>>>>> [Fiona] In my opinion this is not doable, could be very inefficient.
>>>>>>>>>>> There may be many streams.
>>>>>>>>>>> The PMD would have to have an internal queue per stream so
>>>>>>>>>>> it could adjust the next src offset and length in the OUT_OF_SPACE case.
>>>>>>>>>>> And this may ripple back though all subsequent ops in the stream as each
>>>>>>>>>>> source len is increased and its dst buffer is not big enough.
>>>>>>>>>> [Ahmed] Regarding multi op OUT_OF_SPACE handling.
>>>>>>>>>> The caller would still need to adjust
>>>>>>>>>> the src length/output buffer as you say. The PMD cannot handle
>>>>>>>>>> OUT_OF_SPACE internally.
>>>>>>>>>> After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
>>>>>>>>>> until it gets explicit
>>>>>>>>>> confirmation from the caller to continue working on this stream. Any ops
>>>>>>>>>> received by
>>>>>>>>>> the PMD should be returned to the caller with status STREAM_PAUSED since
>>>>>>>>>> the caller did not
>>>>>>>>>> explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
>>>>>>>>>> These semantics can be enabled by adding a new function to the API
>>>>>>>>>> perhaps stream_resume().
>>>>>>>>>> This allows the caller to indicate that it acknowledges that it has seen
>>>>>>>>>> the issue and this op
>>>>>>>>>> should be used to resolve the issue. Implementations that do not support
>>>>>>>>>> this mode of use
>>>>>>>>>> can push back immediately after one op is in flight. Implementations
>>>>>>>>>> that support this use
>>>>>>>>>> mode can allow many ops from the same session
>>>>>>>>>>
>>>>>>>>> [Shally] Is it still in context of having single burst where all op belongs to one stream? If yes, I
>> would
>>>>>> still
>>>>>>>>> say it would add an overhead to PMDs especially if it is expected to work closer to HW (which I
>> think
>>>>>> is
>>>>>>>>> the case with DPDK PMD).
>>>>>>>>> Though your approach is doable but why this all cannot be in a layer above PMD? i.e. a layer
>> above
>>>>>> PMD
>>>>>>>>> can either pass one-op at a time with burst size = 1 OR can make chained mbuf of input and
>> output
>>>>>> and
>>>>>>>>> pass than as one op.
>>>>>>>>> Is it just to ease applications of chained mbuf burden or do you see any performance /use-case
>>>>>>>>> impacting aspect also?
>>>>>>>>>
>>>>>>>>> if it is in context where each op belong to different stream in a burst, then why do we need
>>>>>>>>> stream_pause and resume? It is a expectations from app to pass more output buffer with
>> consumed
>>>>>> + 1
>>>>>>>>> from next call onwards as it has already
>>>>>>>>> seen OUT_OF_SPACE.
>>>>>>> [Ahmed] Yes, this would add extra overhead to the PMD. Our PMD
>>>>>>> implementation rejects all ops that belong to a stream that has entered
>>>>>>> "RECOVERABLE" state for one reason or another. The caller must
>>>>>>> acknowledge explicitly that it has received news of the problem before
>>>>>>> the PMD allows this stream to exit "RECOVERABLE" state. I agree with you
>>>>>>> that implementing this functionality in the software layer above the PMD
>>>>>>> is a bad idea since the latency reductions are lost.
>>>>>> [Shally] Just reiterating, I rather meant other way around i.e. I see it easier to put all such complexity
>> in a
>>>>>> layer above PMD.
>>>>>>
>>>>>>> This setup is useful in latency sensitive applications where the latency
>>>>>>> of buffering multiple ops into one op is significant. We found latency
>>>>>>> makes a significant difference in search applications where the PMD
>>>>>>> competes with software decompression.
>>>>> [Fiona] I see, so when all goes well, you get best-case latency, but when
>>>>> out-of-space occurs latency will probably be worse.
>>>> [Ahmed] This is exactly right. This use mode assumes out-of-space is a
>>>> rare occurrence. Recovering from it should take similar time to
>>>> synchronous implementations. The caller gets OUT_OF_SPACE_RECOVERABLE in
>>>> both sync and async use. The caller can fix up the op and send it back
>>>> to the PMD to continue work just as would be done in sync. Nonetheless,
>>>> the added complexity is not justifiable if out-of-space is very common
>>>> since the recoverable state will be the limiting factor that forces
>>>> synchronicity.
>>>>>>>> [Fiona] I still have concerns with this and would not want to support in our PMD.
>>>>>>>> TO make sure I understand, you want to send a burst of ops, with several from same stream.
>>>>>>>> If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any
>>>>>>>> subsequent ops in that stream.
>>>>>>>> Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
>>>>>>>> Or somehow drop them? How?
>>>>>>>> While still processing ops form other streams.
>>>>>>> [Ahmed] This is exactly correct. It should return them with
>>>>>>> NOT_PROCESSED. Yes, the PMD should continue processing other streams.
>>>>>>>> As we want to offload each op to hardware with as little CPU processing as possible we
>>>>>>>> would not want to open up each op to see which stream it's attached to and
>>>>>>>> make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without
>> processing.
>>>>>>> [Ahmed] I think I might have missed your point here, but I will try to
>>>>>>> answer. There is no need to "cushion" ops in DPDK. DPDK should send ops
>>>>>>> to the PMD and the PMD should reject until stream_continue() is called.
>>>>>>> The next op to be sent by the user will have a special marker in it to
>>>>>>> inform the PMD to continue working on this stream. Alternatively the
>>>>>>> DPDK layer can be made "smarter" to fail during the enqueue by checking
>>>>>>> the stream and its state, but like you say this adds additional CPU
>>>>>>> overhead during the enqueue.
>>>>>>> I am curious. In a simple synchronous use case. How do we prevent users
>>>>>> >from putting multiple ops in flight that belong to a single stream? Do
>>>>>>> we just currently say it is undefined behavior? Otherwise we would have
>>>>>>> to check the stream and incur the CPU overhead.
>>>>> [Fiona] We don't do anything to prevent it. It's undefined. IMO on data path in
>>>>> DPDK model we expect good behaviour and don't have to error check for things like this.
>>>> [Ahmed] This makes sense. We also assume good behavior.
>>>>> In our PMD if we got a burst of 20 ops, we allocate 20 spaces on the hw q, then
>>>>> build and send those messages. If we found an op from a stream which already
>>>>> had one inflight, we'd have to hold that back, store in a sw stream-specific holding queue,
>>>>> only send 19 to hw. We cannot send multiple ops from same stream to
>>>>> the hw as it fans them out and does them in parallel.
>>>>> Once the enqueue_burst() returns, there is no processing
>>>>> context which would spot that the first has completed
>>>>> and send the next op to the hw. On a dequeue_burst() we would spot this,
>>>>> in that context could process the next op in the stream.
>>>>> On out of space, instead of processing the next op we would have to transfer
>>>>> all unprocessed ops from the stream to the dequeue result.
>>>>> Some parts of this are doable, but seems likely to add a lot more latency,
>>>>> we'd need to add extra threads and timers to move ops from the sw
>>>>> queue to the hw q to get any benefit, and these constructs would add
>>>>> context switching and CPU cycles. So we prefer to push this responsibility
>>>>> to above the API and it can achieve similar.
>>>> [Ahmed] I see what you mean. Our workflow is almost exactly the same
>>>> with our hardware, but the fanning out is done by the hardware based on
>>>> the stream and ops that belong to the same stream are never allowed to
>>>> go out of order. Otherwise the data would be corrupted. Likewise the
>>>> hardware is responsible for checking the state of the stream and
>>>> returning frames as NOT_PROCESSED to the software
>>>>>>>> Maybe we could add a capability if this behaviour is important for you?
>>>>>>>> e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
>>>>>>>> Our PMD would set this to 0. And expect no more than one op from a stateful stream
>>>>>>>> to be in flight at any time.
>>>>>>> [Ahmed] That makes sense. This way the different DPDK implementations do
>>>>>>> not have to add extra checking for unsupported cases.
>>>>>> [Shally] @ahmed, If I summarise your use-case, this is how to want to PMD to support?
>>>>>> - a burst *carry only one stream* and all ops then assumed to be belong to that stream? (please
>> note,
>>>>>> here burst is not carrying more than one stream)
>>>> [Ahmed] No. In this use case the caller sets up an op and enqueues a
>>>> single op. Then before the response comes back from the PMD the caller
>>>> enqueues a second op on the same stream.
>>>>>> -PMD will submit one op at a time to HW?
>>>> [Ahmed] I misunderstood what PMD means. I used it throughout to mean the
>>>> HW. I used DPDK to mean the software implementation that talks to the
>>>> hardware.
>>>> The software will submit all ops immediately. The hardware has to figure
>>>> out what to do with the ops depending on what stream they belong to.
>>>>>> -if processed successfully, push it back to completion queue with status = SUCCESS. If failed or run to
>>>>>> into OUT_OF_SPACE, then push it to completion queue with status = FAILURE/
>>>>>> OUT_OF_SPACE_RECOVERABLE and rest with status = NOT_PROCESSED and return with enqueue
>> count
>>>>>> = total # of ops submitted originally with burst?
>>>> [Ahmed] This is exactly what I had in mind. all ops will be submitted to
>>>> the HW. The HW will put all of them on the completion queue with the
>>>> correct status exactly as you say.
>>>>>> -app assumes all have been enqueued, so it go and dequeue all ops
>>>>>> -on seeing an op with OUT_OF_SPACE_RECOVERABLE, app resubmit a burst of ops with call to
>>>>>> stream_continue/resume API starting from op which encountered OUT_OF_SPACE and others as
>>>>>> NOT_PROCESSED with updated input and output buffer?
>>>> [Ahmed] Correct this is what we do today in our proprietary API.
>>>>>> -repeat until *all* are dequeued with status = SUCCESS or *any* with status = FAILURE? If anytime
>>>>>> failure is seen, then app start whole processing all over again or just drop this burst?!
>>>> [Ahmed] The app has the choice on how to proceed. If the issue is
>>>> recoverable then the application can continue this stream from where it
>>>> stopped. if the failure is unrecoverable then the application should
>>>> first fix the problem and start from the beginning of the stream.
>>>>>> If all of above is true, then I think we should add another API such as
>> rte_comp_enque_single_stream()
>>>>>> which will be functional under Feature Flag = ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS or better
>>>>>> name is SUPPORT_ENQUEUE_SINGLE_STREAM?!
>>>> [Ahmed] The main advantage in async use is lost if we force all related
>>>> ops to be in the same burst. if we do that, then we might as well merge
>>>> all the ops into one op. That would reduce the overhead.
>>>> The use mode I am proposing is only useful in cases where the data
>>>> becomes available after the first enqueue occurred. I want to allow the
>>>> caller to enqueue the second set of data as soon as it is available
>>>> regardless of whether or not the HW has already started working on the
>>>> first op inflight.
>>> [Shally] @ahmed,  Ok.. seems I missed a point here. So, confirm me following:
>>>
>>> As per current description in doc, expected stateful usage is:
>>> enqueue (op1) --> dequeue(op1) --> enqueue(op2)
>>>
>>> but you're suggesting to allow an option to change it to
>>>
>>> enqueue(op1) -->enqueue(op2)
>>>
>>> i.e.  multiple ops from same stream can be put in-flight via subsequent enqueue_burst() calls without
>> waiting to dequeue previous ones as PMD support it . So, no change to current definition of a burst. It will
>> still carry multiple streams where each op belonging to different stream ?!
>> [Ahmed] Correct. I guess a user could put two ops on the same burst that
>> belong to the same stream. In that case it would be more efficient to
>> merge the ops using scatter gather. Nonetheless, I would not add checks
>> in my implementation to limit that use. The hardware does not perceive a
>> difference between ops that came on one burst and ops that came on two
>> different bursts. to the hardware they are all ops. What matters is
>> which stream each op belongs to.
>>> if yes, then seems your HW can be setup for multiple streams so it is efficient for your case to support it
>> in DPDK PMD layer but our hw doesn't by-default and need SW to back it. Given that, I also suggest to
>> enable it under some feature flag.
>>> However it looks like an add-on and if it doesn't change current definition of a burst and minimum
>> expectation set on stateful processing described in this document, then IMO, you can propose this feature
>> as an incremental patch on baseline version, in absence of which,
>>> application will exercise stateful processing as described here (enq->deq->enq). Thoughts?
>> [Ahmed] Makes sense. I was worried that there might be fundamental
>> limitations to this mode of use in the API design. That is why I wanted
>> to share this use mode with you guys and see if it can be accommodated
>> using an incremental patch in the future.
>>>>> [Fiona] Am curious about Ahmed's response to this. I didn't get that a burst should carry only one
>> stream
>>>>> Or get how this makes a difference? As there can be many enqueue_burst() calls done before an
>> dequeue_burst()
>>>>> Maybe you're thinking the enqueue_burst() would be a blocking call that would not return until all the
>> ops
>>>>> had been processed? This would turn it into a synchronous call which isn't the intent.
>>>> [Ahmed] Agreed, a blocking or even a buffering software layer that baby
>>>> sits the hardware does not fundamentally change the parameters of the
>>>> system as a whole. It just moves workflow management complexity down
>>>> into the DPDK software layer. Rather there are real latency and
>>>> throughput advantages (because of caching) that I want to expose.
>>>>
> [Fiona] ok, so I think we've agreed that this can be an option, as long as not required of
> PMDs and enabled under an explicit capability - named something like
> ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS
> @Ahmed, we'll leave it up to you to define details.
> What's necessary is API text to describe the expected behaviour on any error conditions,
> the pause/resume API, whether an API is expected to clean up if resume doesn't happen
> and if there's any time limit on this, etc
> But I wouldn't expect any changes to existing burst APIs, and all PMDs and applications
> must be able to handle the default behaviour, i.e. with this capability disabled.
> Specifically even if a PMD has this capability, if an application ignores it and only sends
> one op at a time, if a PMD returns OUT_OF_SPACE_RECOVERABLE the stream should
> not be in a paused state and the PMD should not wait for a resume() to handle the 
> next op sent for that stream.
> Does that make sense?
[Ahmed] That make sense. When this mode is enabled then additional
functions must be called to resume the work, even if only one op was in
flight. When this mode is not enabled then the PMD assumes that the
caller will never enqueue a stateful op before receiving a response to
the one that precedes it in a stream
>
>>>> /// snip ///
>>>
>
    
    
More information about the dev
mailing list