[dpdk-dev] [RFC v1] doc compression API for DPDK

Trahe, Fiona fiona.trahe at intel.com
Fri Dec 15 18:40:49 CET 2017

Hi Shally,

> -----Original Message-----
> From: Verma, Shally [mailto:Shally.Verma at cavium.com]
> Sent: Thursday, December 7, 2017 5:43 AM
> To: Trahe, Fiona <fiona.trahe at intel.com>; dev at dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya at cavium.com>; Challa, Mahipal
> <Mahipal.Challa at cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch at intel.com>; Gupta, Ashish
> <Ashish.Gupta at cavium.com>; Sahu, Sunila <Sunila.Sahu at cavium.com>
> Subject: RE: [RFC v1] doc compression API for DPDK


> > > > > Please note any time the output buffer runs out of space during a write, the
> > > > > operation will turn “Stateful”. See
> > > > > more on Stateful under the respective section.
> > > > [Fiona] Let's come back to this later. An alternative is that
> > > > OUT_OF_SPACE is returned and the application
> > > > must treat it as a fail and resubmit the operation with a larger destination
> > > > buffer.
> > >
> > > [Shally] Then I propose to add a feature flag
> > "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible
> > > PMD design,
> > > as there are devices which treat it as an error on compression but not on
> > decompression.
> > > If it is not supported, then it should be treated as a failure condition and the app
> > can resubmit the operation.
> > > If supported, behaviour *To-be-Defined* under stateful.
> > [Fiona] Can you explain 'turn stateful' some more?
> > If compressor runs out of space during stateless operation, either comp or
> > decomp, and turns stateful, how would the app know? And what would be in
> > status, consumed and produced?
> > Could it return OUT_OF_SPACE, and if both consumed and produced == 0
> [Shally] If consumed = produced == 0, then it's not OUT_OF_SPACE condition.
> > then the whole op must be resubmitted with a bigger output buffer. But if
> > consumed and produced > 0 then app could take the output and submit next
> > op
> > continuing from consumed+1.
> >
> [Shally] consumed and produced will *always* be > 0 in case of OUT_OF_SPACE.
> OUT_OF_SPACE means the output buffer was exhausted while writing data into it and the PMD may have more to
> write to it. So in such a case, the PMD should set
> produced = complete length of output buffer
> status = OUT_OF_SPACE
> For consumed, there are two possibilities:
> 1. consumed = complete length of src mbuf, meaning the PMD has read the full input, OR
> 2. consumed = partial length of src mbuf, meaning the PMD has read partial input.
> On seeing this status, the app should consume the output and re-enqueue the same op with an empty output buffer and
> src = consumed+1.
[Fiona] As this was a stateless op, the PMD cannot be expected to have stored the history and state and so
cannot be expected to continue from consumed+1. This would be stateful behaviour.
But it seems you are saying that even in this stateless case you'd like the PMDs which can store state
to have the option of converting to stateful. So
a PMD which can support this could return OUT_OF_SPACE with produced/consumed as you describe above.
a PMD which can't support it should return an error.
The appl can continue on from consumed+1 in the former case and resubmit the full request
with a bigger buffer in the latter case.
Is this the behaviour you're looking for?
If so, the error could be something like NEED_BIGGER_DST_BUF?
However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same information on the API?
It may correspond to an error on the underlying PMD, but would it be simpler on the compressdev API?

> Please note as per current proposal, app should call rte_compdev_enqueue_stream() version of API if it
> doesn't know output size beforehand.
[Fiona] True. But above is only trying to describe behaviour in the stateless error case.


> > > > >
> > > > > D.2.1.2 Stateful operation state maintenance
> > > > >  -------------------------------------------------------------
> > > > > This section starts with description of our understanding about
> > > > compression API support for stateful.
> > > > > Depending upon understanding build upon these concepts, we will
> > identify
> > > > required data structure/param
> > > > > to maintain in-progress operation context by PMD.
> > > > >
> > > > > For stateful compression, a batch of dependent packets starts at a packet
> > > > > having RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having
> > > > > full/final flush value,
> > > > > i.e. the array of operations will carry a structure like this:
> > > > >
> > > > > ------------------------------------------------------------------------------------
> > > > > |op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
> > > > > ------------------------------------------------------------------------------------
> > > > >
> > [Fiona] I think it needs to be more constrained than your examples below.
> > Only 1 operation from a stream can be in a burst, as each operation
> > in a stateful stream must complete before the next can be processed,
> > since the next operation needs the state and history
> > of the previous operation.
> > And if one failed, e.g. due to OUT_OF_SPACE, this should affect
> > the following operation in the same stream.
> > Worst case this means bursts of 1. Burst can be >1 if there are multiple
> > independent streams with available data for processing. Or if there is
> > data available which can be statelessly processed.
> >
> > If there are multiple buffers available from a stream , then instead they can
> > be linked together in an mbuf chain sent in a single operation.
> >
> > To handle the sequences below would mean the PMD
> > would need to store ops sending one at a time to be processed.
> >
> > As this is significantly different from what you describe below, I'll wait for
> > further feedback
> > before continuing.
> [Shally] I concur with your thoughts. And these are not significantly different from the concept
> presented below.
> Yes, as you mentioned, even for burst_size > 1 the PMD will have to serialize each op internally, i.e.
> it has to wait for the previous to finish before putting the next up for processing, which is
> as good as the application making a serialised call passing one op at-a-time, or, if the
> stream consists of multiple buffers, making their scatter-gather list and
> then enqueuing it as one op at a time, which is more efficient and the ideal usage.
> However, in order to allow extensibility, I didn't mention a limitation on burst_size.
> Because if a PMD doesn't support burst_size > 1 it can always return nb_enqueued = 1, in which case
> the app can enqueue the next, with the condition that it should wait for the previous to complete
> before making the next enqueue call.
> So, if we take a simple example to compress 2k of data with src mbuf size = 1k,
> then with burst_size=1, the expected call flow would be (this is just one flow; other variations are also possible,
> such as making a chain of 1k buffers and passing the whole data in one go):
> 1. fill 1st 1k chunk of data in op.msrc
> 2. enqueue_stream(..., |op.flush = no_flush|, 1, ptr_stream);
> 3. dequeue_burst(|op|, 1);
> 4. refill next 1k chunk in op.msrc
> 5. enqueue_stream(..., |op.flush = full_flush|, 1, ptr_stream);
> 6. dequeue_burst(|op|, 1);
> 7. end
> So, I don’t see much of a change in the API call flow from here to the design presented below, except nb_ops = 1 in
> each call.
> However, I am assuming that the op structure would still be the same for stateful processing, i.e. it would start with
> op.flush value = NO/SYNC_FLUSH and end at an op with flush value = FULL_FLUSH.
> Are we on same page here?
> Thanks
> Shally

[Fiona] We still have a different understanding of the stateful flow needed on the API.
I’ll try to clarify and maybe we can set up a meeting to discuss. 
My assumptions first:
•	Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
•	Ops from many streams can be enqueued on same qp.
•	Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent. 
•	Stateless and stateful ops can be enqueued on the same qp

Submitting a burst of stateless ops to a qp is no problem.
Submitting more than 1 op at a time from the same stateful stream to a qp is a problem.
Appl submits 2 ops in same stream in a burst, each has src and dest mbufs, input length/offset and
requires checksum to be calculated.
The first op must be processed to completion before the second can be started as it needs the history and the checksum so far.
If each dest mbuf is big enough so no overflow, each dest mbuf will be partially filled. This is probably not
what’s desired, and will force an extra copy to make the output data contiguous.
If the dest mbuf in the first op is too small, then does the PMD alloc more memory in the dest mbuf?
Or alloc another mbuf? Or fail and the whole burst must be resubmitted?
Or store the 2nd op, wait, on seeing the OUT_OF_SPACE on the 1st op, overwrite the src, dest, len etc of the 2nd op
to include the unprocessed part of the 1st op?  
In the meantime, are all other ops on the qp blocked behind these?
For hw accelerators it’s worse, as the PMD would normally return once ops are offloaded and the dequeue would
pass processed ops straight back to the appl. Instead, the enqueue would need to kick off a thread to
dequeue ops and filter to find the stateful one, storing the others until the next application dequeue is called.

Above scenarios don’t lend themselves to accelerating a packet processing workload.
It pushes a workload down to all PMDs which I believe belongs above this API as 
that work is not about offloading the compute intensive compression work but
about the sequencing of data and so is better coded once, above the API in an application layer
common to all PMDs. (See Note1 in http://dpdk.org/ml/archives/dev/2017-October/078944.html )
If an application has several packets with data from a stream that it needs to (de)compress statefully,
what it probably wants is for the output data to fill each output buffer completely before writing to the next buffer.
Chaining the src mbufs in these pkts into one chain and sending as one op allows the output
data to be packed into a dest mbuf or mbuf chain.   
I think what’s needed is a layer above the API to accumulate incoming packets while waiting for the
previous set of packets to be compressed. Forwarding to the PMD to queue there is not the right place
to buffer them as the queue should be per stream rather than on the accelerator engine’s queue
which has lots of other independent packets.

• Ops from a qp may be fanned out to available hw or sw engines and
    processed in parallel, so each op must be independent. 
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
• Stateless and stateful ops can be enqueued on the same qp
• Stateless and stateful ops can be enqueued in the same burst
• Only 1 op at a time may be enqueued to the qp from any stateful stream. 
• A burst can have multiple stateful ops, but each must be from a different stream.
• All ops will have a session attached – this will only contain immutable data which
   can be used by many ops, devices and or drivers at the same time.
• All stateful ops will have a stream attached for maintaining state and
   history, this can only be used by one op at a time.

Code artefacts:

enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};

Add following to rte_comp_op:
    enum rte_comp_op_type op_type;
    void *stream_private;
    /* location where PMD maintains stream state – only required if op_type is STATEFUL, else set to NULL */

As size of stream data will vary depending on PMD, each PMD or device should allocate & manage its own mempool. Associated APIs are:
rte_comp_stream_create(uint8_t dev_id, rte_comp_session *sess, void ** stream);
/* This should alloc a stream from the device’s mempool and initialise it. This handle will be passed to the PMD with every op in the stream. Q. Should  qp_id also be added, with constraint that all ops in the same stream should be sent to the same qp?  */
rte_comp_stream_free(uint8_t dev_id, void * stream);
/* This should clear the stream and return it to the device’s mempool */

All ops are enqueued/dequeued to device & qp using same rte_compressdev_enqueue_burst()/…dequeue_burst;

Re flush flags, a stateful stream would start with op.flush = NONE or SYNC and end with FULL or FINAL.
STATELESS ops would just use either FULL or FINAL.

Let me know if you want to set up a meeting - it might be a more effective way to
arrive at an API that works for all PMDs.

I'll send out a v3 today with above plus updates based on all the other feedback.

