[dpdk-users] Run-to-completion or Pipe-line for QAT PMD in DPDK

Pathak, Pravin pravin.pathak at intel.com
Fri Jan 18 15:29:04 CET 2019

Hi Alex -

-----Original Message-----
From: users [mailto:users-bounces at dpdk.org] On Behalf Of Trahe, Fiona
Sent: Friday, January 18, 2019 8:14 AM
To: Changchun Zhang <changchun.zhang at oracle.com>; users at dpdk.org
Cc: Trahe, Fiona <fiona.trahe at intel.com>
Subject: Re: [dpdk-users] Run-to-completion or Pipe-line for QAT PMD in DPDK

Hi Alex,

> -----Original Message-----
> From: users [mailto:users-bounces at dpdk.org] On Behalf Of Changchun 
> Zhang
> Sent: Thursday, January 17, 2019 11:01 PM
> To: users at dpdk.org
> Subject: [dpdk-users] Run-to-completion or Pipe-line for QAT PMD in
> DPDK
>
> Hi,
> I have a user question on using the QAT device in DPDK.
> In a real design, after calling enqueue_burst() on the specified
> queue pair on one of the lcores, which of the following is usually done?
> 1.     Run-to-completion: call dequeue_burst() and wait for the device to finish the
> crypto operation,
> 2.     or pipeline: return right after enqueue_burst(), release the CPU,
> and call dequeue_burst() from another thread function?
> Option 1 is more synchronous and can be seen in all the DPDK
> crypto examples, while option 2 is asynchronous, and I have never seen it in any reference design, unless I missed something.

Option 2 is not possible with QAT - the dequeue must be called in the same thread as the enqueue. This is optimised without atomics for best performance - if this is a problem let us know.
However, the best performance does not come from option 1 either, i.e. not from a synchronous blocking method.
If you enqueue and then go straight to dequeue, you're not getting the best advantage from the cycles freed up by offloading.
It is best to enqueue a burst, then go do some other work, like maybe collecting more requests for the next enqueue or other processing, then dequeue. Take and process whatever ops are dequeued - this will not necessarily match the number you've enqueued - it depends on how quickly you call the dequeue.
Don't wait until all the enqueued ops are dequeued before enqueuing the next batch.
So it's asynchronous, but in the same thread.
You'll get the best throughput when you keep the input filled up, so the device always has operations to work on, and regularly dequeue a burst. Dequeuing too often will waste cycles on the overhead of calling the API; dequeuing too slowly will cause the device to back up. Ideally, tune for your application to find the sweet spot between these two extremes.
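
A minimal sketch of the loop described above (enqueue a burst, do other work, dequeue whatever is ready), all in one thread. It does not use DPDK: mock_enqueue_burst(), mock_dequeue_burst() and mock_device_progress() are hypothetical stand-ins for rte_cryptodev_enqueue_burst(), rte_cryptodev_dequeue_burst() and the hardware itself, and RING_SIZE, BURST and DEVICE_RATE are made-up numbers, not tuning advice.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE   64  /* hypothetical queue-pair depth */
#define BURST        8  /* hypothetical burst size */
#define DEVICE_RATE  5  /* ops the mock device finishes per loop pass */

/* Mock queue pair; head/done/tail are ever-increasing op counters. */
struct mock_qp {
    uint32_t ring[RING_SIZE];
    uint32_t head;  /* ops already dequeued */
    uint32_t done;  /* ops the mock device has finished */
    uint32_t tail;  /* ops enqueued so far */
};

/* Accepts as many ops as fit in the ring; may accept fewer than nb. */
static uint16_t mock_enqueue_burst(struct mock_qp *qp, const uint32_t *ops,
                                   uint16_t nb)
{
    uint16_t n = 0;
    while (n < nb && qp->tail - qp->head < RING_SIZE)
        qp->ring[qp->tail++ % RING_SIZE] = ops[n++];
    return n;
}

/* Returns only ops the device has finished; may return fewer than nb. */
static uint16_t mock_dequeue_burst(struct mock_qp *qp, uint32_t *ops,
                                   uint16_t nb)
{
    uint16_t n = 0;
    while (n < nb && qp->head < qp->done)
        ops[n++] = qp->ring[qp->head++ % RING_SIZE];
    return n;
}

/* Simulates the device finishing up to n of the pending ops. */
static void mock_device_progress(struct mock_qp *qp, uint32_t n)
{
    uint32_t pending = qp->tail - qp->done;
    qp->done += (n < pending) ? n : pending;
}

/* Pushes `total` ops through the mock device from a single thread.
 * Returns the number of ops dequeued; *passes counts loop iterations. */
uint32_t run_async_same_thread(uint32_t total, uint32_t *passes)
{
    struct mock_qp qp = {0};
    uint32_t submitted = 0, completed = 0;
    uint32_t in[BURST], out[BURST];

    while (completed < total) {
        /* 1. Enqueue a burst without waiting for earlier ops to finish.
         *    Ops the ring did not accept are simply offered again later. */
        uint32_t left = total - submitted;
        uint16_t want = (left < BURST) ? (uint16_t)left : BURST;
        for (uint16_t i = 0; i < want; i++)
            in[i] = submitted + i;
        submitted += mock_enqueue_burst(&qp, in, want);

        /* 2. "Other work" goes here in a real application; in this mock
         *    the device makes progress in the meantime. */
        mock_device_progress(&qp, DEVICE_RATE);
        (*passes)++;

        /* 3. Dequeue whatever is ready; the count need not match step 1. */
        uint16_t got = mock_dequeue_burst(&qp, out, BURST);
        for (uint16_t i = 0; i < got; i++)
            assert(out[i] == completed + i);  /* mock completes in order */
        completed += got;
    }
    return completed;
}
```

With these made-up rates, run_async_same_thread(100, &passes) enqueues 8 ops per pass while only 5 come back, so the dequeue count lags the enqueue count and the loop never blocks on any particular op. In a real application the burst size and the amount of work done in step 2 are the knobs to tune.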

I faced the exact same issue while moving from software crypto to HW, and I implemented the option Fiona suggested.
The thread enqueues to the crypto engine and goes back to other work. It periodically polls the crypto device to see if work is finished.
As we have a single thread running, it keeps enqueuing as work arrives and dequeuing as results become ready, doing other work in between.
To keep track of packets, I put an ID into the crypto operation's private data.
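
One possible shape for that ID tracking, as a sketch only. struct mock_op, struct pkt_ctx, pkt_table, tag_op() and complete_op() are hypothetical names for illustration, not DPDK types or the code actually used here. In DPDK the per-op private area is typically sized through the priv_size argument of rte_crypto_op_pool_create(); the mock below just uses a small byte array in its place.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PKTS 16  /* hypothetical limit on packets in flight */

/* Stand-in for a crypto op followed by its private data area. */
struct mock_op {
    int     status;
    uint8_t priv[8];  /* the application owns these bytes */
};

/* Per-packet state the application wants back after the crypto is done. */
struct pkt_ctx {
    uint32_t id;
    int      in_flight;
    int      finished;
};

static struct pkt_ctx pkt_table[MAX_PKTS];

/* Before enqueue: write the packet ID into the op's private data and
 * record the packet as in flight. */
void tag_op(struct mock_op *op, uint32_t pkt_id)
{
    memcpy(op->priv, &pkt_id, sizeof(pkt_id));
    pkt_table[pkt_id % MAX_PKTS] =
        (struct pkt_ctx){ .id = pkt_id, .in_flight = 1, .finished = 0 };
}

/* After dequeue: read the ID back and find the matching packet state.
 * Returns NULL if the ID does not match a packet that is in flight. */
struct pkt_ctx *complete_op(const struct mock_op *op)
{
    uint32_t pkt_id;
    memcpy(&pkt_id, op->priv, sizeof(pkt_id));

    struct pkt_ctx *ctx = &pkt_table[pkt_id % MAX_PKTS];
    if (!ctx->in_flight || ctx->id != pkt_id)
        return NULL;
    ctx->in_flight = 0;
    ctx->finished = 1;
    return ctx;
}
```

Because a dequeue burst may return ops from several earlier enqueue bursts, looking the packet up by the ID stored in the op avoids having to remember which ops went into which burst.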
