[dpdk-dev] [RFC] Accelerator API to chain packet processing functions

Doherty, Declan declan.doherty at intel.com
Thu Feb 13 12:31:02 CET 2020

On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle at intel.com> wrote:
>> Hi Jerin,
> Hi David,
>> Thanks for the comments. Please see replies below.
>> Kind Regards,
>> David
>>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle at intel.com> wrote:
>>>> Introduction
>>>> ============
>>>> This RFC introduces a new DPDK library, rte_accelerator.
>>>> The main aim of this library is to provide a flexible and extensible way of
>>>> combining one or more packet-processing functions into a single operation,
>>>> thereby allowing these to be performed in parallel in optimized software
>>>> libraries or in a hardware accelerator. These functions can include
>>>> cryptography, compression and CRC/checksum calculation, while others can
>>>> potentially be added in the future. Performing these functions in parallel as a
>>>> single operation can enable a significant performance improvement.
>>>> Background
>>>> ==========
>>>> There are a number of byte-wise operations which are present and
>>>> common across many access network data-plane pipelines, such as Cipher,
>>>> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
>>>> prototyping has been done at Intel in relation to the 01.org
>>>> access-network-dataplanes project to prove that a significant performance
>>>> improvement is possible when such byte-wise operations are combined into a
>>>> single pass of packet data processing. This performance boost has been
>>>> prototyped for both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
>>> Could you share the relative performance numbers to show the gain?
>> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
>> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
>> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
>> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
>> 2) all packet processing functions are done as a single-pass operation in AESNI MB through rte_cryptodev
>> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
>>          - XGS-PON MAC (Crypto-CRC-BIP):
>>                  - 3 independent stages: 1429 cycles/buf (13.75Gbps)
>>                  - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
>>                  37% cycle reduction
>>          - DOCSIS MAC (Crypto-CRC):
>>                  - 2 independent stages: 1421 cycles/buf (13.84Gbps)
>>                  - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
>>                  20% cycle reduction
>> Adding the accelerator API will allow vendors to gain the benefits of these cycle savings.
> Numbers make sense. I have seen a similar performance improvement
> doing in one pass with CPU instructions.
>>>> - XGS-PON MAC: Crypto-CRC-BIP
>>>>          - Order:
>>>>                  - Downstream: CRC, Encrypt, BIP
>>> I understand if the chain has two operations then it may be possible to have
>>> handcrafted SW code to do both operations in one pass.
>>> I understand the spec is agnostic on the number of passes required to
>>> enable the xform, but to understand the SW/HW capability: in the above
>>> case, "CRC, Encrypt, BIP", is it done in one pass in SW, in three passes in SW,
>>> or in one pass using HW?
>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
>> However, this could also be performed as a single pass in a HW accelerator
> As a specification, cascading the xform chains makes sense.
> Do we have any HW that supports chaining more than "two" xforms
> in one pass?
> i.e. a real chaining function where two HW blocks work hand in hand
> for chaining.
> If none, it may be better to abstract it as a synchronous API (no dequeue, no
> enqueue) for the CPU use case.

Were you thinking along the lines of a synchronous API option like that
just introduced to cryptodev? i.e. something like

uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
				 struct rte_accelerator_op ops[],
				 uint16_t nb_ops);
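
To make the question a bit more concrete, here is a rough sketch of how
such a synchronous call might behave for the CPU case. Only the function
prototype comes from the text above; the struct layouts, the status field,
the choice of CRC-32 and the byte-wise XOR used as a stand-in for BIP are
assumptions made purely for illustration, and the cipher stage is left out
to keep it short. It is not a proposed implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context layout: an assumption for this sketch only. */
struct rte_accelerator_ctx {
	uint32_t crc_init;	/* initial CRC-32 register value */
};

/* Hypothetical op layout: one input buffer plus result fields. */
struct rte_accelerator_op {
	const uint8_t *data;	/* input buffer */
	size_t len;		/* input length in bytes */
	uint32_t crc;		/* out: CRC-32 of the buffer */
	uint8_t bip;		/* out: XOR of all bytes (BIP stand-in) */
	int status;		/* out: 0 on success, -1 on a bad op */
};

/*
 * Synchronous processing: no enqueue/dequeue, results are valid on return.
 * Returns the number of ops processed; stops at the first invalid op.
 */
uint16_t
rte_accelerator_process(struct rte_accelerator_ctx *ctx,
			struct rte_accelerator_op ops[],
			uint16_t nb_ops)
{
	uint16_t i;

	assert(ctx != NULL);

	for (i = 0; i < nb_ops; i++) {
		struct rte_accelerator_op *op = &ops[i];
		uint32_t crc = ctx->crc_init;
		uint8_t bip = 0;
		size_t n;
		int bit;

		if (op->data == NULL && op->len != 0) {
			op->status = -1;
			break;
		}

		/* Single pass: each byte is read once and feeds both results. */
		for (n = 0; n < op->len; n++) {
			uint8_t byte = op->data[n];

			bip ^= byte;
			crc ^= byte;
			/* Bitwise reflected CRC-32, polynomial 0xEDB88320. */
			for (bit = 0; bit < 8; bit++)
				crc = (crc >> 1) ^
				      (0xEDB88320u & (0u - (crc & 1u)));
		}

		op->crc = ~crc;
		op->bip = bip;
		op->status = 0;
	}

	return i;
}
```

With this shape the caller would fill an array of ops, make one call, and
read the results straight back from the same array, with the return value
saying how many ops completed.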
