[dpdk-dev] [RFC] Accelerator API to chain packet processing functions

Jerin Jacob jerinjacobk at gmail.com
Thu Feb 6 11:54:54 CET 2020


On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle at intel.com> wrote:
>
> Hi Jerin,

Hi David,

> Thanks for the comments. Please see replies below.
>
> Kind Regards,
> David
>
> > On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle at intel.com> wrote:
> > >
> > > Introduction
> > > ============
> > >
> > > This RFC introduces a new DPDK library, rte_accelerator.
> > >
> > > The main aim of this library is to provide a flexible and extensible way of
> > combining one or more packet-processing functions into a single operation,
> > thereby allowing these to be performed in parallel in optimized software
> > libraries or in a hardware accelerator. These functions can include
> > cryptography, compression and CRC/checksum calculation, while others can
> > potentially be added in the future. Performing these functions in parallel as a
> > single operation can enable a significant performance improvement.
> > >
> > >
> > > Background
> > > ==========
> > >
> > > There are a number of byte-wise operations which are present and
> > common across many access network data-plane pipelines, such as Cipher,
> > Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> > prototyping has been done at Intel in relation to the 01.org access-network-
> > dataplanes project to prove that a significant performance improvement is
> > possible when such byte-wise operations are combined into a single pass of
> > packet data processing. This performance boost has been prototyped for
> > both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> >
> >
> > Could you share the relative performance numbers to show the gain?
>
> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
>
> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC,  AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
>
> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
>         - XGS-PON MAC (Crypto-CRC-BIP):
>                 - 3 independent stages: 1429 cycles/buf (13.75Gbps)
>                 - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
>                 37% cycle reduction
>
>         - DOCSIS MAC (Crypto-CRC):
>                 - 2 independent stages: 1421 cycles/buf (13.84Gbps)
>                 - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
>                 20% cycle reduction
>
> Adding the accelerator API will allow vendors gain the benefits of these cycle savings

Numbers make sense. I have seen a similar performance improvement
doing in one pass with CPU instructions.


> > > - XGS-PON MAC: Crypto-CRC-BIP
> > >         - Order:
> > >                 - Downstream: CRC, Encrypt, BIP
> >
> > I understand if the chain has two operations then it may possible to have
> > handcrafted SW code to do both operations in one pass.
> > I understand the spec is agnostic on a number of passes it does require to
> > enable the xfrom but To understand the SW/HW capability, In the above
> > case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> > or one pass using HW?
>
> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
> However, this could also be performed as a single pass in a HW accelerator

As a specification, cascading the xform chains make sense.
Do we have any HW that does support chaining the xforms more than
"two" in one pass?
i.e real chaining function where two blocks of HWs work hand in hand
for chaining.
If none, it may be better to abstract as synonymous API(No dequeue, no
enqueue) for the CPU use case.


More information about the dev mailing list