[dpdk-dev] [RFC] Accelerator API to chain packet processing functions

Jerin Jacob jerinjacobk at gmail.com
Tue Feb 18 06:12:00 CET 2020


On Thu, Feb 13, 2020 at 5:01 PM Doherty, Declan
<declan.doherty at intel.com> wrote:
>
> On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> > On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle at intel.com> wrote:
> >>
> >> Hi Jerin,
> >
> > Hi David,
> >
> >> Thanks for the comments. Please see replies below.
> >>
> >> Kind Regards,
> >> David
> >>
> >>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle at intel.com> wrote:
> >>>>
> >>>> Introduction
> >>>> ============
> >>>>
> >>>> This RFC introduces a new DPDK library, rte_accelerator.
> >>>>
> >>>> The main aim of this library is to provide a flexible and extensible way of
> >>> combining one or more packet-processing functions into a single operation,
> >>> thereby allowing these to be performed in parallel in optimized software
> >>> libraries or in a hardware accelerator. These functions can include
> >>> cryptography, compression and CRC/checksum calculation, while others can
> >>> potentially be added in the future. Performing these functions in parallel as a
> >>> single operation can enable a significant performance improvement.
> >>>>
> >>>>
> >>>> Background
> >>>> ==========
> >>>>
> >>>> There are a number of byte-wise operations which are present and
> >>> common across many access network data-plane pipelines, such as Cipher,
> >>> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> >>> prototyping has been done at Intel in relation to the 01.org access-network-
> >>> dataplanes project to prove that a significant performance improvement is
> >>> possible when such byte-wise operations are combined into a single pass of
> >>> packet data processing. This performance boost has been prototyped for
> >>> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> >>>
> >>>
> >>> Could you share the relative performance numbers to show the gain?
> >>
> >> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
> >> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
> >>
> >> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
> >> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
> >> 2) all packet processing functions are done as a single-pass operation in AESNI MB through rte_cryptodev
> >>
> >> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
> >>          - XGS-PON MAC (Crypto-CRC-BIP):
> >>                  - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> >>                  - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> >>                  37% cycle reduction
> >>
> >>          - DOCSIS MAC (Crypto-CRC):
> >>                  - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> >>                  - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> >>                  20% cycle reduction
> >>
> >> Adding the accelerator API will allow vendors to gain the benefits of these cycle savings.
> >
> > The numbers make sense. I have seen a similar performance improvement
> > when doing this in one pass with CPU instructions.
> >
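The percentages quoted above follow directly from the cycles/buf figures. A small sketch to reproduce the arithmetic (the function names are made up for illustration and are not part of the RFC or of dpdk-crypto-perf-tester):

```c
#include <assert.h>

/* Percentage of cycles saved when moving from independent stages
 * to a single-pass implementation. Both inputs are cycles/buf. */
static double cycle_reduction_pct(double staged, double single_pass)
{
    assert(staged > 0.0);
    return 100.0 * (staged - single_pass) / staged;
}

/* Throughput ratio implied by the same two cycle counts: for a fixed
 * frame size and clock, throughput is inversely proportional to
 * cycles/buf. */
static double throughput_speedup(double staged, double single_pass)
{
    assert(single_pass > 0.0);
    return staged / single_pass;
}
```

With the numbers from the thread, cycle_reduction_pct(1429, 896) comes out at roughly 37.3 and cycle_reduction_pct(1421, 1133) at roughly 20.3, which matches the 37% and 20% quoted. The implied speedups (about 1.59x and 1.25x) also agree with the Gbps ratios reported for the two pipelines.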
> >
> >>>> - XGS-PON MAC: Crypto-CRC-BIP
> >>>>          - Order:
> >>>>                  - Downstream: CRC, Encrypt, BIP
> >>>
> >>> I understand that if the chain has two operations then it may be possible to
> >>> have handcrafted SW code do both operations in one pass.
> >>> I also understand that the spec is agnostic on the number of passes required
> >>> to perform the xform, but to understand the SW/HW capability: in the above
> >>> case, "CRC, Encrypt, BIP", is it done in one pass in SW, three passes in SW,
> >>> or one pass using HW?
> >>
> >> [DC] The CRC, Encrypt, BIP chain is currently done as 1 pass in AESNI MB library SW.
> >> However, this could also be performed as a single pass in a HW accelerator.
> >
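To illustrate what "one pass" versus "three passes" means for the downstream order (CRC, Encrypt, BIP), here is a minimal C sketch. It is not the AESNI MB implementation: the cipher is a toy XOR keystream standing in for the real one, the BIP is simplified to an XOR parity over four byte lanes, and all names (chain_result, chain_three_pass, chain_single_pass) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Update a reflected CRC-32 (polynomial 0xEDB88320) with one byte. */
static uint32_t crc32_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}

struct chain_result {
    uint32_t crc; /* CRC-32 of the plaintext */
    uint32_t bip; /* simplified parity of the ciphertext, 4 byte lanes */
};

/* Three independent stages: each stage walks the whole buffer. */
static struct chain_result chain_three_pass(uint8_t *buf, size_t len, uint8_t key)
{
    struct chain_result r = { 0xFFFFFFFFu, 0 };

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)      /* stage 1: CRC over plaintext */
        r.crc = crc32_step(r.crc, buf[i]);
    for (size_t i = 0; i < len; i++)      /* stage 2: toy cipher, in place */
        buf[i] ^= (uint8_t)(key + i);
    for (size_t i = 0; i < len; i++)      /* stage 3: BIP over ciphertext */
        r.bip ^= (uint32_t)buf[i] << (8 * (i % 4));
    r.crc = ~r.crc;
    return r;
}

/* Same chain, but each byte is loaded once and fed to all three functions. */
static struct chain_result chain_single_pass(uint8_t *buf, size_t len, uint8_t key)
{
    struct chain_result r = { 0xFFFFFFFFu, 0 };

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint8_t b = buf[i];
        r.crc = crc32_step(r.crc, b);           /* CRC sees plaintext */
        b ^= (uint8_t)(key + i);                /* encrypt */
        buf[i] = b;
        r.bip ^= (uint32_t)b << (8 * (i % 4));  /* BIP sees ciphertext */
    }
    r.crc = ~r.crc;
    return r;
}
```

Both functions produce the same CRC, the same parity and the same ciphertext; the only difference is how many times the packet data is traversed, which is where the cycle savings discussed in this thread come from.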
> > As a specification, cascading the xform chains makes sense.
> > Do we have any HW that supports chaining more than "two" xforms
> > in one pass?
> > i.e. a real chaining function where two HW blocks work hand in hand
> > for chaining.
> > If none, it may be better to abstract this as a synchronous API (no dequeue,
> > no enqueue) for the CPU use case.
> >
>
> Were you thinking along the lines of a synchronous API option like that
> just introduced to cryptodev? i.e. something like
>
> uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
>                                  struct rte_accelerator_op ops[],
>                                  uint16_t nb_ops);

Yes. Maybe with a capability or preference so that the application can
denote its preferred usage model.
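
A rough sketch of how such a capability/preference could be expressed. Nothing below is defined by the RFC or by DPDK: the ACCEL_CAP_* flags, the accel_mode enum and accel_select_mode() are hypothetical names used only to illustrate the idea of a device advertising its supported models and the application stating a preference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bits a device/PMD could advertise. */
#define ACCEL_CAP_SYNC  (1u << 0) /* synchronous process(), no enqueue/dequeue */
#define ACCEL_CAP_ASYNC (1u << 1) /* enqueue/dequeue burst model */

/* Hypothetical usage models; NONE doubles as "no preference". */
enum accel_mode { ACCEL_MODE_NONE, ACCEL_MODE_SYNC, ACCEL_MODE_ASYNC };

/* Pick the usage model: honour the application's preference when the
 * device supports it, otherwise fall back to whatever is available. */
static enum accel_mode accel_select_mode(uint32_t dev_caps,
                                         enum accel_mode preferred)
{
    /* Reject capability bits this sketch does not know about. */
    assert((dev_caps & ~(ACCEL_CAP_SYNC | ACCEL_CAP_ASYNC)) == 0);

    if (preferred == ACCEL_MODE_SYNC && (dev_caps & ACCEL_CAP_SYNC))
        return ACCEL_MODE_SYNC;
    if (preferred == ACCEL_MODE_ASYNC && (dev_caps & ACCEL_CAP_ASYNC))
        return ACCEL_MODE_ASYNC;
    if (dev_caps & ACCEL_CAP_SYNC)
        return ACCEL_MODE_SYNC;
    if (dev_caps & ACCEL_CAP_ASYNC)
        return ACCEL_MODE_ASYNC;
    return ACCEL_MODE_NONE;
}
```

With something along these lines, a CPU-based PMD could advertise only the synchronous model and be driven through a call shaped like the rte_accelerator_process() prototype quoted above, while a HW PMD could advertise the enqueue/dequeue model.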

>
>

