Series-acked-by: Kai Ji <kai.ji@intel.com>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Jack Bond-Preston <jack.bond-preston@foss.arm.com><br>
<b>Sent:</b> 03 June 2024 17:01<br>
<b>Cc:</b> dev@dpdk.org <dev@dpdk.org><br>
<b>Subject:</b> [PATCH 0/5] OpenSSL PMD Optimisations</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">The current implementation of the OpenSSL PMD has numerous performance issues.<br>
These revolve around certain operations being performed on a per buffer/packet<br>
basis, when they in fact could be performed less often - usually just during<br>
initialisation.<br>
<br>
<br>
[1/5]: fix GCM and CCM thread unsafe ctxs
=========================================
Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is
implemented in the same naive (and inefficient) way as existing fixes for other
ciphers, and is optimised later in [3/5].


[2/5]: only init 3DES-CTR key + impl once
=========================================
Fixes an inefficient usage of the OpenSSL API for 3DES-CTR, where the key and
cipher implementation were re-initialised for every buffer rather than once at
session initialisation.
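
To illustrate the general OpenSSL pattern involved (a sketch only - AES-CTR
stands in here, since OpenSSL provides no native 3DES-CTR cipher, and the
helper names are hypothetical): the cipher implementation and key can be bound
to the context once at session init, after which per-buffer calls pass NULL
for both so that only the IV is refreshed.

    #include <openssl/evp.h>

    /* Hypothetical session-init helper: bind implementation and key once. */
    static int sess_init(EVP_CIPHER_CTX *ctx, const unsigned char *key)
    {
        return EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, NULL);
    }

    /* Per-buffer path: cipher and key stay NULL, so OpenSSL keeps the
     * existing key schedule and implementation and only loads the new IV. */
    static int per_buffer_reinit(EVP_CIPHER_CTX *ctx, const unsigned char *iv)
    {
        return EVP_EncryptInit_ex(ctx, NULL, NULL, NULL, iv);
    }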


[5/5]: only set cipher padding once
===================================
Fixes an inefficient usage of the OpenSSL API when disabling padding for
ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl:
fix extra bytes written at end of data"), which fixes a bug - however, the
EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location.

This patch fixes this, preventing the padding from being disabled for the
cipher twice per buffer (with the second call essentially being a wasteful
no-op).
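
A minimal sketch of the idea (helper names are illustrative, not the PMD's):
disabling padding belongs with one-time context setup, not in the per-buffer
path.

    #include <openssl/evp.h>

    /* One-time setup: initialise the cipher and disable padding once. */
    static int ctx_setup(EVP_CIPHER_CTX *ctx, const unsigned char *key,
                         const unsigned char *iv)
    {
        if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv) != 1)
            return -1;
        return EVP_CIPHER_CTX_set_padding(ctx, 0) == 1 ? 0 : -1;
    }

    /* Hot path: just the update call, with no redundant per-buffer
     * EVP_CIPHER_CTX_set_padding() invocation. */
    static int encrypt_buf(EVP_CIPHER_CTX *ctx, const unsigned char *in,
                           int len, unsigned char *out)
    {
        int outl;
        return EVP_EncryptUpdate(ctx, out, &outl, in, len) == 1 ? outl : -1;
    }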


[3/5] and [4/5]: per-queue-pair context clones
==============================================
[3/5] and [4/5] aim to fix the key issue identified with the performance of
the OpenSSL PMD - the cloning of OpenSSL CTX structures on a per-buffer basis.
This behaviour was introduced in 2019:
> commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a
> Author: Thierry Herbelot <thierry.herbelot@6wind.com>
> Date:   Wed Sep 11 18:06:01 2019 +0200
>
>     crypto/openssl: use local copy for session contexts
>
>     Session contexts are used for temporary storage when processing a
>     packet.
>     If packets for the same session are to be processed simultaneously on
>     multiple cores, separate contexts must be used.
>
>     Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a
>     variable on the stack: it must be allocated. This in turn reduces the
>     performance.

Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used
from multiple threads simultaneously, so this patch is required for correctness
(assuming the need to support using the same openssl_session across multiple
lcores). The downside here is that, as the commit message notes, this does
reduce performance quite significantly.

It is worth noting that while contexts were already correctly cloned for cipher
ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM
and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f
("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addresses this
correctness issue, and [3/5] implements a more performant fix on top of it.

These two patches aim to remedy the performance loss caused by the introduction
of cipher context cloning. The approach taken is to maintain, inside the
OpenSSL session structure, an array of pointers to per-queue-pair clones of the
OpenSSL CTXs. Consequently, there is no need to clone the context for every
buffer, whilst keeping the guarantee that one context is never used on multiple
lcores simultaneously. The cloning of the main context into the array's per-qp
context entries is performed lazily/as-needed. Some trade-offs/judgement calls
were made:
- The first op from a given openssl_session on a given queue pair will cost
  roughly the same as an op from the existing implementation. However, all
  subsequent ops for the same openssl_session on the same queue pair will not
  incur this extra work. Thus, whilst the first op on a session on a queue
  pair is slower than subsequent ones, this slower first op is still
  equivalent to *every* op without these patches. The alternative would be to
  pre-populate this array when the openssl_session is initialised, but this
  would waste memory and processing time if not all queue pairs end up doing
  work from this openssl_session.
- The pointers inside the array of per-queue-pair pointers are not cache
  aligned, because updates only occur on the first buffer per-queue-pair
  per-session, making the impact of false sharing negligible compared to the
  extra memory usage of the alignment.

[3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5]
for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.).
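
A minimal sketch of this approach (struct and function names here are
illustrative, not the PMD's actual identifiers):

    #include <stdint.h>
    #include <openssl/evp.h>

    /* Hypothetical session layout: one lazily-cloned cipher context slot per
     * queue pair, cloned from a template context set up once per session. */
    struct session_sketch {
        EVP_CIPHER_CTX *template_ctx; /* initialised once at session init */
        uint16_t nb_qps;              /* array length, fixed at session init */
        EVP_CIPHER_CTX *qp_ctx[];     /* one slot per queue pair */
    };

    static EVP_CIPHER_CTX *
    get_qp_cipher_ctx(struct session_sketch *sess, uint16_t qp_id)
    {
        /* Fast path: this qp already has its clone - no copying, no
         * EVP_CIPHER refcount traffic. */
        if (sess->qp_ctx[qp_id] != NULL)
            return sess->qp_ctx[qp_id];

        /* Slow path, hit once per (session, queue pair): clone the template.
         * A queue pair is only ever serviced by one lcore at a time, so the
         * slot needs no locking. */
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        if (ctx == NULL || EVP_CIPHER_CTX_copy(ctx, sess->template_ctx) != 1) {
            EVP_CIPHER_CTX_free(ctx);
            return NULL;
        }
        sess->qp_ctx[qp_id] = ctx;
        return ctx;
    }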

Compared to before, this approach comes with a drawback of extra memory usage,
the cause of which is twofold:
- The openssl_session struct has grown to accommodate the array, whose length
  is the number of qps in use multiplied by 2 (to allow for both auth and
  cipher contexts). openssl_pmd_sym_session_get_size() is modified to return a
  size large enough to support this. At the time this function is called
  (before the user creates the session mempool), the PMD may not yet be
  configured with the requested number of queue pairs. In this case, the
  maximum number of queue pairs allowed by the PMD (current default is 8) is
  used, to ensure the allocations will be large enough. Thus, the user may be
  able to slightly reduce the memory used by OpenSSL sessions by first
  configuring the PMD's queue pair count, then requesting the size of the
  sessions and creating the session mempool. There is also a special case
  where the number of queue pairs is 1, in which case the array is not
  allocated or used at all. Overall, the memory usage by the session structure
  itself is worst-case 128 bytes per session (the default maximum number of
  queue pairs allowed by the OpenSSL PMD is 8, so 8 qps * 8 bytes * 2 ctxs),
  plus the extra space to store the length of the array and the auth context
  offset, resulting in an increase in total size from 152 bytes to 280 bytes.
- The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the
  clones were allocated and freed per-operation, so the lifetime of the
  allocations was only the duration of the operation. Now, these allocations
  are lifted out to share the lifetime of the session. As a result, workloads
  with many long-lived sessions shared across many queue pairs will see an
  increase in total memory usage.
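
As a back-of-envelope check of the worst-case arithmetic above (a standalone
snippet, not PMD code; assumes 64-bit pointers and the default limit of 8
queue pairs):

    #include <stdio.h>

    int main(void)
    {
        const unsigned max_qps = 8;     /* default PMD queue pair maximum */
        const unsigned ctxs_per_qp = 2; /* one cipher + one auth slot */
        const unsigned ptr_bytes = 8;   /* 64-bit pointers */

        /* 8 qps * 2 ctxs * 8 B = 128 B of per-qp context pointers */
        printf("per-qp ctx pointer array: %u bytes\n",
               max_qps * ctxs_per_qp * ptr_bytes);
        return 0;
    }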


Performance Comparisons
=======================
Benchmarks were collected using dpdk-test-crypto-perf, for the following
configurations:
- The version of OpenSSL used was 3.3.0.
- The hardware used for the benchmarks was the following two machine configs:
  * AArch64: Ampere Altra Max (128 N1 cores, 1 socket)
  * x86:     Intel Xeon Platinum 8480+ (128 cores, 2 sockets)
- The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048,
  4096, 8192.
- The worker lcore counts tested were: 1, 2, 4, 8.
- The algorithms and associated operations tested were:
  * Cipher-only AES-CBC-128 (Encrypt and Decrypt)
  * Cipher-only 3DES-CTR-128 (Encrypt only)
  * Auth-only SHA1-HMAC (Generate only)
  * Auth-only AES-CMAC (Generate only)
  * AEAD AES-GCM-128 (Encrypt and Decrypt)
  * Cipher-then-Auth AES-CBC-128-HMAC-SHA1 (Encrypt only)
- EAL was configured with Legacy Memory Mode enabled.
The application was always run on isolated CPU cores on the same socket.
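
For illustration, a 1-worker-lcore AES-CBC-128 encrypt run of the kind
described above might be invoked roughly as follows (an indicative command
line, not the exact one used):

    dpdk-test-crypto-perf -l 0,1 --legacy-mem --vdev crypto_openssl -- \
        --ptest throughput --devtype crypto_openssl \
        --optype cipher-only --cipher-algo aes-cbc --cipher-op encrypt \
        --cipher-key-sz 16 \
        --buffer-sz 32,64,128,256,512,1024,2048,4096,8192 --burst-sz 32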

The sets of patches applied for benchmarks were:
- No patches applied (HEAD of upstream main)
- [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue)
- [1-2/5] applied (adds 3DES-CTR fix)
- [1-3/5] applied (adds per-qp cipher contexts)
- [1-4/5] applied (adds per-qp auth contexts)
- [1-5/5] applied (adds cipher padding setting fix)

For brevity, all results included in the cover letter are from the Arm
platform, with all patches applied. Very similar results were achieved on the
Intel platform, and the full set of results, including the Intel ones, is
available.

AES-CBC-128 Encrypt Throughput Speedup
--------------------------------------
A comparison of the throughput speedup achieved between the base (main branch
HEAD) and optimised (all patches applied) versions of the PMD was carried out
for the various worker lcore counts.

1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.84 |             2.04 |  144.6% |
|            64 |        1.61 |             3.72 |  131.3% |
|           128 |        2.97 |             6.24 |  110.2% |
|           256 |        5.14 |             9.42 |   83.2% |
|           512 |        8.10 |            12.62 |   55.7% |
|          1024 |       11.37 |            15.18 |   33.5% |
|          2048 |       14.26 |            16.93 |   18.7% |
|          4096 |       16.35 |            17.97 |    9.9% |
|          8192 |       17.61 |            18.51 |    5.1% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        1.53 |            16.49 |  974.8% |
|            64 |        3.04 |            29.85 |  881.3% |
|           128 |        5.96 |            50.07 |  739.8% |
|           256 |       10.54 |            75.53 |  616.5% |
|           512 |       21.60 |           101.14 |  368.2% |
|          1024 |       41.27 |           121.56 |  194.6% |
|          2048 |       72.99 |           135.40 |   85.5% |
|          4096 |      103.39 |           143.76 |   39.0% |
|          8192 |      125.48 |           148.06 |   18.0% |

It is evident from these results that the speedup with 8 worker lcores is
significantly larger. This was surprising at first, so the existing PMD
implementation was profiled with multiple lcores. Every EVP_CIPHER_CTX
contains an EVP_CIPHER, which represents the actual cipher algorithm
implementation backing that context. OpenSSL holds only one instance of each
EVP_CIPHER, and uses a reference counter to track freeing it. This means that
the original implementation spends a large amount of time incrementing and
decrementing this reference counter in EVP_CIPHER_CTX_copy and
EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more
lcores, this reference count modification happens extremely frequently,
thrashing the refcount across all lcores and causing a huge slowdown. The
optimised version avoids this by not performing the copy and free (and thus
the associated refcount modifications) on every buffer.
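
To illustrate the contrast (a sketch with hypothetical helper names, not the
PMD's actual code):

    #include <openssl/evp.h>

    /* Old pattern: a clone per buffer. The copy takes a reference on the
     * shared EVP_CIPHER and the free drops it again - two contended atomic
     * refcount operations per buffer, on every lcore. */
    static int per_buffer_clone(EVP_CIPHER_CTX *session_ctx,
                                const unsigned char *in, int len,
                                unsigned char *out)
    {
        int outl, ok;
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        if (ctx == NULL)
            return -1;
        ok = EVP_CIPHER_CTX_copy(ctx, session_ctx) == 1 &&
             EVP_EncryptUpdate(ctx, out, &outl, in, len) == 1;
        EVP_CIPHER_CTX_free(ctx);
        return ok ? outl : -1;
    }

    /* New pattern: a long-lived per-qp clone means the hot path touches no
     * shared refcount at all. */
    static int per_qp_clone(EVP_CIPHER_CTX *qp_ctx, const unsigned char *in,
                            int len, unsigned char *out)
    {
        int outl;
        return EVP_EncryptUpdate(qp_ctx, out, &outl, in, len) == 1 ? outl : -1;
    }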

SHA1-HMAC Generate Throughput Speedup
-------------------------------------
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.32 |             0.76 |  135.9% |
|            64 |        0.63 |             1.43 |  126.9% |
|           128 |        1.21 |             2.60 |  115.4% |
|           256 |        2.23 |             4.42 |   98.1% |
|           512 |        3.88 |             6.80 |   75.5% |
|          1024 |        6.13 |             9.30 |   51.8% |
|          2048 |        8.65 |            11.39 |   31.7% |
|          4096 |       10.90 |            12.85 |   17.9% |
|          8192 |       12.54 |            13.74 |    9.5% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.49 |             5.99 | 1110.3% |
|            64 |        0.98 |            11.30 | 1051.8% |
|           128 |        1.95 |            20.67 |  960.3% |
|           256 |        3.90 |            35.18 |  802.4% |
|           512 |        7.83 |            54.13 |  590.9% |
|          1024 |       15.80 |            74.11 |  369.2% |
|          2048 |       31.30 |            90.97 |  190.6% |
|          4096 |       58.59 |           102.70 |   75.3% |
|          8192 |       85.93 |           109.88 |   27.9% |

We can see that the results are similar to those for the AES-CBC-128 cipher
operations.

AES-GCM-128 Encrypt Throughput Speedup
--------------------------------------
As the results below show, [1/5] on its own causes a slowdown in AES-GCM, as
the fix for the concurrency bug introduces a large overhead.

1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            64 |        2.60 |             1.31 |  -49.5% |
|           256 |        7.69 |             4.45 |  -42.1% |
|          1024 |       15.33 |            11.30 |  -26.3% |
|          2048 |       18.74 |            15.37 |  -18.0% |
|          4096 |       21.11 |            18.80 |  -10.9% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            64 |       19.94 |             2.83 |  -85.8% |
|           256 |       58.84 |            11.00 |  -81.3% |
|          1024 |      119.71 |            42.46 |  -64.5% |
|          2048 |      147.69 |            80.91 |  -45.2% |
|          4096 |      167.39 |           121.25 |  -27.6% |

However, applying [3/5] rectifies most of this performance drop, as shown by
the following results with it applied.

1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        1.39 |             1.28 |   -7.8% |
|            64 |        2.60 |             2.44 |   -6.2% |
|           128 |        4.77 |             4.45 |   -6.8% |
|           256 |        7.69 |             7.22 |   -6.1% |
|           512 |       11.31 |            10.97 |   -3.0% |
|          1024 |       15.33 |            15.07 |   -1.7% |
|          2048 |       18.74 |            18.51 |   -1.2% |
|          4096 |       21.11 |            20.96 |   -0.7% |
|          8192 |       22.55 |            22.50 |   -0.2% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |       10.59 |            10.35 |   -2.3% |
|            64 |       19.94 |            19.46 |   -2.4% |
|           128 |       36.32 |            35.64 |   -1.9% |
|           256 |       58.84 |            57.80 |   -1.8% |
|           512 |       87.38 |            87.37 |   -0.0% |
|          1024 |      119.71 |           120.22 |    0.4% |
|          2048 |      147.69 |           147.93 |    0.2% |
|          4096 |      167.39 |           167.48 |    0.1% |
|          8192 |      179.80 |           179.87 |    0.0% |

The results show that, for AES-GCM-128 encrypt, there is still a small slowdown
at smaller buffer sizes. This represents the overhead required to make AES-GCM
thread-safe. These patches have rectified this lack of safety without causing a
significant performance impact, especially compared to naive per-buffer cipher
context cloning.

3DES-CTR Encrypt
----------------
1 worker lcore:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.12 |             0.22 |   89.7% |
|            64 |        0.16 |             0.22 |   43.6% |
|           128 |        0.18 |             0.23 |   22.3% |
|           256 |        0.20 |             0.23 |   10.8% |
|           512 |        0.21 |             0.23 |    5.1% |
|          1024 |        0.22 |             0.23 |    2.7% |
|          2048 |        0.22 |             0.23 |    1.3% |
|          4096 |        0.23 |             0.23 |    0.4% |
|          8192 |        0.23 |             0.23 |    0.4% |

8 worker lcores:
| buffer sz (B) | prev (Gbps) | optimised (Gbps) |  uplift |
|---------------+-------------+------------------+---------|
|            32 |        0.68 |             1.77 |  160.1% |
|            64 |        1.00 |             1.78 |   78.3% |
|           128 |        1.29 |             1.80 |   39.6% |
|           256 |        1.50 |             1.80 |   19.8% |
|           512 |        1.64 |             1.80 |   10.0% |
|          1024 |        1.72 |             1.81 |    5.1% |
|          2048 |        1.76 |             1.81 |    2.7% |
|          4096 |        1.78 |             1.81 |    1.5% |
|          8192 |        1.80 |             1.81 |    0.7% |

[2/5] yields good results - the performance increase is high for lower buffer
sizes, where the cost of re-initialising the extra parameters is more
significant compared to the cost of the cipher operation.

Full Data and Additional Bar Charts
-----------------------------------
The full raw data (CSV) and a PDF of all generated figures (all generated
speedup tables, plus additional bar charts showing the throughput comparison
across different sets of applied patches) - for both Intel and Arm platforms -
are available. However, I'm not sure of the etiquette regarding attachments of
such files, so I haven't attached them for now. If you are interested in
reviewing them, please reach out and I will find a way to get them to you.

Jack Bond-Preston (5):
  crypto/openssl: fix GCM and CCM thread unsafe ctxs
  crypto/openssl: only init 3DES-CTR key + impl once
  crypto/openssl: per-qp cipher context clones
  crypto/openssl: per-qp auth context clones
  crypto/openssl: only set cipher padding once

 drivers/crypto/openssl/compat.h              |  26 ++
 drivers/crypto/openssl/openssl_pmd_private.h |  26 +-
 drivers/crypto/openssl/rte_openssl_pmd.c     | 244 ++++++++++++++-----
 drivers/crypto/openssl/rte_openssl_pmd_ops.c |  35 ++-
 4 files changed, 260 insertions(+), 71 deletions(-)

-- 
2.34.1