<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=us-ascii"> <style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style> </head> <body dir="ltr"> <div class="elementToProof" style="font-family: "IntelOne Text"; font-size: 10pt; color: rgb(0, 0, 0);"> Series-acked-by: Kai Ji <kai.ji@intel.com></div> <div id="appendonsend"></div> <hr style="display:inline-block;width:98%" tabindex="-1"> <div id="divRplyFwdMsg" dir="ltr">From: Jack Bond-Preston <jack.bond-preston@foss.arm.com> Sent: 03 June 2024 17:01 Cc: dev@dpdk.org <dev@dpdk.org> Subject: [PATCH 0/5] OpenSSL PMD Optimisations <div> </div> </div> <div class="BodyFragment"> <div class="PlainText">The current implementation of the OpenSSL PMD has numerous performance issues. These revolve around certain operations being performed on a per buffer/packet basis, when they in fact could be performed less often - usually just during initialisation. [1/5]: fix GCM and CCM thread unsafe ctxs ========================================= Fixes a concurrency bug affecting AES-GCM and AES-CCM ciphers. This fix is implemented in the same naive (and inefficient) way as existing fixes for other ciphers, and is optimised later in [3/5]. [2/5]: only init 3DES-CTR key + impl once =========================================== Fixes an inefficient usage of the OpenSSL API for 3DES-CTR. [5/5]: only set cipher padding once ===================================== Fixes an inefficient usage of the OpenSSL API when disabling padding for ciphers. This behaviour was introduced in commit 6b283a03216e ("crypto/openssl: fix extra bytes written at end of data"), which fixes a bug - however, the EVP_CIPHER_CTX_set_padding() call was placed in a suboptimal location. This patch fixes this, preventing the padding being disabled for the cipher twice per buffer (with the second essentially being a wasteful no-op). [3/5] and [4/5]: per-queue-pair context clones ============================================== [3/5] and [4/5] aim to fix the key issue that was identified with the performance of the OpenSSL PMD - cloning of OpenSSL CTX structures on a per-buffer basis. This behaviour was introduced in 2019: > commit 67ab783b5d70aed77d9ee3f3ae4688a70c42a49a > Author: Thierry Herbelot <thierry.herbelot@6wind.com> > Date: Wed Sep 11 18:06:01 2019 +0200 > > crypto/openssl: use local copy for session contexts > > Session contexts are used for temporary storage when processing a > packet. > If packets for the same session are to be processed simultaneously on > multiple cores, separate contexts must be used. > > Note: with openssl 1.1.1 EVP_CIPHER_CTX can no longer be defined as a > variable on the stack: it must be allocated. This in turn reduces the > performance. Indeed, OpenSSL contexts (both cipher and authentication) cannot safely be used from multiple threads simultaneously, so this patch is required for correctness (assuming the need to support using the same openssl_session across multiple lcores). The downside here is that, as the commit message notes, this does reduce performance quite significantly. It is worth noting that while ciphers were already correctly cloned for cipher ops and auth ops, this behaviour was actually absent for combined ops (AES-GCM and AES-CCM), due to this part of the fix being reverted in 75adf1eae44f ("crypto/openssl: update HMAC routine with 3.0 EVP API"). [1/5] addressed this issue of correctness, and [3/5] implements a more performant fix on top of this. These two patches aim to remedy the performance loss caused by the introduction of cipher context cloning. An approach of maintaining an array of pointers, inside the OpenSSL session structure, to per-queue-pair clones of the OpenSSL CTXs is used. Consequently, there is no need to perform cloning of the context for every buffer - whilst keeping the guarantee that one context is not being used on multiple lcores simultaneously. The cloning of the main context into the array's per-qp context entries is performed lazily/as-needed. There are some trade-offs/judgement calls that were made: - The first call for a queue pair for an op from a given openssl_session will be roughly equivalent to an op from the existing implementation. However, all subsequent calls for the same openssl_session on the same queue pair will not incur this extra work. Thus, whilst the first op on a session on a queue pair will be slower than subsequent ones, this slower first op is still equivalent to *every* op without these patches. The alternative would be pre-populating this array when the openssl_session is initialised, but this would waste memory and processing time if not all queue pairs end up doing work from this openssl_session. - Each pointer inside the array of per-queue-pair pointers has not been cache aligned, because updates only occur on the first buffer per-queue-pair per-session, making the impact of false sharing negligible compared to the extra memory usage of the alignment. [3/5] implements this approach for cipher contexts (EVP_CIPHER_CTX), and [4/5] for authentication contexts (EVP_MD_CTX, EVP_MAC_CTX, etc.). Compared to before, this approach comes with a drawback of extra memory usage - the cause of which is twofold: - The openssl_session struct has grown to accommodate the array, with a length equal to the number of qps in use multiplied by 2 (to allow auth and cipher contexts), per openssl_session structure. openssl_pmd_sym_session_get_size() is modified to return a size large enough to support this. At the time this function is called (before the user creates the session mempool), the PMD may not yet be configured with the requested number of queue pairs. In this case, the maximum number of queue pairs allowed by the PMD (current default is 8) is used, to ensure the allocations will be large enough. Thus, the user may be able to slightly reduce the memory used by OpenSSL sessions by first configuring the PMD's queue pair count, then requesting the size of the sessions and creating the session mempool. There is also a special case where the number of queue pairs is 1, in which case the array is not allocated or used at all. Overall, this memory usage by the session structure itself is worst-case 128 bytes per session (the default maximum number of queue pairs allowed by the OpenSSL PMD is 8, so 8qps * 8bytes * 2ctxs), plus the extra space to store the length of the array and auth context offset, resulting in an increase in total size from 152 bytes to 280 bytes. - The lifetime of OpenSSL's EVP CTX allocations is increased. Previously, the clones were allocated and freed per-operation, meaning the lifetime of the allocations was only the duration of the operation. Now, these allocations are lifted out to share the lifetime of the session. This results in situations with many long-lived sessions shared across many queue pairs causing an increase in total memory usage. Performance Comparisons ======================= Benchmarks were collected using dpdk-test-crypto-perf, for the following configurations: - The version of OpenSSL used was 3.3.0 - The hardware used for the benchmarks was the following two machine configs: * AArch64: Ampere Altra Max (128 N1 cores, 1 socket) * x86 : Intel Xeon Platinum 8480+ (128 cores, 2 sockets) - The buffer sizes tested were (in bytes): 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192. - The worker lcore counts tested were: 1, 2, 4, 8 - The algorithms and associated operations tested were: * Cipher-only AES-CBC-128 (Encrypt and Decrypt) * Cipher-only 3DES-CTR-128 (Encrypt only) * Auth-only SHA1-HMAC (Generate only) * Auth-only AES-CMAC (Generate only) * AESNI AES-GCM-128 (Encrypt and Decrypt) * Cipher-then-Auth AES-CBC-128-HMAC-SHA1 (Encrypt only) - EAL was configured with Legacy Memory Mode enabled. The application was always run on isolated CPU cores on the same socket. The sets of patches applied for benchmarks were: - No patches applied (HEAD of upstream main) - [1/5] applied (fixes AES-GCM and AES-CCM concurrency issue) - [1-2/5] applied (adds 3DES-CTR fix) - [1-3/5] applied (adds per-qp cipher contexts) - [1-4/5] applied (adds per-qp auth contexts) - [1-5/5] applied (adds cipher padding setting fix) For brevity, all results included in the cover letter are from the Arm platform, with all patches applied. Very similar results were achieved on the Intel platform, and the full set of results, including the Intel ones, is available. AES-CBC-128 Encrypt Throughput Speedup -------------------------------------- A comparison of the throughput speedup achieved between the base (main branch HEAD) and optimised (all patches applied) versions of the PMD was carried out, with the varying worker lcore counts. 1 worker lcore: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 0.84 | 2.04 | 144.6% | | 64 | 1.61 | 3.72 | 131.3% | | 128 | 2.97 | 6.24 | 110.2% | | 256 | 5.14 | 9.42 | 83.2% | | 512 | 8.10 | 12.62 | 55.7% | | 1024 | 11.37 | 15.18 | 33.5% | | 2048 | 14.26 | 16.93 | 18.7% | | 4096 | 16.35 | 17.97 | 9.9% | | 8192 | 17.61 | 18.51 | 5.1% | 8 worker lcores: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 1.53 | 16.49 | 974.8% | | 64 | 3.04 | 29.85 | 881.3% | | 128 | 5.96 | 50.07 | 739.8% | | 256 | 10.54 | 75.53 | 616.5% | | 512 | 21.60 | 101.14 | 368.2% | | 1024 | 41.27 | 121.56 | 194.6% | | 2048 | 72.99 | 135.40 | 85.5% | | 4096 | 103.39 | 143.76 | 39.0% | | 8192 | 125.48 | 148.06 | 18.0% | It is evident from these results that the speedup with 8 worker lcores is significantly larger. This was surprising at first, so profiling of the existing PMD implementation with multiple lcores was performed. Every EVP_CIPHER_CTX contains an EVP_CIPHER, which represents the actual cipher algorithm implementation backing this context. OpenSSL holds only one instance of each EVP_CIPHER, and uses a reference counter to track freeing them. This means that the original implementation spends a very high amount of time incrementing and decrementing this reference counter in EVP_CIPHER_CTX_copy and EVP_CIPHER_CTX_free, respectively. For small buffer sizes, and with more lcores, this reference count modification happens extremely frequently - thrashing this refcount on all lcores and causing a huge slowdown. The optimised version avoids this by not performing the copy and free (and thus associated refcount modifications) on every buffer. SHA1-HMAC Generate Throughput Speedup ------------------------------------- 1 worker lcore: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 0.32 | 0.76 | 135.9% | | 64 | 0.63 | 1.43 | 126.9% | | 128 | 1.21 | 2.60 | 115.4% | | 256 | 2.23 | 4.42 | 98.1% | | 512 | 3.88 | 6.80 | 75.5% | | 1024 | 6.13 | 9.30 | 51.8% | | 2048 | 8.65 | 11.39 | 31.7% | | 4096 | 10.90 | 12.85 | 17.9% | | 8192 | 12.54 | 13.74 | 9.5% | 8 worker lcores: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 0.49 | 5.99 | 1110.3% | | 64 | 0.98 | 11.30 | 1051.8% | | 128 | 1.95 | 20.67 | 960.3% | | 256 | 3.90 | 35.18 | 802.4% | | 512 | 7.83 | 54.13 | 590.9% | | 1024 | 15.80 | 74.11 | 369.2% | | 2048 | 31.30 | 90.97 | 190.6% | | 4096 | 58.59 | 102.70 | 75.3% | | 8192 | 85.93 | 109.88 | 27.9% | We can see the results are similar as for AES-CBC-128 cipher operations. AES-GCM-128 Encrypt Throughput Speedup -------------------------------------- As the results below show, [1/5] causes a slowdown in AES-GCM, as the fix for the concurrency bug introduces a large overhead. 1 worker lcore: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 64 | 2.60 | 1.31 | -49.5% | | 256 | 7.69 | 4.45 | -42.1% | | 1024 | 15.33 | 11.30 | -26.3% | | 2048 | 18.74 | 15.37 | -18.0% | | 4096 | 21.11 | 18.80 | -10.9% | 8 worker lcores: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 64 | 19.94 | 2.83 | -85.8% | | 256 | 58.84 | 11.00 | -81.3% | | 1024 | 119.71 | 42.46 | -64.5% | | 2048 | 147.69 | 80.91 | -45.2% | | 4096 | 167.39 | 121.25 | -27.6% | However, applying [3/5] rectifies most of this performance drop, as shown by the following results with it applied. 1 worker lcore: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 1.39 | 1.28 | -7.8% | | 64 | 2.60 | 2.44 | -6.2% | | 128 | 4.77 | 4.45 | -6.8% | | 256 | 7.69 | 7.22 | -6.1% | | 512 | 11.31 | 10.97 | -3.0% | | 1024 | 15.33 | 15.07 | -1.7% | | 2048 | 18.74 | 18.51 | -1.2% | | 4096 | 21.11 | 20.96 | -0.7% | | 8192 | 22.55 | 22.50 | -0.2% | 8 worker lcores: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 10.59 | 10.35 | -2.3% | | 64 | 19.94 | 19.46 | -2.4% | | 128 | 36.32 | 35.64 | -1.9% | | 256 | 58.84 | 57.80 | -1.8% | | 512 | 87.38 | 87.37 | -0.0% | | 1024 | 119.71 | 120.22 | 0.4% | | 2048 | 147.69 | 147.93 | 0.2% | | 4096 | 167.39 | 167.48 | 0.1% | | 8192 | 179.80 | 179.87 | 0.0% | The results show that, for AES-GCM-128 encrypt, there is still a small slowdown at smaller buffer sizes. This represents the overhead required to make AES-GCM thread safe. These patches have rectified this lack of safety without causing a significant performance impact, especially compared to naive per-buffer cipher context cloning. 3DES-CTR Encrypt ---------------- 1 worker lcore: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 0.12 | 0.22 | 89.7% | | 64 | 0.16 | 0.22 | 43.6% | | 128 | 0.18 | 0.23 | 22.3% | | 256 | 0.20 | 0.23 | 10.8% | | 512 | 0.21 | 0.23 | 5.1% | | 1024 | 0.22 | 0.23 | 2.7% | | 2048 | 0.22 | 0.23 | 1.3% | | 4096 | 0.23 | 0.23 | 0.4% | | 8192 | 0.23 | 0.23 | 0.4% | 8 worker lcores: | buffer sz (B) | prev (Gbps) | optimised (Gbps) | uplift | |-----------------+---------------+--------------------+----------| | 32 | 0.68 | 1.77 | 160.1% | | 64 | 1.00 | 1.78 | 78.3% | | 128 | 1.29 | 1.80 | 39.6% | | 256 | 1.50 | 1.80 | 19.8% | | 512 | 1.64 | 1.80 | 10.0% | | 1024 | 1.72 | 1.81 | 5.1% | | 2048 | 1.76 | 1.81 | 2.7% | | 4096 | 1.78 | 1.81 | 1.5% | | 8192 | 1.80 | 1.81 | 0.7% | [1/4] yields good results - the performance increase is high for lower buffer sizes, where the cost of re-initialising the extra parameters is more significant compared to the cost of the cipher operation. Full Data and Additional Bar Charts ----------------------------------- The full raw data (CSV) and a PDF of all generated figures (all generated speedup tables, plus additional bar charts showing the throughput comparison across different sets of applied patches) - for both Intel and Arm platforms - are available. However, I'm not sure of the ettiquette regarding attachments of such files, so I haven't attached them for now. If you are interested in reviewing them, please reach out and I will find a way to get them to you. Jack Bond-Preston (5): crypto/openssl: fix GCM and CCM thread unsafe ctxs crypto/openssl: only init 3DES-CTR key + impl once crypto/openssl: per-qp cipher context clones crypto/openssl: per-qp auth context clones crypto/openssl: only set cipher padding once drivers/crypto/openssl/compat.h | 26 ++ drivers/crypto/openssl/openssl_pmd_private.h | 26 +- drivers/crypto/openssl/rte_openssl_pmd.c | 244 ++++++++++++++----- drivers/crypto/openssl/rte_openssl_pmd_ops.c | 35 ++- 4 files changed, 260 insertions(+), 71 deletions(-) -- 2.34.1 </div> </div> </body> </html>