|SUCCESS| [v3] mempool: improve cache behaviour and performance
dpdklab at iol.unh.edu
Thu Apr 9 18:45:57 CEST 2026
Test-Label: ai-code-review
Test-Status: SUCCESS
https://dpdk.org/patch/162964
_AI Code Review_
=== Patch Review: series.patch (via Claude) ===
Review date: 2026-04-09
git sha: 20260409110558.961159-1-mb at smartsharesystems.com
Author: Morten Brorup <mb at smartsharesystems.com>
Subject: [PATCH v3] mempool: improve cache behaviour and performance
This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 2e54fc4466..aa2d51bbd5 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1390,24 +1397,27 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
> RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
>
> - __rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> - __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> - __rte_assume(cache->len <= cache->flushthresh);
> - if (likely(cache->len + n <= cache->flushthresh)) {
> + __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> + __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> + __rte_assume(cache->len <= cache->size);
> + if (likely(cache->len + n <= cache->size)) {
> /* Sufficient room in the cache for the objects. */
> cache_objs = &cache->objs[cache->len];
> cache->len += n;
> - } else if (n <= cache->flushthresh) {
> + } else if (n <= cache->size / 2) {
Does this code cause issues in corner cases? With cache->size == 1, the
integer division makes this branch evaluate n <= 0, which is false for
every n >= 1. If n can be 1, does a single-element cache loop forever or
become unable to put?
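To make the concern concrete, here is a minimal standalone sketch (not
the DPDK code; the values are hypothetical) of the branch arithmetic
with the size-1 cache the question posits:

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int cache_size = 1; /* smallest non-zero cache size */
        unsigned int cache_len = 1;  /* cache already holds one object */
        unsigned int n = 1;          /* caller puts a single object */

        /* First branch: room in the cache? 1 + 1 <= 1 is false. */
        int fits = cache_len + n <= cache_size;

        /* Second branch: with size == 1, size / 2 is 0, so
         * n <= 0 is false for every n >= 1.
         */
        int half = n <= cache_size / 2;

        printf("fits=%d half=%d\n", fits, half); /* fits=0 half=0 */
        assert(!fits && !half);

        /* Both branches reject the put; whether a size-1 cache can
         * still put depends entirely on the fall-through path.
         */
        return 0;
    }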
> + /* Fill the cache from the backend; fetch (size / 2) objects. */
> + ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
Can this code corrupt cache->objs[]? If cache->len isn't zero and
cache->size / 2 objects are fetched into objs[0], the fetched objects
land on top of the live entries at the front of the array; do the
existing cache->len objects and the new ones step on each other,
leaking the pointers that were already cached?
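If the concern holds, a defensive variant might look like the sketch
below (a hypothetical helper, not in the patch; it assumes the refill
must only ever run on an empty cache):

    #include <rte_debug.h>
    #include <rte_mempool.h>

    /* Hypothetical guarded refill: fetch size / 2 objects into the
     * front of the cache, but only when no live entries would be
     * overwritten.
     */
    static inline int
    cache_refill_front(struct rte_mempool *mp,
            struct rte_mempool_cache *cache)
    {
        unsigned int fetch = cache->size / 2;

        /* With len != 0, writing at objs[0] would clobber
         * objs[0..len-1]; those pointers would be leaked and
         * cache->len would no longer describe the array contents.
         */
        RTE_ASSERT(cache->len == 0);

        if (rte_mempool_ops_dequeue_bulk(mp, cache->objs, fetch) < 0)
            return -1;

        cache->len = fetch;
        return 0;
    }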
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> index 9af275cd9d..59a6c22e98 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> @@ -148,14 +148,20 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
> /* Can this be satisfied from the cache? */
> if (cache->len < IDPF_RXQ_REARM_THRESH) {
> /* No. Backfill the cache first, and then fill from it */
> - uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> - cache->len);
>
> - /* How many do we require i.e. number to fill the cache + the request */
> + /* Backfill would exceed the cache bounce buffer limit? */
> + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> + if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
> + idpf_singleq_rearm_common(rxq);
> + return;
> + }
> +
> + /* Backfill the cache from the backend; fetch (size / 2) objects. */
> + __rte_assume(cache->len < cache->size / 2);
> int ret = rte_mempool_ops_dequeue_bulk
> - (rxq->mp, &cache->objs[cache->len], req);
> + (rxq->mp, &cache->objs[cache->len], cache->size / 2);
Can this code overflow cache->objs[]? If cache->len is non-zero, placing
cache->size / 2 objects starting at cache->objs[cache->len] could run
past the end of the array unless the __rte_assume() hints above are
guaranteed to hold at run time.
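A worked bound, assuming objs[] holds RTE_MEMPOOL_CACHE_MAX_SIZE entries
and the two __rte_assume() hints above actually hold at run time (both
are assumptions, not facts established by the hunk):

    #include <assert.h>
    #include <stdint.h>

    #define CACHE_MAX_SIZE 512u /* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

    int main(void)
    {
        /* Worst case the hints permit: size <= MAX, len < size / 2. */
        uint32_t size = CACHE_MAX_SIZE;
        uint32_t len = size / 2 - 1;        /* largest len allowed */
        uint32_t last = len + size / 2 - 1; /* last index written */

        /* last == size - 2 < CACHE_MAX_SIZE, so the dequeue stays in
         * bounds, but only while both hints really hold;
         * __rte_assume() is an optimizer promise, not a run-time
         * check.
         */
        assert(last < CACHE_MAX_SIZE);
        return 0;
    }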
> diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
> index 283bd58d5d..eeb0980d40 100644
> --- a/drivers/net/intel/common/tx.h
> +++ b/drivers/net/intel/common/tx.h
> @@ -284,43 +284,13 @@ ci_tx_free_bufs_vec(struct ci_tx_queue *txq, ci_desc_done_fn desc_done, bool ctx
> txq->fast_free_mp :
> (txq->fast_free_mp = txep[0].mbuf->pool);
>
> - if (mp != NULL && (n & 31) == 0) {
> - void **cache_objs;
> - struct rte_mempool_cache *cache = rte_mempool_default_cache(mp, rte_lcore_id());
> -
> - if (cache == NULL)
> - goto normal;
> -
> - cache_objs = &cache->objs[cache->len];
> -
> - if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
> - rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
> - goto done;
> - }
> -
> - /* The cache follows the following algorithm
> - * 1. Add the objects to the cache
> - * 2. Anything greater than the cache min value (if it
> - * crosses the cache flush threshold) is flushed to the ring.
> - */
> - /* Add elements back into the cache */
> - uint32_t copied = 0;
> - /* n is multiple of 32 */
> - while (copied < n) {
> - memcpy(&cache_objs[copied], &txep[copied], 32 * sizeof(void *));
> - copied += 32;
> - }
> - cache->len += n;
> -
> - if (cache->len >= cache->flushthresh) {
> - rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> - cache->len - cache->size);
> - cache->len = cache->size;
> - }
> + if (mp != NULL) {
> + static_assert(sizeof(*txep) == sizeof(struct rte_mbuf *),
> + "txep array is not similar to an array of rte_mbuf pointers");
> + rte_mbuf_raw_free_bulk(mp, (void *)txep, n);
Does this code introduce a performance regression? The removed code path
exploited n being a multiple of 32 to copy pointers in unrolled
32-element memcpy chunks. Can the replacement rte_mbuf_raw_free_bulk()
path maintain the same fast-path performance when n is aligned?
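One way to answer this empirically is a cycle-count comparison at the
alignments the old path was tuned for; a rough sketch, assuming an
initialized EAL and a pre-filled array of mbufs ("pool" and "mbufs" are
placeholder names):

    #include <rte_cycles.h>
    #include <rte_mbuf.h>

    /* Time one bulk free of n mbufs through the new path; run with
     * n = 32, 64, 96, ... to mirror the multiples-of-32 cases the
     * removed memcpy loop was written for.
     */
    static uint64_t
    time_raw_free_bulk(struct rte_mempool *pool, struct rte_mbuf **mbufs,
            unsigned int n)
    {
        uint64_t start = rte_rdtsc_precise();

        rte_mbuf_raw_free_bulk(pool, mbufs, n);

        return rte_rdtsc_precise() - start;
    }

Comparing those cycle counts against the same loop over the removed
copy-and-flush code would answer the question directly.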