[RFC 0/3] lib/fastmem: fast small-object allocator

Mattias Rönnblom hofors at lysator.liu.se
Mon May 25 21:43:12 CEST 2026


On 5/25/26 20:36, Stephen Hemminger wrote:
> On Mon, 25 May 2026 12:36:39 +0200
> Mattias Rönnblom <hofors at lysator.liu.se> wrote:
> 
>> This RFC introduces fastmem, a general-purpose small-object allocator
>> for DPDK. It is intended to replace per-type mempools with a single
>> allocator that handles arbitrary sizes, grows on demand, and matches
>> mempool-level performance on the hot path.
>>
>> Motivation
>> ----------
>>
>> DPDK applications commonly maintain many mempools — one per object
>> type (connections, sessions, timers, work items). Each must be sized
>> up front, wastes memory when over-provisioned, and cannot serve
>> objects of a different size. Fastmem eliminates this by accepting
>> arbitrary sizes at runtime, backed by a slab allocator that
>> repurposes memory across size classes as demand shifts.
>>
>> Design
>> ------
>>
>> Three-layer architecture:
>>
>> 1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
>>     reserved lazily (or pre-reserved for deterministic latency).
>>
>> 2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
>>     The alignment enables O(1) slab lookup from any object pointer
>>     via bitmask — no radix tree or index structure. Slabs move
>>     freely between 18 power-of-2 size classes (8 B to 1 MiB).
>>
>> 3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
>>     path). Cache misses trigger bulk transfers to/from the shared
>>     bin under a spinlock.
>>
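
(For readers unfamiliar with the trick in item 2, a minimal sketch of the
bitmask lookup follows. The 2 MiB size and the slab_of() name come from the
text above and the review below; the header layout and the macro names are
illustrative assumptions, not the fastmem code.)

```c
#include <assert.h>
#include <stdint.h>

/* Slab size and alignment as stated in the cover letter: 2 MiB. */
#define SLAB_SIZE ((uintptr_t)2 << 20)
#define SLAB_MASK (~(SLAB_SIZE - 1))

/* The mask trick only works for power-of-2 slab sizes. */
static_assert((SLAB_SIZE & (SLAB_SIZE - 1)) == 0,
	      "slab size must be a power of two");

/* Hypothetical slab header, assumed to sit at the start of each slab. */
struct slab_hdr {
	uint32_t size_class;
	uint32_t free_count;
};

/* Map any object pointer to the header of the slab that contains it. */
static inline struct slab_hdr *
slab_of(const void *obj)
{
	return (struct slab_hdr *)((uintptr_t)obj & SLAB_MASK);
}
```

Because every slab starts on a 2 MiB boundary, clearing the low 21 bits of
an object address gives the slab base without any lookup structure.
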
>> Key properties:
>>
>> - Zero per-object metadata in the production build.
>> - NUMA-aware, with per-socket bins and free-slab pools.
>> - DMA-usable memory with O(1) virt-to-IOVA translation.
>> - Bulk alloc/free with all-or-nothing semantics.
>> - Backing memory never returned during lifetime (slabs recycled).
>> - Non-EAL threads supported (bypass cache, take bin lock).
>>
>> API surface
>> -----------
>>
>>    rte_fastmem_init / deinit
>>    rte_fastmem_reserve
>>    rte_fastmem_set_limit / get_limit
>>    rte_fastmem_alloc / alloc_socket
>>    rte_fastmem_alloc_bulk / alloc_bulk_socket
>>    rte_fastmem_free / free_bulk
>>    rte_fastmem_virt2iova
>>    rte_fastmem_cache_flush
>>    rte_fastmem_max_size / classes
>>    rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
>>    rte_fastmem_stats_reset
>>
>> All APIs are marked __rte_experimental.
>>
>> Performance
>> -----------
>>
>> The single-object hot path is roughly 2-3x the cost of mempool
>> and an order of magnitude faster than rte_malloc. Under
>> multi-lcore contention, fastmem scales similarly to mempool,
>> while rte_malloc collapses.
>>
>> Limitations
>> -----------
>>
>> - Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
>> - Power-of-2 classes only; worst-case internal fragmentation ~50%.
>> - Backing memory not reclaimable short of deinit.
>>
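
(To make the class arithmetic concrete: 8 B to 1 MiB in powers of two is
2^3 through 2^20, i.e. 18 classes. The sketch below shows one way to map a
request size to a class; the function and macro names are hypothetical and
the loop is written for clarity, not speed.)

```c
#include <assert.h>
#include <stddef.h>

#define MIN_SHIFT 3	/* 8 B, smallest class per the cover letter */
#define MAX_SHIFT 20	/* 1 MiB, largest class per the cover letter */
#define NUM_CLASSES (MAX_SHIFT - MIN_SHIFT + 1)

static_assert(NUM_CLASSES == 18, "cover letter states 18 size classes");

/* Return the class index for a request, or -1 if it exceeds 1 MiB. */
static int
size_to_class(size_t size)
{
	size_t class_size = (size_t)1 << MIN_SHIFT;
	int idx = 0;

	if (size > ((size_t)1 << MAX_SHIFT))
		return -1;
	/* Round up to the next power of two, counting doublings. */
	while (class_size < size) {
		class_size <<= 1;
		idx++;
	}
	return idx;
}

/* Object size served by a given class index. */
static size_t
class_to_size(int idx)
{
	return (size_t)1 << (MIN_SHIFT + idx);
}
```

A 513-byte request lands in the 1024-byte class, which is where the ~50%
worst-case figure comes from: a request of 2^k + 1 bytes occupies 2^(k+1).
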
>> Future work
>> -----------
>>
>> - Lcore-affine allocations (false-sharing-free by construction).
>> - Mempool ops driver for transparent drop-in use.
>> - Pre-resolved allocator handle binding size class and socket,
>>    eliminating per-call class lookup and enabling an inline
>>    cache-hit fast path.
>> - Debug mode (cookies, double-free detection, poison-on-free).
>> - Telemetry integration.
>> - EAL integration, allowing EAL-internal subsystems to use
>>    fastmem for their small-object allocations.
>>
>> Mattias Rönnblom (3):
>>    doc: add fastmem programming guide
>>    lib: add fastmem library
>>    app/test: add fastmem test suite
>>
>>   app/test/meson.build                  |    3 +
>>   app/test/test_fastmem.c               | 1682 +++++++++++++++++++++++++
>>   app/test/test_fastmem_perf.c          |  997 +++++++++++++++
>>   app/test/test_fastmem_profile.c       |  157 +++
>>   doc/api/doxy-api-index.md             |    1 +
>>   doc/api/doxy-api.conf.in              |    1 +
>>   doc/guides/prog_guide/fastmem_lib.rst |  301 +++++
>>   doc/guides/prog_guide/index.rst       |    1 +
>>   lib/fastmem/meson.build               |    6 +
>>   lib/fastmem/rte_fastmem.c             | 1486 ++++++++++++++++++++++
>>   lib/fastmem/rte_fastmem.h             |  644 ++++++++++
>>   lib/meson.build                       |    1 +
>>   12 files changed, 5280 insertions(+)
>>   create mode 100644 app/test/test_fastmem.c
>>   create mode 100644 app/test/test_fastmem_perf.c
>>   create mode 100644 app/test/test_fastmem_profile.c
>>   create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
>>   create mode 100644 lib/fastmem/meson.build
>>   create mode 100644 lib/fastmem/rte_fastmem.c
>>   create mode 100644 lib/fastmem/rte_fastmem.h
>>
> 
> Largish patchset, so I did an AI review with the full Claude model.
> 
> Series review: [RFC 0/3] add fastmem allocator
> Reviewed against the v1 RFC posted 2026-05-25.
> 
> 
> [RFC 1/3] doc: add fastmem programming guide
> 
> Info: doc/guides/prog_guide/fastmem_lib.rst -- "\ No newline at end of file"
>     The new RST file does not end with a newline.
> 
> 
> [RFC 2/3] lib: add fastmem library
> 
> Error: lib/fastmem/rte_fastmem.c -- use-after-free during rte_fastmem_deinit()
>     when caches were allocated cross-socket.
> 
>     cache_create() places the cache struct on the *calling thread's* socket,
>     not on the socket the cache serves:
> 
>         unsigned int own_socket = rte_socket_id();
>         ...
>         alloc_socket = &fastmem->sockets[own_socket];
>         cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
>         ...
>         *slot = cache;          /* slot is in socket K's caches[][] */
> 
>     So an lcore on socket S that calls rte_fastmem_alloc_socket(..., K) with
>     S != K creates a cache whose memory lives in socket S's memzone but is
>     reachable through socket K's caches[lcore][class].
> 
>     rte_fastmem_deinit() then walks sockets in index order:
> 
>         for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
>                 release_socket(&fastmem->sockets[i]);
> 
>     and release_socket() does, in this order:
> 
>         socket_release_caches(socket);            /* (1) */
>         for (c...) bin_release(&socket->bins[c], socket);  /* (2) */
>         for (i...) rte_memzone_free(socket->memzones[i]);  /* (3) */
> 
>     When i = S, step (3) frees socket S's memzones. When i = K (K > S),
>     socket_release_caches(K) runs:
> 
>         cache_slab = slab_of(cache);             /* in socket S's freed mz */
>         bin_free_one(cache_slab->bin, cache);    /* reads cache_slab->bin */
> 
>     cache_slab points into a freed memzone, so cache_slab->bin and the
>     subsequent push (slab->free_head = obj; slab->free_count++; in
>     bin_push_locked()) read and write released memory. slab_release() may
>     then re-attach the slab to socket S's free_head, which was zeroed and
>     whose backing is gone.
> 
>     This is triggered by any application that allocates from a non-local
>     socket via SOCKET_ID_ANY fallback or explicit socket_id, which the
>     programming guide describes as a normal mode of operation. The
>     existing test_alloc_socket and test_alloc_socket_numa_placement use
>     rte_socket_id_by_idx(0) (the local socket) so the bug is not
>     exercised by the test suite.
> 
>     Either order the teardown in three phases (all caches across all
>     sockets first, then all bins, then all memzones), or allocate the
>     cache struct from the socket it serves rather than the calling
>     thread's socket.
> 
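
(A toy model of the ordering problem, for illustration only. The structs
and function names below are hypothetical stand-ins, not the fastmem ones,
and the bin-release phase is left out for brevity. It only tracks whether a
cache struct is released while the memzone backing it is still live.)

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SOCKETS 2	/* toy value; the real loop bound is RTE_MAX_NUMA_NODES */

struct toy_socket {
	int cache_home;		/* socket whose memzone backs our cache struct */
	bool has_cache;
	bool memzone_live;
};

/* Release a socket's cache; false means it touched a freed memzone. */
static bool
release_caches(struct toy_socket *s, const struct toy_socket *all)
{
	assert(s->cache_home >= 0 && s->cache_home < NUM_SOCKETS);
	bool ok = !s->has_cache || all[s->cache_home].memzone_live;

	s->has_cache = false;
	return ok;
}

/* Per-socket teardown: caches and memzones released socket by socket. */
static bool
deinit_per_socket(struct toy_socket *all)
{
	bool ok = true;

	for (int i = 0; i < NUM_SOCKETS; i++) {
		ok = release_caches(&all[i], all) && ok;
		all[i].memzone_live = false;
	}
	return ok;
}

/* Phased teardown: every cache first, then every memzone. */
static bool
deinit_phased(struct toy_socket *all)
{
	bool ok = true;

	for (int i = 0; i < NUM_SOCKETS; i++)
		ok = release_caches(&all[i], all) && ok;
	for (int i = 0; i < NUM_SOCKETS; i++)
		all[i].memzone_live = false;
	return ok;
}
```

With socket 1's cache backed by socket 0's memzone, the per-socket order
reports a release after free, while the phased order does not.
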
> Warning: lib/fastmem/rte_fastmem.c -- non-atomic access to shared 64-bit
>     statistics counters.
> 
>     cache->alloc_cache_hits, alloc_cache_misses, alloc_nomem,
>     free_cache_hits, free_cache_misses, and the bin counters
>     slab_acquires, slab_releases, slabs_partial, slabs_full are
>     incremented as plain C reads/writes by the owning lcore and read
>     from another thread via rte_fastmem_stats(), rte_fastmem_stats_class(),
>     rte_fastmem_stats_lcore(), and rte_fastmem_stats_lcore_class(). On
>     architectures where uint64_t is not naturally atomic (and per the C
>     standard generally) this is a data race, and thus undefined behavior;
>     even on x86-64 it is flagged by -fsanitize=thread.
> 
>     Use rte_atomic_fetch_add_explicit() with rte_memory_order_relaxed on
>     the producer side and rte_atomic_load_explicit() with relaxed
>     ordering on the reader side. Per AGENTS.md / the DPDK convention,
>     relaxed ordering is appropriate for these counters.
> 
> Warning: lib/fastmem/rte_fastmem.c -- pointer publish in cache_create()
>     without release ordering.
> 
>         *slot = cache;
>         return cache;
> 
>     The struct fields (count, capacity, target, the stats counters) are
>     written before this store but with no fence or release barrier. A
>     concurrent stats reader doing socket->caches[l][c] followed by
>     cache->* could observe the pointer but not all initialized fields.
>     Even ignoring the stats reader, rte_fastmem_cache_flush() invoked
>     from a different lcore on the same cache (not currently possible by
>     API contract, but the field is technically reachable) would race.
>     Pair with rte_atomic_store_explicit(..., rte_memory_order_release)
>     and a matching acquire load on the reader path.
> 
> Warning: lib/fastmem/rte_fastmem.c -- spurious ENOMEM window during slab
>     release.
> 
>     bin_push_locked() removes a fully-drained slab from bin->partial
>     before bin_free_one() drops the bin lock; slab_release() then puts
>     it on socket->free_head under the socket lock. Between the unlock
>     and slab_release(), another lcore allocating in any class on the
>     same socket can see free_head == NULL, hit the memory_limit (or
>     FASTMEM_MAX_MEMZONES_PER_SOCKET) check in grow_socket(), and return
>     ENOMEM even though the slab is about to become available. Not a
>     correctness issue but visible to applications that pin tightly to
>     their limit.
> 
> Info: lib/fastmem/rte_fastmem.c -- local_socket_id() final fallback:
> 
>         return (unsigned int)rte_socket_id_by_idx(0);
> 
>     rte_socket_id_by_idx() returns int and is documented to return -1 on
>     error. If there are zero configured sockets the cast yields UINT_MAX
>     and fastmem->sockets[UINT_MAX] is out of bounds. Realistically there
>     is always at least one socket, but a defensive check (return 0, or
>     fail allocation explicitly) would avoid the corner case.
> 
> Info: lib/fastmem/rte_fastmem.c -- cache_pop() refills to cache->target
>     (half capacity) rather than to capacity. Subsequent single-object
>     allocs only get target-1 hits before the next bin trip. Likely
>     intentional for fairness with bulk callers, but worth a comment.
> 
> Info: lib/meson.build inserts 'fastmem' between 'dispatcher' and
>     'gpudev'. The natural alphabetical position is between 'efd' and
>     'fib'; fastmem has no dependency on dispatcher.
> 
> 
> [RFC 3/3] app/test: add fastmem test suite
> 
> Warning: app/test/test_fastmem.c -- REGISTER_FAST_TEST uses NOHUGE_OK
>     but the functional tests need real memzone-backed memory.
> 
>         REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_OK, ASAN_OK,
>                            test_fastmem);
> 
>     test_fastmem runs both the lifecycle suite (no allocations) and the
>     functional suite, which requests 128 MiB IOVA-contiguous memzones.
>     In --no-huge mode IOVA-contiguous reservation of that size is not
>     reliable, so NOHUGE_SKIP is more honest. If you want the lifecycle
>     tests to remain no-huge-friendly, register them as a separate
>     test command.
> 
> Warning: app/test/test_fastmem.c -- the suite never exercises
>     cross-socket cache allocation.
> 
>     test_alloc_socket and test_alloc_socket_numa_placement both use
>     rte_socket_id_by_idx(0) (the local socket). Add a test that runs on
>     a worker lcore whose rte_socket_id() differs from the target
>     socket_id passed to rte_fastmem_alloc_socket(), then calls
>     rte_fastmem_deinit(). This would have caught the deinit UAF above.
> 
> Info: app/test/test_fastmem.c -- several test functions declare an
>     uninitialized `int rc;` that is never read or written (e.g.
>     test_alloc_too_big, test_alloc_invalid_align, test_alloc_free_small,
>     test_alloc_alignment, test_alloc_socket, test_alloc_block_repurposing
>     and others). Drop the declarations.
> 
> Info: app/test/test_fastmem.c -- trailing blank-line clusters (two blank
>     lines before "return TEST_SUCCESS;" in test_reserve_multiple_memzones,
>     test_reserve_cumulative, test_reserve_invalid_socket,
>     test_reserve_any_socket, test_alloc_too_big, ...). Drop the extra
>     blank line.

Thanks. I've addressed the above issues, and the fixes will be part of
RFC v2, with the following exceptions:

#2 - Non-atomic stats counters

     These are diagnostic counters that happen to be read cross-thread.
     On 64-bit DPDK-supported architectures, aligned uint64_t loads and
     stores are atomic in practice; on a 32-bit target (e.g., 32-bit
     x86) a torn read at worst yields a transiently wrong counter value.
     Not worth the ceremony.
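
     For reference, the change being declined would look roughly like
     the sketch below. It uses C11 <stdatomic.h> rather than the
     rte_atomic_*() wrappers so that it compiles standalone; the struct
     and helper names are made up for illustration, and only one of the
     counters is shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter block; the field name mirrors the review text. */
struct cache_stats {
	_Atomic uint64_t alloc_cache_hits;
};

/* Owning lcore: relaxed increment, no ordering with other memory implied. */
static inline void
stats_hit(struct cache_stats *s)
{
	atomic_fetch_add_explicit(&s->alloc_cache_hits, 1,
				  memory_order_relaxed);
}

/* Any thread: relaxed load, never torn, but possibly slightly stale. */
static inline uint64_t
stats_read_hits(struct cache_stats *s)
{
	return atomic_load_explicit(&s->alloc_cache_hits,
				    memory_order_relaxed);
}
```

     Since each counter has a single writer, a relaxed load followed by
     a relaxed store of the incremented value would also do, and avoids
     the atomic read-modify-write on the hot path.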

#3 - Pointer publish without release ordering

     On weakly-ordered architectures a stats reader could briefly see
     uninitialized counter values for a newly-created cache. Acceptable
     for diagnostic data.
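
     For reference, the release/acquire pairing the review asks for
     would look roughly like this. Again a standalone C11 sketch, not
     the fastmem code: toy_cache, the single global slot, and the
     helper names are hypothetical stand-ins for the cache struct and
     socket->caches[lcore][class].

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct toy_cache {
	uint32_t count;
	uint32_t capacity;
};

/* Hypothetical slot, standing in for socket->caches[lcore][class]. */
static _Atomic(struct toy_cache *) slot;

/* Creator: initialize the fields, then publish the pointer. */
static void
publish_cache(struct toy_cache *cache, uint32_t capacity)
{
	cache->count = 0;
	cache->capacity = capacity;
	/* Release: the writes above are visible to any thread whose
	 * acquire load observes this pointer. */
	atomic_store_explicit(&slot, cache, memory_order_release);
}

/* Reader: NULL until published, fully initialized afterwards. */
static struct toy_cache *
lookup_cache(void)
{
	return atomic_load_explicit(&slot, memory_order_acquire);
}
```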

#4 - Spurious ENOMEM window during slab release

     Narrow timing window, not a correctness bug. Closing it would
     require holding the bin lock across slab_release(), reintroducing
     the contention the design avoids.


