[PATCH v7 0/5] Support add/remove memory region & get-max-slots
Bathija, Pravin
Pravin.Bathija at dell.com
Thu Apr 2 10:07:32 CEST 2026
Dear Stephen,
Thank you again for taking the time to review my patch-set. I have incorporated your suggestions and submitted patch-set v8.
My responses to your comments are inline.
> -----Original Message-----
> From: Stephen Hemminger <stephen at networkplumber.org>
> Sent: Monday, March 30, 2026 8:26 PM
> To: Bathija, Pravin <Pravin.Bathija at dell.com>
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com;
> fengchengwen at huawei.com
> Subject: Re: [PATCH v7 0/5] Support add/remove memory region & get-max-
> slots
>
>
> On Wed, 11 Feb 2026 10:24:28 +0000
> <pravin.bathija at dell.com> wrote:
>
> > From: Pravin M Bathija <pravin.bathija at dell.com>
> >
> > This is version v7 of the patchset and it incorporates the
> > recommendations made by Maxime Coquelin and Feng Cheng Wen.
> > The patchset includes support for adding and removal of memory
> > regions, getting max memory slots and other changes to vhost-user
> > messages. These messages are sent from vhost-user front-end (qemu or
> > libblkio) to a vhost-user back-end (dpdk, spdk). Support functions for
> > these message functions have been implemented in the interest of
> > writing optimized code. Older functions, part of vhost-user back-end
> > have also been optimized using these newly defined support functions.
> > This implementation has been extensively tested by doing Read/Write
> > I/O from multiple instances of fio + libblkio (front-end) talking to
> > spdk/dpdk (back-end) based drives. Tested with qemu front-end talking
> > to dpdk + testpmd (back-end) performing add/removal of memory regions.
> > Also tested post-copy live migration after doing add_memory_region.
> >
> > Version Log:
> > Version v7 (Current version): Incorporate code review suggestions from
> > Maxime Coquelin. Add debug messages to vhost_postcopy_register
> > function.
> > Version v6: Added the enablement of this feature as a final patch in
> > this patch-set and other code optimizations as suggested by Maxime
> > Coquelin.
> > Version v5: removed the patch that increased the number of memory
> > regions from 8 to 128. This will be submitted as a separate feature at
> > a later point after incorporating additional optimizations. Also
> > includes code optimizations as suggested by Feng Cheng Wen.
> > Version v4: code optimizations as suggested by Feng Cheng Wen.
> > Version v3: code optimizations as suggested by Maxime Coquelin and
> > Thomas Monjalon.
> > Version v2: code optimizations as suggested by Maxime Coquelin.
> > Version v1: Initial patch set.
> >
> > Pravin M Bathija (5):
> > vhost: add user to mailmap and define to vhost hdr
> > vhost_user: header defines for add/rem mem region
> > vhost_user: support function defines for back-end
> > vhost_user: Function defs for add/rem mem regions
> > vhost_user: enable configure memory slots
> >
> > .mailmap | 1 +
> > lib/vhost/rte_vhost.h | 4 +
> > lib/vhost/vhost_user.c | 392 ++++++++++++++++++++++++++++++++++++-----
> > lib/vhost/vhost_user.h | 10 ++
> > 4 files changed, 365 insertions(+), 42 deletions(-)
> >
>
> Lots more stuff found by AI code review, this is not ready.
>
> Review of [PATCH v7 1-5/5] vhost: configure memory slots support
> Author: Pravin M Bathija <pravin.bathija at dell.com>
>
> This patch series adds support for the
> VHOST_USER_F_CONFIGURE_MEM_SLOTS protocol feature, enabling
> add/remove individual memory regions instead of requiring a full
> SET_MEM_TABLE each time. The approach is sound and the feature is needed,
> but several correctness bugs need to be addressed.
>
> Patch 3/5 -- vhost_user: support function defines for back-end
> --------------------------------------------------------------------
>
> Error: dev_invalidate_vrings loses *pdev update after numa_realloc
>
> translate_ring_addresses() calls numa_realloc(), which can reallocate
> the dev struct and update *pdev. In the original set_mem_table the
> caller writes "*pdev = dev;" after the call to observe this.
>
> dev_invalidate_vrings() takes a plain "struct virtio_net *dev" and
> passes "&dev, &vq" to translate_ring_addresses -- but dev is a local
> copy. If numa_realloc reallocates, the caller's *pdev (in
> vhost_user_add_mem_reg) is never updated, and all subsequent accesses
> use a stale/freed pointer -- a use-after-free.
>
> The function should take struct virtio_net **pdev (matching the
> pattern used in set_mem_table) and propagate updates back.
>
Fixed. Changed dev_invalidate_vrings signature to take struct virtio_net **pdev. The function dereferences *pdev locally,
passes &dev to translate_ring_addresses, and writes *pdev = dev at the end to propagate any reallocation. Both callers
(add_mem_reg and rem_mem_reg) now pass pdev and update their local dev = *pdev afterward.
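The **pdev pattern described above can be sketched as follows. The struct layout and the realloc stand-in are hypothetical simplifications of the DPDK internals (numa_realloc / translate_ring_addresses), shown only to illustrate how the reallocated pointer propagates back:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the DPDK device struct (hypothetical). */
struct virtio_net {
	int numa_node;
};

/* Stand-in for numa_realloc(): may move the device struct and must
 * publish the new address through the double pointer. */
static void
maybe_realloc_dev(struct virtio_net **pdev)
{
	struct virtio_net *new_dev = malloc(sizeof(**pdev));

	memcpy(new_dev, *pdev, sizeof(**pdev));
	free(*pdev);
	*pdev = new_dev;	/* caller's pointer is updated */
}

/* The fixed shape of dev_invalidate_vrings: take **pdev, work on a
 * local copy, and write the (possibly new) pointer back at the end
 * so the caller never holds a stale/freed pointer. */
static void
dev_invalidate_vrings(struct virtio_net **pdev)
{
	struct virtio_net *dev = *pdev;

	maybe_realloc_dev(&dev);	/* translate_ring_addresses() analogue */
	*pdev = dev;			/* propagate reallocation to caller */
}
```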
> Error: async_dma_map_region can underflow reg_size
>
> The loop "reg_size -= page->size" uses unsigned subtraction. If
> page->size does not evenly divide reg->size (which can happen with
> misaligned guest pages), reg_size wraps to a huge value and the loop
> runs out of bounds. The original async_dma_map avoids this by
> iterating over nr_guest_pages directly.
>
> Suggest checking "reg_size >= page->size" before subtracting, or
> better yet iterating like the original does.
>
Fixed. Rewrote async_dma_map_region to iterate all dev->guest_pages entries and select only those whose host_user_addr falls
within [reg->host_user_addr, reg->host_user_addr + reg->size). This removes both the contiguous-index assumption and the
reg_size subtraction entirely, so there is nothing left to underflow.
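The filtering loop can be sketched like this; the struct fields are a hypothetical subset of DPDK's guest_page and rte_vhost_mem_region, kept only to show the range test that replaces the reg_size subtraction:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins (hypothetical field subset). */
struct guest_page {
	uint64_t host_user_addr;
	uint64_t size;
};

struct mem_region {
	uint64_t host_user_addr;
	uint64_t size;
};

/* Select the guest pages belonging to one region by host VA range,
 * instead of walking a contiguous index window and subtracting page
 * sizes from reg_size (which could underflow). Returns the number of
 * matches; out[] receives their indices. */
static int
pages_in_region(const struct guest_page *pages, int nr_pages,
		const struct mem_region *reg, int *out)
{
	int n = 0;

	for (int i = 0; i < nr_pages; i++) {
		uint64_t start = pages[i].host_user_addr;

		if (start >= reg->host_user_addr &&
		    start < reg->host_user_addr + reg->size)
			out[n++] = i;
	}
	return n;
}
```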
> Warning: free_all_mem_regions iterates all
> VHOST_MEMORY_MAX_NREGIONS
> but the original free_mem_region iterated only dev->mem->nregions
>
> This is intentional for the sparse-slot design, but free_mem_region()
> checks "reg->host_user_addr" while free_all_mem_regions checks
> "reg->mmap_addr" as the sentinel. These should be consistent.
> A region that was partially initialized (mmap_addr set but
> host_user_addr not yet set, or vice versa) would behave
> differently depending on which sentinel is checked.
>
Fixed. Changed free_mem_region to check reg->mmap_addr instead of reg->host_user_addr, making both functions consistent.
> Warning: vhost_user_initialize_memory unconditionally allocates
> dev->mem without freeing any prior allocation
>
> If dev->mem is non-NULL when vhost_user_initialize_memory is called,
> the old allocation is leaked. set_mem_table frees dev->mem before
> calling it, but add_mem_reg only checks "dev->mem == NULL" and skips
> the call otherwise. This is safe as coded, but the function itself
> is fragile -- consider adding an assertion or early return if
> dev->mem is already set.
>
Fixed. Added a guard at the top of vhost_user_initialize_memory: if dev->mem != NULL, log an error and return -1. This makes
the function safe against accidental double-initialization.
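The guard amounts to a refuse-if-set check at the top of the function; a minimal sketch, with hypothetical simplified structs standing in for the DPDK ones:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins (hypothetical). */
struct memory { unsigned int nregions; };
struct virtio_net { struct memory *mem; };

/* Refuse to allocate over an existing dev->mem instead of leaking
 * it: a second initialization call is a caller bug and returns -1. */
static int
initialize_memory(struct virtio_net *dev)
{
	if (dev->mem != NULL)
		return -1;	/* already initialized */

	dev->mem = calloc(1, sizeof(*dev->mem));
	return dev->mem != NULL ? 0 : -1;
}
```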
> Info: vhost_user_postcopy_register changes use reg_msg_index but the
> second loop (postcopy_region_register) still iterates dev->mem->regions
> with nregions as the bound and just skips zeros. This is correct
> but the two loops use different iteration strategies which is
> confusing. Consider making both iterate VHOST_MEMORY_MAX_NREGIONS
> with a skip-zero check for consistency.
Fixed differently. vhost_user_add_mem_reg no longer calls vhost_user_postcopy_register() at all. That function reads
ctx->msg.payload.memory (the SET_MEM_TABLE union member), but ADD_MEM_REG uses the memory_single payload -- reading the wrong
union member would yield garbage for nregions. Instead, add_mem_reg now calls vhost_user_postcopy_region_register(dev, reg)
directly for the single new region, bypassing the message-based iteration entirely.
>
>
> Patch 4/5 -- vhost_user: Function defs for add/rem mem regions
> --------------------------------------------------------------------
>
> Error: vhost_user_add_mem_reg does not set ctx->fds[0] = -1 after
> assigning reg->fd = ctx->fds[0]
>
> In the original set_mem_table, after "reg->fd = ctx->fds[i]" the
> code sets "ctx->fds[i] = -1" to prevent the fd from being double-
> closed if the error path calls close_msg_fds(). The new add_mem_reg
> omits this. If vhost_user_mmap_region or vhost_user_postcopy_register
> fails, the error path calls close_msg_fds(ctx) which closes
> ctx->fds[0] -- but reg->fd still holds the same value. Later,
> free_all_mem_regions will close(reg->fd) again -- double close.
>
> Fix: add "ctx->fds[0] = -1;" after "reg->fd = ctx->fds[0];".
>
Fixed. Added ctx->fds[0] = -1; immediately after reg->fd = ctx->fds[0]; to prevent close_msg_fds from double-closing the fd on
error paths.
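The ownership transfer is the classic move-and-invalidate idiom; a minimal sketch with hypothetical simplified structs (the real ctx is DPDK's vhu_msg_context):

```c
#include <assert.h>

/* Simplified stand-ins (hypothetical). */
struct msg_context { int fds[8]; int fd_num; };
struct mem_region_fd { int fd; };

/* Move fd ownership from the message context to the region. Marking
 * the slot -1 makes a later close_msg_fds() on the error path skip
 * it, so the fd is closed exactly once, by the region teardown. */
static void
take_region_fd(struct mem_region_fd *reg, struct msg_context *ctx)
{
	reg->fd = ctx->fds[0];
	ctx->fds[0] = -1;	/* ownership moved: prevent double close */
}
```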
> Error: vhost_user_add_mem_reg error path frees all memory on
> single-region failure
>
> If postcopy_register fails after successfully mmap'ing one region
> (when other regions already exist), the "free_mem_table" label
> calls free_all_mem_regions + rte_free(dev->mem) + rte_free
> (dev->guest_pages), destroying ALL previously registered regions.
> This is disproportionate -- the error should only unmap the single
> region that was just added and decrement nregions.
>
Fixed. Replaced the free_mem_table label with free_new_region, which only cleans up the single failed region: DMA unmaps the
region, removes its guest page entries, calls free_mem_region on the single region, and decrements nregions. Existing regions
are untouched.
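The scoped-undo idea can be sketched as follows; the structs are hypothetical stand-ins, and the real free_new_region path also DMA-unmaps the region and drops its guest-page entries:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins (hypothetical). */
struct region { unsigned long mmap_addr; unsigned long size; };
struct memory { unsigned int nregions; struct region regions[8]; };

/* On failure while adding region slot `i`, undo only that slot and
 * decrement nregions; previously registered regions stay mapped. */
static void
undo_failed_add(struct memory *mem, unsigned int i)
{
	memset(&mem->regions[i], 0, sizeof(mem->regions[i]));
	mem->nregions--;
}
```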
> Error: overlap check uses mmap_size for existing region but
> memory_size for proposed region
>
> The overlap check compares:
> current_region_guest_end = guest_user_addr + mmap_size - 1
> proposed_region_guest_end = userspace_addr + memory_size - 1
>
> mmap_size = memory_size + mmap_offset (set by
> vhost_user_mmap_region).
> So the existing region's range is inflated by the mmap_offset,
> potentially producing false overlaps. Use current_region->size
> (which corresponds to memory_size) instead of mmap_size for a
> correct comparison.
>
Fixed. Changed the overlap check to use current_region->size (which corresponds to memory_size from the protocol) instead of
current_region->mmap_size, in both the range calculation and the error log message.
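The corrected comparison is a plain interval-overlap test using memory_size on both sides. A sketch using half-open intervals (the in-tree code uses inclusive `end - 1` bounds, which is equivalent; function and parameter names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open interval overlap on guest user addresses. Both sides use
 * the region's memory size, never mmap_size: mmap_size is inflated by
 * the mmap offset and would produce false overlaps. */
static bool
regions_overlap(uint64_t cur_start, uint64_t cur_size,
		uint64_t new_start, uint64_t new_size)
{
	uint64_t cur_end = cur_start + cur_size;	/* exclusive */
	uint64_t new_end = new_start + new_size;	/* exclusive */

	return cur_start < new_end && new_start < cur_end;
}
```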
> Error: vhost_user_add_mem_reg uses guest_user_addr == 0 as "slot
> is empty" but guest_user_addr 0 is a valid guest virtual address
>
> The free-slot search does:
> if (dev->mem->regions[i].guest_user_addr == 0)
>
> Guest address 0 is valid (common in some VM configurations).
> If a region is mapped at guest VA 0, it would appear as a free
> slot. The original code uses compact packing (nregions as the
> count), avoiding this problem. Consider using a dedicated
> "in_use" flag or checking mmap_addr == NULL instead (mmap
> never returns NULL on success, it returns MAP_FAILED).
>
Fixed. The regions array is now kept compact (contiguous from index 0 to nregions-1) via memmove compaction in rem_mem_reg.
New regions are always placed at regions[dev->mem->nregions] -- no free-slot search is needed, eliminating the guest_user_addr
== 0 sentinel entirely.
> Warning: vhost_user_rem_mem_reg matches on guest_phys_addr but the
> spec says "identified by guest address, user address and size"
>
> The comment in the code says "identified by its guest address,
> user address and size. The mmap offset is ignored." but the
> comparison also checks guest_phys_addr (which is the GPA, not
> the guest user address). The spec says matching is by
> userspace_addr (GVA), guest_phys_addr (GPA), and memory_size.
> So the match is actually correct in practice, but the comment
> is misleading -- it should mention GPA too, or remove the
> comment about what the spec says.
>
Fixed. Updated the comment to: "The memory region to be removed is identified by its GPA, user address and size."
> Warning: nregions becomes inconsistent with actual slot usage
>
> After add_mem_reg, nregions is incremented. After rem_mem_reg,
> nregions is decremented. But regions are not packed -- removing
> a region from the middle leaves a hole. Functions like
> qva_to_vva, hva_to_gpa, and rte_vhost_va_from_guest_pa all
> iterate "i < mem->nregions" and index "mem->regions[i]"
> sequentially. With sparse slots these functions will miss
> regions that are stored at indices >= nregions.
>
> Example: 3 regions at slots 0,1,2 (nregions=3). Remove slot 1
> (memset to 0, nregions=2). Now qva_to_vva iterates slots 0,1
> only -- slot 2 (still valid) is never checked. Address
> translations for memory in the third region will fail.
>
> This is a correctness bug that will cause packet processing
> failures. Either pack the array on removal (shift later entries
> down) or change all iteration to use VHOST_MEMORY_MAX_NREGIONS
> with skip-zero checks.
>
Fixed. rem_mem_reg now compacts the regions array after removal, using memmove to shift later entries down and then zeroing
the trailing slot. This keeps the array contiguous from index 0 to nregions-1, so all existing iteration using i < nregions
continues to work correctly.
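The compaction step can be sketched like this, with hypothetical simplified structs in place of rte_vhost_memory:

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins (hypothetical). */
struct region { unsigned long guest_phys_addr; unsigned long size; };
struct memory { unsigned int nregions; struct region regions[8]; };

/* Remove regions[idx] and keep the array compact: shift the later
 * entries down with memmove and zero the vacated tail slot, so every
 * existing "i < nregions" loop still sees all live regions. */
static void
remove_region_compact(struct memory *mem, unsigned int idx)
{
	unsigned int tail = mem->nregions - idx - 1;

	memmove(&mem->regions[idx], &mem->regions[idx + 1],
		tail * sizeof(mem->regions[0]));
	mem->nregions--;
	memset(&mem->regions[mem->nregions], 0, sizeof(mem->regions[0]));
}
```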
> Warning: dev_invalidate_vrings does not update *pdev in add_mem_reg
>
> Related to the Error in patch 3 -- even if dev_invalidate_vrings
> is fixed to take **pdev, the caller vhost_user_add_mem_reg must
> also write "*pdev = dev;" to propagate the possibly-reallocated
> pointer back through the message handler framework.
>
Fixed, together with the related error in patch 3: dev_invalidate_vrings now takes struct virtio_net **pdev and writes
*pdev = dev before returning, and both callers (add_mem_reg and rem_mem_reg) refresh their local dev = *pdev afterward, so
the possibly-reallocated pointer propagates back through the message handler framework.
>
> Patch 5/5 -- vhost_user: enable configure memory slots
> --------------------------------------------------------------------
>
> Warning: the feature is enabled unconditionally but several
> address translation functions (qva_to_vva, hva_to_gpa,
> rte_vhost_va_from_guest_pa) have not been updated to handle
> sparse region arrays. Enabling the feature flag before fixing
> the nregions iteration issue means any front-end that uses
> ADD_MEM_REG/REM_MEM_REG will hit broken address translations.
>
Resolved by the compaction fix in patch 4. Since the regions array is always kept contiguous, these functions work correctly
without modification. No changes needed in patch 5.
>
> Summary
> --------------------------------------------------------------------
>
> The most critical issues to fix before this can be merged:
>
> 1. nregions vs sparse-slot iteration mismatch -- this will cause
> address translation failures at runtime after any REM_MEM_REG.
> All loops that iterate mem->regions must either use compact
> packing or be updated to scan all VHOST_MEMORY_MAX_NREGIONS
> slots with skip-zero checks.
>
> 2. dev_invalidate_vrings / translate_ring_addresses *pdev propagation
> -- use-after-free if numa_realloc fires.
>
> 3. Missing ctx->fds[0] = -1 in add_mem_reg -- double-close on
> error paths.
>
> 4. Overlap check using mmap_size instead of size -- false positive
> overlaps.
>
> 5. Error path in add_mem_reg destroys all regions instead of just
> the failed one.
Additional fix found during testing:

VHOST_USER_REM_MEM_REG was registered with accepts_fd = false, but the vhost-user protocol sends an FD with REM_MEM_REG
messages, so validate_msg_fds rejected every REM_MEM_REG message with "expect 0 FDs, received 1". Fixed by changing the
registration to accepts_fd = true and adding close_msg_fds(ctx) calls in all exit paths of rem_mem_reg so the received FD is
properly closed.