[PATCH v2 2/8] net/netvsc: fix race conditions on VF add/remove events
Stephen Hemminger
stephen at networkplumber.org
Mon Feb 23 18:43:58 CET 2026
On Fri, 20 Feb 2026 18:45:21 -0800
longli at linux.microsoft.com wrote:
> From: Long Li <longli at microsoft.com>
>
> Netvsc gets notification from VSP on VF add/remove over VMBUS, but the
> timing may not match the DPDK sequence of device events triggered from
> uevents from kernel.
>
> Remove the retry logic from the code when attach to VF and rely on DPDK
> event to attach to VF. With this change, both the notifications from VSP
> and the DPDK will attempt a VF attach.
>
> Also implement locking when checking on all VF related fields.
>
> Fixes: a2a23a794b3a ("net/netvsc: support VF device hot add/remove")
> Cc: stable at dpdk.org
>
> Signed-off-by: Long Li <longli at microsoft.com>
AI review spotted related issue.
**Patch 2 (net/netvsc: fix race conditions on VF add/remove events)** — the most complex patch in the series.
**What it fixes correctly:**
The old Tx/Rx paths had a TOCTOU race: they checked `vf_vsc_switched` without the lock, acquired the lock, then re-checked. A VF remove could complete between the first check and the lock acquisition. The new code takes the read lock *before* any VF state checks — correct fix. The lock is properly released on both paths.
The upgrade of `hn_vf_close()` from read lock to write lock is also a real bug fix, since it modifies `vf_attached` and calls `rte_eth_dev_close()`.
Moving callback registration into `hn_vf_attach()` with proper rollback (via the new `hn_vf_detach()` helper) is a good structural improvement that ties callback lifetime to VF attach/detach lifecycle.
The unconditional clear of `vf_vsc_switched` in `hn_vf_remove_unlocked()` is correct — if the VF is being removed, the switched flag must be cleared regardless of whether `hn_nvs_set_datapath(SYNTHETIC)` succeeded.
**One potential concern (~60% confidence):**
If the VF is successfully configured and started but `hn_nvs_set_datapath(VF)` fails at the `switch_data_path:` label, the function returns an error but leaves the VF started and attached. The callers don't clean this up. This is a pre-existing design issue the patch doesn't worsen, and the hypervisor may retry, but it could confuse subsequent add/remove cycles.
More information about the stable
mailing list