[PATCH 1/7] net/netvsc: retry VF hotplug indefinitely until PCI device disappears
Stephen Hemminger
stephen at networkplumber.org
Wed May 6 00:03:22 CEST 2026
On Tue, 5 May 2026 13:44:50 -0700
Long Li <longli at microsoft.com> wrote:
> After PCI rescan on Azure, the MANA kernel driver can take over 100
> seconds to probe and create the /sys/bus/pci/devices/<dev>/net directory.
> The previous fixed retry limit (NETVSC_MAX_HOTADD_RETRY=10, ~12 seconds)
> was insufficient, causing VF re-attach to fail with 'Failed to parse PCI
> device' on systems with slow MANA driver initialization.
>
> Replace the fixed retry limit with an indefinite retry that only gives up
> when the PCI device itself disappears from sysfs. This is safe because:
>
> - The retry uses rte_eal_alarm callbacks which are serialized on the EAL
> interrupt thread, preventing races with VF remove or device close paths.
> - Device close (eth_hn_dev_uninit) explicitly cancels all pending hotplug
> alarms via rte_eal_alarm_cancel and frees the context.
> - If the PCI device is removed while retrying, access() detects the
> missing sysfs path and stops immediately.
>
> A periodic NOTICE log every 30 retries (~30s) provides visibility into
> long waits without flooding the log at DEBUG level.
>
> Fixes: a2a23a794b3a ("net/netvsc: support VF device hot add/remove")
> Cc: stable at dpdk.org
> Signed-off-by: Long Li <longli at microsoft.com>
> ---
Since this series touches lots of stuff, decided to do a more through than
normal AI review
Review of patch series: net/netvsc VF hotplug improvements (7 patches)
Series targets stale interface names, slow MANA driver probe, multi-VF
PCI devices, and adds service-reset event handling. Reviewed against
upstream main as of clone date.
=========================================================================
[PATCH 3/7] net/netvsc: retry full probe when IB device not ready during
hotplug
=========================================================================
Warning:
drivers/net/netvsc/hn_ethdev.c (around the new ENODEV/EEXIST handling)
The original code logged failures from rte_eal_hotplug_add at
PMD_DRV_LOG(ERR). The patch lowers all non-ENODEV, non-EEXIST returns
to PMD_DRV_LOG(NOTICE):
if (ret && ret != -EEXIST)
PMD_DRV_LOG(NOTICE,
"Failed to add PCI device %s (ret=%d)",
d->name, ret);
-ENOMEM, -EBUSY, -EIO and similar errors from rte_eal_hotplug_add are
legitimate failures that operators need to see. Recommend keeping ERR
for unrecognized error codes; only -ENODEV and -EEXIST are now expected
outcomes worth a quieter log level.
=========================================================================
[PATCH 4/7] net/netvsc: add NOTICE-level debug logging for VF hotplug
retry
=========================================================================
Warning:
drivers/net/netvsc/hn_ethdev.c, multiple sites in netvsc_hotplug_retry
The new log calls inside the readdir() loop fire at NOTICE for every
directory entry on every retry pass:
PMD_DRV_LOG(NOTICE, "%s: checking interface %s in %s (retry %d)", ...)
PMD_DRV_LOG(NOTICE, "%s: device %s sa_family=%d not ARPHRD_ETHER, ...")
PMD_DRV_LOG(NOTICE, "%s: MAC mismatch for %s: got ... expected ...")
Combined with patch 1 (indefinite retries) and patch 5 (up to 120
mac_retry passes), a multi-NIC VM can produce hundreds of NOTICE lines
per VF re-attach. Per-iteration trace belongs at DEBUG; reserve NOTICE
for one-shot state transitions (loop start, match found, give-up).
Info:
drivers/net/netvsc/hn_ethdev.c, MAC mismatch log
The format string and byte arguments expand the MAC inline:
"%02x:%02x:%02x:%02x:%02x:%02x ..."
eth_addr.addr_bytes[0], ..., eth_addr.addr_bytes[5],
dev->data->mac_addrs->addr_bytes[0], ...
rte_ether.h provides RTE_ETHER_ADDR_PRT_FMT and RTE_ETHER_ADDR_BYTES()
for exactly this; using them is shorter and consistent with the rest
of DPDK.
Info:
drivers/net/netvsc/hn_ethdev.c, MAC match block
The new "} else {" follows an if-branch that already ends in break, so
the else is redundant — the log can sit unindented after the closing
brace of the if.
=========================================================================
[PATCH 5/7] net/netvsc: retry when no matching MAC found in net directory
=========================================================================
Error:
Commit message vs. code disagree on the retry budget.
Commit message:
"This uses a separate mac_retry counter (limit 30, ~30 seconds) so
the main retry loop remains unlimited."
Code:
/* Max retries when net/ directory exists but no matching MAC found.
* On multi-NIC PCI devices, a second VF may register later.
* 120 retries = ~2 minutes.
*/
#define NETVSC_MAX_MAC_RETRY 120
Either the commit message or the macro value needs to be corrected so
they agree. With NETVSC_HOTADD_RETRY_INTERVAL = 1s, 120 retries is
~2 min, not ~30 s.
Info:
drivers/net/netvsc/hn_ethdev.c, in the new "no MAC match" block
if (di) {
closedir(di);
di = NULL;
DPDK style requires explicit pointer comparison: "if (di != NULL)".
=========================================================================
[PATCH 6/7] net/netvsc: forward per-queue stats from VF device
=========================================================================
Warning:
drivers/net/netvsc/hn_vf.c, hn_vf_stats_get
if (vf_dev->dev_ops->stats_get)
ret = vf_dev->dev_ops->stats_get(vf_dev, stats, qstats);
else
ret = rte_eth_stats_get(vf_dev->data->port_id, stats);
Reaching directly into another ethdev's dev_ops->stats_get is not a
pattern used anywhere else in DPDK (bonding, failsafe, etc. all go
through rte_eth_stats_get). The shortcut bypasses three things the
public wrapper does:
- RTE_ETH_VALID_PORTID_OR_ERR_RET on the VF port_id
- memset(stats, 0) and stats->rx_nombuf = data->rx_mbuf_alloc_failed
- eth_err() return-code translation
The memset is harmless here because the outer ethdev wrapper has
already zeroed stats/qstats for the netvsc port. The rx_nombuf
attribution, however, changes: previously stats->rx_nombuf was
pre-loaded with the VF's data->rx_mbuf_alloc_failed before the VF's
callback ran; now it's whatever was set for the netvsc port. Worth
documenting in the commit message if intentional.
Also, the dev_ops pointer test uses an implicit comparison:
if (vf_dev->dev_ops->stats_get)
Per DPDK style this should be "!= NULL".
A cleaner long-term fix would be to extend the ethdev API so per-queue
stats can be forwarded through the public path, or wrap this dispatch
in a small helper with a comment explaining why the public API is
insufficient.
Info:
drivers/net/netvsc/hn_vf.c
Some VF drivers may return -ENOTSUP or partial data when called with a
non-NULL qstats. The patch unconditionally propagates that return,
which could turn a previously-successful stats_get into a failure for
netvsc users on certain VF drivers. A graceful fallback to
rte_eth_stats_get on -ENOTSUP would preserve backward compatibility.
=========================================================================
[PATCH 7/7] net/netvsc: handle VF recovery events for service reset
=========================================================================
Warning:
drivers/net/netvsc/hn_vf.c, hn_eth_recovery_success_callback
rte_rwlock_write_lock(&hv->vf_lock);
if (hv->vf_ctx.vf_attached && !hv->vf_ctx.vf_vsc_switched) {
ret = hn_nvs_set_datapath(hv, NVS_DATAPATH_VF);
...
}
hn_vf_add_unlocked guards the equivalent NVS_DATAPATH_VF switch with
"if (dev->data->dev_started)" precisely to avoid routing traffic to
the VF before queues are configured. The new callback omits that
check. If the user calls rte_eth_dev_stop() on the netvsc port during
the ERR_RECOVERING -> RECOVERY_SUCCESS window, this callback will
switch the host data path to a VF whose DPDK queues may have been
torn down. Recommend mirroring the dev_started check.
Warning:
drivers/net/netvsc/hn_vf.c, hn_eth_recovering_callback and
hn_eth_recovery_success_callback
Both new callbacks acquire hv->vf_lock as a writer directly in the
event-callback context. The existing hn_eth_rmv_event_callback
intentionally does not take vf_lock — it defers via:
rte_eal_alarm_set(1, hn_remove_delayed, hv);
to break a possible lock-order coupling with the calling driver (the
PMD that fires rte_eth_dev_callback_process may itself be holding an
internal lock). Taking vf_lock directly inside the recovery callbacks
introduces a different lock-ordering invariant from the rest of the
file. Consider either deferring through rte_eal_alarm_set as well, or
adding a comment explaining why direct acquisition is safe in the
MANA -> netvsc invocation path.
Info:
drivers/net/netvsc/hn_vf.c, hn_eth_recovery_failed_callback
rte_eal_alarm_set(1, hn_remove_delayed, hv);
The state of hv->vf_ctx.vf_attached is not checked. If RECOVERY_FAILED
arrives after a concurrent INTR_RMV has already detached the VF, the
alarm will fire hn_remove_delayed against an already-removed port,
generating a spurious "Start to remove port 0" log and downstream
unregister failures. The damage is limited because hn_remove_delayed
takes the lock and re-checks state, but the noise is avoidable with a
vf_attached guard before scheduling the alarm.
=========================================================================
General notes on the series
=========================================================================
Patches 1, 2, and 5 together turn netvsc_hotplug_retry into a
genuinely-unbounded loop in two dimensions:
- opendir failures retry indefinitely (patch 1, capped only by PCI
device disappearance)
- SIOCGIFHWADDR failures and -ENODEV from rte_eal_hotplug_add reschedule
without incrementing any counter (patches 2, 3)
- "no matching MAC" retries are bounded by mac_retry / 120 (patch 5)
Only patch 5's path has a hard upper bound. The other paths rely on the
PCI device eventually disappearing if the VF never comes up. On a
permanently broken VF that stays present in sysfs but never registers
a netdev, the EAL alarm thread will spin forever. Worth confirming that
this is the intended behaviour, or adding an overall ceiling (e.g., a
much higher cap than 30s but still finite) for the opendir / ioctl /
ENODEV paths.
More information about the dev
mailing list