[PATCH 1/7] net/netvsc: retry VF hotplug indefinitely until PCI device disappears

Stephen Hemminger stephen at networkplumber.org
Wed May 6 00:03:22 CEST 2026
Previous message (by thread): [PATCH 7/7] net/netvsc: handle VF recovery events for service reset
Next message (by thread): [EXTERNAL] Re: [PATCH 1/7] net/netvsc: retry VF hotplug indefinitely until PCI device disappears
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Tue,  5 May 2026 13:44:50 -0700
Long Li <longli at microsoft.com> wrote:

> After PCI rescan on Azure, the MANA kernel driver can take over 100
> seconds to probe and create the /sys/bus/pci/devices/<dev>/net directory.
> The previous fixed retry limit (NETVSC_MAX_HOTADD_RETRY=10, ~12 seconds)
> was insufficient, causing VF re-attach to fail with 'Failed to parse PCI
> device' on systems with slow MANA driver initialization.
> 
> Replace the fixed retry limit with an indefinite retry that only gives up
> when the PCI device itself disappears from sysfs. This is safe because:
> 
> - The retry uses rte_eal_alarm callbacks which are serialized on the EAL
>   interrupt thread, preventing races with VF remove or device close paths.
> - Device close (eth_hn_dev_uninit) explicitly cancels all pending hotplug
>   alarms via rte_eal_alarm_cancel and frees the context.
> - If the PCI device is removed while retrying, access() detects the
>   missing sysfs path and stops immediately.
> 
> A periodic NOTICE log every 30 retries (~30s) provides visibility into
> long waits without flooding the log at DEBUG level.
> 
> Fixes: a2a23a794b3a ("net/netvsc: support VF device hot add/remove")
> Cc: stable at dpdk.org
> Signed-off-by: Long Li <longli at microsoft.com>
> ---

Since this series touches lots of stuff, decided to do a more through than
normal AI review

Review of patch series: net/netvsc VF hotplug improvements (7 patches)

Series targets stale interface names, slow MANA driver probe, multi-VF
PCI devices, and adds service-reset event handling. Reviewed against
upstream main as of clone date.

=========================================================================
[PATCH 3/7] net/netvsc: retry full probe when IB device not ready during
                       hotplug
=========================================================================

Warning:

  drivers/net/netvsc/hn_ethdev.c (around the new ENODEV/EEXIST handling)

  The original code logged failures from rte_eal_hotplug_add at
  PMD_DRV_LOG(ERR). The patch lowers all non-ENODEV, non-EEXIST returns
  to PMD_DRV_LOG(NOTICE):

      if (ret && ret != -EEXIST)
              PMD_DRV_LOG(NOTICE,
                          "Failed to add PCI device %s (ret=%d)",
                          d->name, ret);

  -ENOMEM, -EBUSY, -EIO and similar errors from rte_eal_hotplug_add are
  legitimate failures that operators need to see. Recommend keeping ERR
  for unrecognized error codes; only -ENODEV and -EEXIST are now expected
  outcomes worth a quieter log level.


=========================================================================
[PATCH 4/7] net/netvsc: add NOTICE-level debug logging for VF hotplug
                       retry
=========================================================================

Warning:

  drivers/net/netvsc/hn_ethdev.c, multiple sites in netvsc_hotplug_retry

  The new log calls inside the readdir() loop fire at NOTICE for every
  directory entry on every retry pass:

    PMD_DRV_LOG(NOTICE, "%s: checking interface %s in %s (retry %d)", ...)
    PMD_DRV_LOG(NOTICE, "%s: device %s sa_family=%d not ARPHRD_ETHER, ...")
    PMD_DRV_LOG(NOTICE, "%s: MAC mismatch for %s: got ... expected ...")

  Combined with patch 1 (indefinite retries) and patch 5 (up to 120
  mac_retry passes), a multi-NIC VM can produce hundreds of NOTICE lines
  per VF re-attach. Per-iteration trace belongs at DEBUG; reserve NOTICE
  for one-shot state transitions (loop start, match found, give-up).

Info:

  drivers/net/netvsc/hn_ethdev.c, MAC mismatch log

  The format string and byte arguments expand the MAC inline:

      "%02x:%02x:%02x:%02x:%02x:%02x ..."
      eth_addr.addr_bytes[0], ..., eth_addr.addr_bytes[5],
      dev->data->mac_addrs->addr_bytes[0], ...

  rte_ether.h provides RTE_ETHER_ADDR_PRT_FMT and RTE_ETHER_ADDR_BYTES()
  for exactly this; using them is shorter and consistent with the rest
  of DPDK.

Info:

  drivers/net/netvsc/hn_ethdev.c, MAC match block

  The new "} else {" follows an if-branch that already ends in break, so
  the else is redundant — the log can sit unindented after the closing
  brace of the if.


=========================================================================
[PATCH 5/7] net/netvsc: retry when no matching MAC found in net directory
=========================================================================

Error:

  Commit message vs. code disagree on the retry budget.

  Commit message:
      "This uses a separate mac_retry counter (limit 30, ~30 seconds) so
       the main retry loop remains unlimited."

  Code:
      /* Max retries when net/ directory exists but no matching MAC found.
       * On multi-NIC PCI devices, a second VF may register later.
       * 120 retries = ~2 minutes.
       */
      #define NETVSC_MAX_MAC_RETRY 120

  Either the commit message or the macro value needs to be corrected so
  they agree. With NETVSC_HOTADD_RETRY_INTERVAL = 1s, 120 retries is
  ~2 min, not ~30 s.

Info:

  drivers/net/netvsc/hn_ethdev.c, in the new "no MAC match" block

      if (di) {
              closedir(di);
              di = NULL;

  DPDK style requires explicit pointer comparison: "if (di != NULL)".


=========================================================================
[PATCH 6/7] net/netvsc: forward per-queue stats from VF device
=========================================================================

Warning:

  drivers/net/netvsc/hn_vf.c, hn_vf_stats_get

      if (vf_dev->dev_ops->stats_get)
              ret = vf_dev->dev_ops->stats_get(vf_dev, stats, qstats);
      else
              ret = rte_eth_stats_get(vf_dev->data->port_id, stats);

  Reaching directly into another ethdev's dev_ops->stats_get is not a
  pattern used anywhere else in DPDK (bonding, failsafe, etc. all go
  through rte_eth_stats_get). The shortcut bypasses three things the
  public wrapper does:

    - RTE_ETH_VALID_PORTID_OR_ERR_RET on the VF port_id
    - memset(stats, 0) and stats->rx_nombuf = data->rx_mbuf_alloc_failed
    - eth_err() return-code translation

  The memset is harmless here because the outer ethdev wrapper has
  already zeroed stats/qstats for the netvsc port. The rx_nombuf
  attribution, however, changes: previously stats->rx_nombuf was
  pre-loaded with the VF's data->rx_mbuf_alloc_failed before the VF's
  callback ran; now it's whatever was set for the netvsc port. Worth
  documenting in the commit message if intentional.

  Also, the dev_ops pointer test uses an implicit comparison:

      if (vf_dev->dev_ops->stats_get)

  Per DPDK style this should be "!= NULL".

  A cleaner long-term fix would be to extend the ethdev API so per-queue
  stats can be forwarded through the public path, or wrap this dispatch
  in a small helper with a comment explaining why the public API is
  insufficient.

Info:

  drivers/net/netvsc/hn_vf.c

  Some VF drivers may return -ENOTSUP or partial data when called with a
  non-NULL qstats. The patch unconditionally propagates that return,
  which could turn a previously-successful stats_get into a failure for
  netvsc users on certain VF drivers. A graceful fallback to
  rte_eth_stats_get on -ENOTSUP would preserve backward compatibility.


=========================================================================
[PATCH 7/7] net/netvsc: handle VF recovery events for service reset
=========================================================================

Warning:

  drivers/net/netvsc/hn_vf.c, hn_eth_recovery_success_callback

      rte_rwlock_write_lock(&hv->vf_lock);
      if (hv->vf_ctx.vf_attached && !hv->vf_ctx.vf_vsc_switched) {
              ret = hn_nvs_set_datapath(hv, NVS_DATAPATH_VF);
              ...
      }

  hn_vf_add_unlocked guards the equivalent NVS_DATAPATH_VF switch with
  "if (dev->data->dev_started)" precisely to avoid routing traffic to
  the VF before queues are configured. The new callback omits that
  check. If the user calls rte_eth_dev_stop() on the netvsc port during
  the ERR_RECOVERING -> RECOVERY_SUCCESS window, this callback will
  switch the host data path to a VF whose DPDK queues may have been
  torn down. Recommend mirroring the dev_started check.

Warning:

  drivers/net/netvsc/hn_vf.c, hn_eth_recovering_callback and
                              hn_eth_recovery_success_callback

  Both new callbacks acquire hv->vf_lock as a writer directly in the
  event-callback context. The existing hn_eth_rmv_event_callback
  intentionally does not take vf_lock — it defers via:

      rte_eal_alarm_set(1, hn_remove_delayed, hv);

  to break a possible lock-order coupling with the calling driver (the
  PMD that fires rte_eth_dev_callback_process may itself be holding an
  internal lock). Taking vf_lock directly inside the recovery callbacks
  introduces a different lock-ordering invariant from the rest of the
  file. Consider either deferring through rte_eal_alarm_set as well, or
  adding a comment explaining why direct acquisition is safe in the
  MANA -> netvsc invocation path.

Info:

  drivers/net/netvsc/hn_vf.c, hn_eth_recovery_failed_callback

      rte_eal_alarm_set(1, hn_remove_delayed, hv);

  The state of hv->vf_ctx.vf_attached is not checked. If RECOVERY_FAILED
  arrives after a concurrent INTR_RMV has already detached the VF, the
  alarm will fire hn_remove_delayed against an already-removed port,
  generating a spurious "Start to remove port 0" log and downstream
  unregister failures. The damage is limited because hn_remove_delayed
  takes the lock and re-checks state, but the noise is avoidable with a
  vf_attached guard before scheduling the alarm.


=========================================================================
General notes on the series
=========================================================================

Patches 1, 2, and 5 together turn netvsc_hotplug_retry into a
genuinely-unbounded loop in two dimensions:

  - opendir failures retry indefinitely (patch 1, capped only by PCI
    device disappearance)
  - SIOCGIFHWADDR failures and -ENODEV from rte_eal_hotplug_add reschedule
    without incrementing any counter (patches 2, 3)
  - "no matching MAC" retries are bounded by mac_retry / 120 (patch 5)

Only patch 5's path has a hard upper bound. The other paths rely on the
PCI device eventually disappearing if the VF never comes up. On a
permanently broken VF that stays present in sysfs but never registers
a netdev, the EAL alarm thread will spin forever. Worth confirming that
this is the intended behaviour, or adding an overall ceiling (e.g., a
much higher cap than 30s but still finite) for the opendir / ioctl /
ENODEV paths.
Previous message (by thread): [PATCH 7/7] net/netvsc: handle VF recovery events for service reset
Next message (by thread): [EXTERNAL] Re: [PATCH 1/7] net/netvsc: retry VF hotplug indefinitely until PCI device disappears
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list