[PATCH] lib/ethdev: fix segfault in secondary process by validating dev_private pointer
Stephen Hemminger
stephen at networkplumber.org
Wed Jul 23 00:28:30 CEST 2025
On Tue, 22 Jul 2025 23:05:06 +0400 (+04)
Ivan Malov <ivan.malov at arknetworks.am> wrote:
> There is a difference between control path and data path. Always has been. Yes,
> on data path, DPDK has historically sought better performance, but on the slow
> path, checks have typically been implemented, even in the flow API, with the
> only exception being "asynchronous flow" APIs, which are meant to be fast-path.
>
> Yes, the idea to have a "secondary process reference counter" in 'rte_device'
> to be either guarded with its own lock or accessed atomically by 'rte_dev_probe'
> and 'rte_dev_remove' (to increment and decrement/check respectively) as well as
> by 'rte_eth_dev_close' and 'rte_eth_dev_reset' (to decrement/check) may not be
> a hill to die on, to be honest, and might be wrong, so I have no strong opinion.
>
> What scares me most in this idea is that, one may still end up with certain
> entry points overlooked, rendering the whole effort worthless.
>
Please don't top post.

The DPDK control path has (up to now) assumed that control operations are
only done from a single thread on each port. There is also the issue of
hotplug, but that is separate. For example, if two threads start and stop
the same port, bad things happen and NIC drivers break.

This is not well documented, and a section on it needs to go into the
thread safety part of the programmer's guide. The whole thread safety
section is out of date and doesn't reference RCU when it should. It also
doesn't cover hotplug or weird secondary processes that fork.

There is also the issue of how primary/secondary monitoring works.
Right now the secondary monitors the primary by periodically polling a lock
file. This is an inherently racy method and leads to problems. It needs to be
redesigned to use a blocking method, something like spawning a thread in the
secondary that uses some part of the existing Unix domain IPC to get a
notification when the primary crashes or wants to exit. Ideally it would
support a synchronous handshake with all primaries and the asynchronous case
when the primary crashes.
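
To make the blocking idea concrete, here is a rough sketch, not a proposal
for the actual EAL code. It assumes the secondary holds a connected AF_UNIX
stream socket to the primary, which may not match how the existing IPC
channel is set up. The names (struct primary_monitor, primary_monitor_start)
and the one-byte 'Q'/'A' messages are made up for illustration. The thread
blocks in recv(): a 'Q' byte is an orderly exit request that gets
acknowledged, and EOF or an error means the primary is gone.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical states, not part of any DPDK API. */
enum primary_state { PRIMARY_ALIVE, PRIMARY_EXITING, PRIMARY_GONE };

struct primary_monitor {
	int fd;            /* connected AF_UNIX stream socket to the primary */
	atomic_int state;  /* one of enum primary_state */
	pthread_t thread;
};

/* Monitor thread: sleeps in recv() instead of polling a lock file. */
static void *primary_monitor_main(void *arg)
{
	struct primary_monitor *m = arg;
	char msg;

	for (;;) {
		ssize_t n = recv(m->fd, &msg, 1, 0);

		if (n < 0 && errno == EINTR)
			continue;
		if (n == 1) {
			if (msg == 'Q') {
				/* Synchronous case: record the request, then ack it. */
				atomic_store(&m->state, PRIMARY_EXITING);
				(void)send(m->fd, "A", 1, MSG_NOSIGNAL);
			}
			continue; /* keep waiting for the socket to close */
		}
		/* Asynchronous case: EOF (n == 0) or a socket error means the
		 * peer closed the socket, which also happens when it crashes. */
		atomic_store(&m->state, PRIMARY_GONE);
		return NULL;
	}
}

/* Start monitoring on an already connected socket; returns 0 on success. */
int primary_monitor_start(struct primary_monitor *m, int fd)
{
	m->fd = fd;
	atomic_init(&m->state, PRIMARY_ALIVE);
	return pthread_create(&m->thread, NULL, primary_monitor_main, m);
}
```

The rest of the secondary would read m->state (or get a callback) rather
than checking a file on a timer, so there is no polling interval in which a
primary crash can go unnoticed.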

The point is that band-aids in the ethdev layer won't fix it well
enough.