<div dir="ltr">Hi Ivan, <br><div>Thanks for the detailed explanation. </div><div><br></div><div>While primary termination is a known limitation, I believe the application or PMD should handle edge cases proactively to ensure stability.<br><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Tue, Jul 22, 2025 at 9:26 PM Ivan Malov <<a href="mailto:ivan.malov@arknetworks.am">ivan.malov@arknetworks.am</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Dariusz, Khadem,<br>
<br>
On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:<br>
<br>
> On Mon, Jul 21, 2025 at 01:59:59PM +0200, Thomas Monjalon wrote:<br>
>> 21/07/2025 13:46, Ivan Malov:<br>
>>> On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:<br>
>>><br>
>>>> + mlx5 maintainers<br>
>>>><br>
>>>> Thank you for the patch.<br>
>>>><br>
>>>> Could you please include other PMD maintainers (or other maintainers, depending on changed code)<br>
>>>> in the future patches?<br>
>>>> There is a script which automatically adds maintainers while sending a patch.<br>
>>>> It is described in: <a href="https://doc.dpdk.org/guides/contributing/patches.html#sending-patches" rel="noreferrer" target="_blank">https://doc.dpdk.org/guides/contributing/patches.html#sending-patches</a><br>
>>>><br>
>>>> On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:<br>
>>>>> When the primary process exits, the shared mlx5 state becomes<br>
>>>>> unavailable to secondary processes. If a secondary process attempts<br>
>>>>> to query device information (e.g., via testpmd), a NULL dereference<br>
>>>>> may occur due to missing shared data.<br>
>>>>><br>
>>>>> This patch adds a check for shared context availability and fails<br>
>>>>> gracefully while preventing a crash.<br>
>>>>><br>
>>>>> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")<br>
>>>>> Cc: <a href="mailto:stable@dpdk.org" target="_blank">stable@dpdk.org</a><br>
>>>>><br>
>>>>> Signed-off-by: Khadem Ullah <<a href="mailto:14pwcse1224@uetpeshawar.edu.pk" target="_blank">14pwcse1224@uetpeshawar.edu.pk</a>><br>
>>>>> ---<br>
>>>>> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++<br>
>>>>> 1 file changed, 6 insertions(+)<br>
>>>>><br>
>>>>> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c<br>
>>>>> index 68d1c1bfa7..1848f6536a 100644<br>
>>>>> --- a/drivers/net/mlx5/mlx5_ethdev.c<br>
>>>>> +++ b/drivers/net/mlx5/mlx5_ethdev.c<br>
>>>>> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)<br>
>>>>> * Since we need one CQ per QP, the limit is the minimum number<br>
>>>>> * between the two values.<br>
>>>>> */<br>
>>>>> + if (priv == NULL || priv->sh == NULL) {<br>
>>>>> + DRV_LOG(ERR,<br>
>>>>> + "mlx5 shared data unavailable (primary process likely exited)");<br>
>>>>> + rte_errno = ENODEV;<br>
>>>>> + return -rte_errno;<br>
>>>>> + }<br>
>>>><br>
>>>> I don't think it's an issue on PMD level, but rather on<br>
>>>> ethdev/multi-process handling level.<br>
>>><br>
>>> I may be very wrong here, but can't the PMD use its internal 'shared' state to<br>
>>> somehow memorise the fact that a secondary process has attached, in order to not<br>
>>> let the primary process close in the first place (before the secondary process<br>
>>> detaches)? Or is this going to violate some established convention?<br>
>><br>
>> It looks to be a good idea.<br>
><br>
> I agree with idea of adding these checks, but not entirely agree with<br>
> it being at driver level, since all drivers would have to duplicate this logic.<br>
><br>
> Drivers already have to go through ethdev library when port is probed<br>
> on secondary process - rte_eth_dev_attach_secondary() must be<br>
> called to retrieve port data local to the process and device data shared<br>
> between processes.<br>
> Memorizing whether a secondary process is using given port can be added<br>
> to rte_eth_dev_attach_secondary(), and relevant check for primary<br>
> process can then be added to rte_eth_dev_close(), so that all drivers<br>
> benefit.<br>
><br>
> What do you think?<br>
<br>
Yes, in general, the idea would be to have a "secondary reference counter" field<br>
in 'rte_eth_dev_data' and have it incremented/decremented/checked in some<br>
generic places. The issue is that, in addition to 'rte_eth_dev_close', a device<br>
can be "closed" via its respective bus 'remove' method. Yes, there are drivers<br>
that invoke 'rte_eth_dev_close' inside the 'remove' implementation, but not all<br>
of them, unfortunately. For example, take a look at 'af_packet' and similar.<br>
<br>
On the other hands, there are drivers that use 'rte_eth_dev_create' and its<br>
counterpart 'rte_eth_dev_destroy' in the implementations of bus 'probe/remove',<br>
but that is also not consistent across the 'drivers/net' tree.<br>
<br>
Given all these issues and the fact that ethdev is not the only possible device<br>
type, may be 'rte_device' can house the reference counter, with 'rte_dev_probe'<br>
and 'rte_dev_remove' augmented to do the inrement/decrement/validate thing?<br>
And, since 'rte_eth_dev_close' invokes direct ethdev 'close' method, also add a<br>
check to over there (and may be to 'rte_eth_dev_reset', too)? May be I'm wrong.<br>
<br>
However, all of that might still make no sense, given the fact that the primary<br>
process can just die. So, may be I am indeed very wrong and, as Stephen has<br>
suggested in [1], this issue it to be addressed by clarifying the documentation.<br>
<br>
[1] <a href="https://mails.dpdk.org/archives/dev/2025-July/321865.html" rel="noreferrer" target="_blank">https://mails.dpdk.org/archives/dev/2025-July/321865.html</a><br>
<br>
Thank you.<br>
<br>
><br>
>><br>
>>>> When primary process closes the port, ethdev library zeroes and frees<br>
>>>> device data shared between processes.<br>
>>>> ethdev port data (rte_eth_dev) on secondary is not updated so it now points to<br>
>>>> invalid data. rte_eth_dev_info_get() is not the only API call affected.<br>
>>>><br>
>>>> If the primary process closes the port before exiting<br>
>>>> (like testpmd does) and it exits before the secondary,<br>
>>>> the any driver call seems invalid because of that use-after-free behavior.<br>
>>>><br>
>>>> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports<br>
>>>> in secondary process after primary has gracefully exited is supported?<br>
>><br>
>> No there is no statement about whether it is supported or not.<br>
>> I think we should at least return an error instead of crashing.<br>
>><br>
>><br>
><br>
</blockquote></div><div><br clear="all"></div><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="color:rgb(12,100,192)">Engr. Khadem Ullah, </span><br></div><div style="font-family:Calibri,Arial,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><span style="color:rgb(12,100,192)">Software Engineer, </span><br></div><span style="color:rgb(12,100,192)">Dreambig Semiconductor Inc <br></span></div><div dir="ltr"><span style="color:rgb(12,100,192)"><a href="https://dreambigsemi.com/" target="_blank">https://dreambigsemi.com/</a><br></span></div></div></div></div>