[PATCH v2 1/2] net/mlx5: check for no data read in devx interrupt
Kevin Traynor
ktraynor at redhat.com
Tue Feb 10 20:07:52 CET 2026
On 10/02/2026 17:05, Slava Ovsiienko wrote:
> Hi,
>
Hi Slava,
> I'm sorry, I have some concern about the patch.
>
No problem, that's what reviews are for :-) thanks for reviewing.
> How it works, as far as I understand:
>
> - DPDK simulates interrupts in user mode with epoll_wait()
> - mlx5 PMD emits the async counter query command to the NIC periodically
I didn't think this would happen unless there was something like
hardware offload, but regardless, yes I agree there may be async counter
queries.
> - there might be multiple async query commands in the flight
> - kernel drivers handles the async query completion interrupts, pushes the token to the internal completion queue and unblocks associated fd
> - epoll_wait() sees this unblocked fd and notifies mlx5 PMD about
> - mlx5 PMD reads the completion token from the kernel queue with devx_get_async_cmd_comp()
>
> The concern scenario, let's assume:
>
> - we have 2 async query commands in the flight
> - the first async query completes, fd is unblocked, PMD is inviked, the completion is read by PMD and is being handled
> - the second async query completes, fd gets unblocked, the second token is written to the queue
> - the PMD completes the handling of the first completion and reads the queue again (with devx_get_async_cmd_comp() call in the loop)
> - it reads the second token successfully and handles
> - then, on the third call, devx_get_async_cmd_comp() returns EAGAIN, it means queue is empty
> - DPDK calls epoll_wait() again and sees unblocked fd
> - it call mlx5 PMD, and it calls devx_get_async_cmd_comp(), but queue is empty (handled in previous interrupt handling)
> - with the patch we wrongly remove the handler
>
I'm not sure, but this ^^^ sounds feasible.
> In my opinion, we should handle flags EPOLLERR | EPOLLHUP | EPOLLRDHU from the epoll_wait()_return also for
> RTE_INTR_HANDLE_EXT and RTE_INTR_HANDLE_DEV_EVENT interrupt types.
>
That's exactly what I had in v1 of the patch! The issue is that some
clients of eal interrupt may not interpret the condition of
EPOLLHUP/EPOLLRDHUP as an error condition and/or want to do some special
handling.
The example is vhost user server, which puts in place a reconnect
mechanism. If we filter out EPOLLHUP/EPOLLRDHUP events in eal, then
virtio will not receive the callback and vhost server reconnect is
broken. I have some more notes about it in the cover letter.
Trying to base on the read pattern in devx handler was an attempt to
move logic out of eal so different handlers could be flexible in how
they handle this condition.
We do have a distinction in that mlx5 uses RTE_INTR_HANDLE_EXT and
virtio uses RTE_INTR_HANDLE_VDEV but i'm not sure that is generic enough
to base a check/don't check for EPOLLHUP/EPOLLRDHUP events on.
So we'd need to come up with another solution if we wanted to filter
this in eal. Let's think more on this, though we are a bit constrained
by public API as well.
A workaround we can do from application is David's hack™ "-a
0000:00:00.0" to skip initial probe. That will at least prevent the
issue for mlx devices not used in DPDK, which was the scenario reported.
thanks,
Kevin.
> With best regards,
> Slava
>
>
>
>
>> -----Original Message-----
>> From: Kevin Traynor <ktraynor at redhat.com>
>> Sent: Tuesday, February 10, 2026 5:06 PM
>> To: Stephen Hemminger <stephen at networkplumber.org>
>> Cc: dev at dpdk.org; NBU-Contact-Thomas Monjalon (EXTERNAL)
>> <thomas at monjalon.net>; david.marchand at redhat.com; Dariusz Sosnowski
>> <dsosnowski at nvidia.com>; Slava Ovsiienko <viacheslavo at nvidia.com>;
>> stable at dpdk.org
>> Subject: Re: [PATCH v2 1/2] net/mlx5: check for no data read in devx interrupt
>>
>> On 07/02/2026 06:09, Stephen Hemminger wrote:
>>> On Fri, 6 Feb 2026 17:20:53 +0000
>>> Kevin Traynor <ktraynor at redhat.com> wrote:
>>>
>>>> A busy-loop may occur when there are EPOLLERR, EPOLLHUP or
>> EPOLLRDHUP
>>>> epoll events for the devx interrupt fd.
>>>>
>>>> This may happen if the interrupt fd is deleted, if the device is
>>>> unbound from mlx5_core kernel driver or if the device is removed by
>>>> the mlx5 kernel driver as part of LAG setup.
>>>>
>>>> When that occurs, there is no data to be read and in the devx
>>>> interrupt handler an EAGAIN is returned on the first call to
>>>> devx_get_async_cmd_comp, but this is not checked.
>>>>
>>>> As the interrupt is not removed or condition reset, it causes an
>>>> interrupt processing busy-loop, which leads to the dpdk-intr thread
>>>> going to 100% CPU.
>>>>
>>>> e.g.
>>>> epoll_wait
>>>> (6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1) =
>>>> 1 read(28, 0x7f1f5c7fc2f0, 40)
>>>> = -1 EAGAIN (Resource temporarily unavailable) epoll_wait
>>>> (6, [{events=EPOLLIN|EPOLLRDHUP, data={u32=28, u64=28}}], 8, -1) =
>>>> 1 read(28, 0x7f1f5c7fc2f0, 40)
>>>> = -1 EAGAIN (Resource temporarily unavailable)
>>>>
>>>> Add a check for an EAGAIN return from devx_get_async_cmd_comp on the
>>>> first read. If that happens, unregister the callback to prevent looping.
>>>>
>>>> Bugzilla ID: 1873
>>>> Fixes: f15db67df09c ("net/mlx5: accelerate DV flow counter query")
>>>> Cc: stable at dpdk.org
>>>>
>>>> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
>>>
>>> AI spotted this, I didn't...
>>>
>>>
>>> Errors:
>>>
>>> Line 139: Unnecessary semicolon after closing brace
>>>
>>> c
>>>
>>> };
>>>
>>> Should be:
>>> c
>>>
>>> }
>>>
>>> Lines 142-146: Block comment uses incorrect style Block comments in C
>> code should use /* and */ style, not /** which is reserved for documentation
>> comments.
>>>
>>> c
>>>
>>> /**
>>> * no data and EAGAIN indicate there is an error or
>>> * disconnect state. Unregister callback to prevent
>>> * interrupt busy-looping.
>>> */
>>>
>>> Should be:
>>> c
>>>
>>> /*
>>> * no data and EAGAIN indicate there is an error or
>>> * disconnect state. Unregister callback to prevent
>>> * interrupt busy-looping.
>>> */
>>>
>>> Warnings:
>>>
>>> Logic clarity: The variable data_read is set to true inside the while loop but
>> never checked when data WAS read. Consider if data_read is the clearest way to
>> express this condition.
>>>
>>
>> Ack above. Thanks.Will be fixed in v3.
>
More information about the stable
mailing list