|WARNING| [v4,5/5] eal: avoid deadlock in async IPC alarm callback

dpdklab at iol.unh.edu dpdklab at iol.unh.edu
Fri Jun 5 17:57:42 CEST 2026
Previous message (by thread): |SUCCESS|dpdk-next-net-intel|ad518cf8ac| intel-Functional
Next message (by thread): |SUCCESS| pw165033 [PATCH] fib6: fix error code propagation on next hop update
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Test-Label: ai-code-review
Test-Status: WARNING
https://dpdk.org/patch/165032

_AI Code Review_

=== Patch Review: series.patch (via Claude) ===
Review date: 2026-06-05

# DPDK Patch Review

## PATCH 1/5: eal: fix wrong log message in async IPC request

**No issues found.**

---

## PATCH 2/5: eal: fix async IPC callback not fired when no peers

### Errors

**Missing error path cleanup after alarm set failure (secondary process path)**

In the secondary process path (lines 1198-1205), when `rte_eal_alarm_set()` fails, the code removes the dummy from the queue but does not unlock `pending_requests.lock` before jumping to `fail`. The `fail` label expects the lock to be held and will unlock it, but the function will return while still holding the lock acquired at line 1186.

Suggested fix: The rollback code is correct (remove from queue, clear `dummy_used`), but it should `goto fail` which will handle unlocking. However, verify the lock is still held at that point -- if the code path is correct as written, this is fine. Actually, looking more carefully: the lock is acquired at line 1186, and the `goto fail` at line 1205 will reach the `fail` label which unlocks at line 1216. The structure is correct. **Withdraw this item.**

**Inconsistent error handling between primary and secondary paths**

In the primary process path (lines 1252-1253), when `rte_eal_alarm_set()` fails for the dummy request, the code logs an error but does NOT set `ret = -1` or `goto fail`. This leaves `ret == 0` and the function returns success even though the alarm was not set, meaning the callback will never fire and the user callback is lost.

Compare to the secondary path (lines 1198-1205) where alarm failure correctly sets `ret = -1` and rolls back.

Suggested fix:
```c
if (rte_eal_alarm_set(1, async_reply_handle,
		(void *)(uintptr_t)dummy->id) < 0) {
	EAL_LOG(ERR, "Fail to set alarm for dummy request");
	/* roll back the changes */
	TAILQ_REMOVE(&pending_requests.requests, dummy, next);
	dummy_used = false;
	ret = -1;
	goto fail;
}
```

---

## PATCH 3/5: eal: fix memory leak in async IPC secondary path

**No issues found.** The fix correctly frees the unused dummy on the success path.

---

## PATCH 4/5: eal: fix async IPC resource leaks on partial failure

### Errors

**Use-after-free: alarm callback can access freed request after partial rollback**

The partial-failure rollback (lines 1274-1291) removes pending requests from the queue and frees them (`free(r->reply)`, `free(r)`) while the alarm is still active. The alarm callback `async_reply_handle()` is scheduled with `(void *)(uintptr_t)req->id` as the argument.

If the alarm fires after the request is freed but before `async_reply_handle()` runs, the callback will call `find_async_request_by_id(id)` (line 580), which searches the now-empty queue and returns `NULL`. That is safe -- the callback becomes a no-op.

However, the comment at line 1277 says "Any alarm callback that runs later for these removed IDs will not find a pending request and will return." This is correct. The code is safe because:
1. The alarm arg is the numeric `id`, not a pointer to the request
2. The callback does ID-based lookup, not pointer dereference
3. If the ID is not found, the callback returns without touching the freed memory

**Withdraw this item** -- the code is correct.

**Missing alarm cancellation on partial failure (potential correctness issue)**

When rolling back queued requests on partial failure (lines 1274-1291), the code removes requests from the queue but does not cancel their alarms. The alarms will fire later, call `find_async_request_by_id()`, find nothing, and return harmlessly.

However, this leaves spurious alarms active. While not a correctness bug (the callback handles missing requests safely), it wastes CPU cycles and EAL alarm framework resources.

Consider calling `rte_eal_alarm_cancel(async_reply_handle, (void *)(uintptr_t)r->id)` in the rollback loop. If cancellation fails with `EINPROGRESS`, that is fine -- the callback will complete and find nothing.

This is a **Warning** -- the current code is safe but not optimal.

---

## PATCH 5/5: eal: avoid deadlock in async IPC alarm callback

### Errors

**Potential alarm leak on request removal**

The patch removes the `rte_eal_alarm_cancel()` call from `async_reply_handle_thread_unsafe()` (lines 552-562 deleted). The commit message justifies this: cancelling can deadlock because `rte_eal_alarm_cancel()` spin-waits for an executing callback while the callback is blocked on `pending_requests.lock`.

However, this means every request now has an alarm that will fire even if the request completes normally (via reply receipt). When the alarm fires, `find_async_request_by_id()` returns `NULL` and the callback does nothing. This is safe but leaves spurious alarms running.

The commit message acknowledges this ("let stale alarms fire; with ID-based lookup the callback will simply not find the request and return harmlessly"). This is a deliberate trade-off to avoid deadlock.

**Do NOT flag** -- the design is intentional and documented.

---

## PATCH 4/5 (additional review): Alarm set moved before allocation

**Potential issue: alarm can fire before request is fully initialized**

In `mp_request_async()`, the alarm is now set (line 908) BEFORE allocating `pending_req` and `reply_msg` (line 914). If the alarm fires immediately (e.g., `ts` is very small or the system is under load), `async_reply_handle()` will call `find_async_request_by_id(id)` and find nothing because the request has not been inserted into the queue yet.

This is safe (the callback returns harmlessly) but unusual. The comment at lines 900-907 explains: "Set alarm before allocating or sending. The alarm is never cancelled... Instead, let stale alarms fire."

The design is that the alarm starts the timeout window immediately when the ID is reserved, even if allocation or sending fails. If allocation fails, the alarm fires and finds no request (harmless). If sending succeeds, the request is inserted and the alarm will find it when it fires.

**Do NOT flag** -- this is intentional per the comment.

---

## Summary

### Errors (must fix)

**Patch 2/5:**
- Inconsistent error handling between primary and secondary paths: when `rte_eal_alarm_set()` fails for the dummy request in the primary process path (line 1252), the code logs an error but does not set `ret = -1` or roll back, returning success incorrectly. The secondary path correctly handles this failure.

### Warnings (should fix)

**Patch 4/5:**
- Missing alarm cancellation on partial failure rollback: alarms remain active for freed requests, causing harmless but wasteful spurious callbacks. Consider cancelling alarms during rollback (acceptable if cancellation fails with `EINPROGRESS`).

### Info

No additional suggestions.
Previous message (by thread): |SUCCESS|dpdk-next-net-intel|ad518cf8ac| intel-Functional
Next message (by thread): |SUCCESS| pw165033 [PATCH] fib6: fix error code propagation on next hop update
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the test-report mailing list