[PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo frames

Stephen Hemminger stephen at networkplumber.org
Tue Mar 24 00:42:05 CET 2026


On Wed, 18 Mar 2026 10:47:26 +0100
Sriram Yagnaraman <sriram.yagnaraman at ericsson.com> wrote:

> Enable jumbo frame reception with default mbuf data room size by
> chaining multiple mbufs when packet exceeds single mbuf tailroom.
> 
> Scatter Rx is only enabled when RTE_ETH_RX_OFFLOAD_SCATTER is
> requested. Packets are dropped if they exceed single mbuf size
> and scatter is not enabled, or if mbuf allocation fails during
> chaining. Error counter rx_dropped_pkts tracks all drops.
> 
> This allows receiving 9KB jumbo frames using standard 2KB mbufs,
> chaining ~5 segments per jumbo packet.
> 
> Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman at ericsson.com>
> ---

In general looks good, but AI review found some things.

Summary:

Thanks for the v5, Sriram. A few comments below.

1) pkt_len is uint16_t but tp_snaplen is __u32

The scatter helper also takes uint16_t data_len, so anything above 64KB silently truncates. For jumbo frames this is fine in practice, but the type mismatch is a latent bug. Consider uint32_t for both, or at least a comment explaining why uint16_t is sufficient.

2) Scatter alloc failure should bump rx_nombuf, not just rx_dropped_pkts

When rte_pktmbuf_alloc() fails for a chained segment inside eth_af_packet_rx_scatter(), only rx_dropped_pkts is incremented. The head mbuf alloc failure correctly bumps rx_nombuf. The scatter case is also a no-mbuf condition and should be reported as such, otherwise operators can't distinguish memory pressure from oversized-packet drops in the stats.

3) Reclaiming headroom in chained segments

The prepend/zero-data_len/zero-pkt_len sequence works but is roundabout. Setting m->data_off = 0 directly would be cleaner and avoid the intermediate state where data_len is non-zero then immediately overwritten.

4) bufs[num_rx] fix looks correct

The change from bufs[i] to bufs[num_rx] fixes a pre-existing gap in the output array when packets are dropped via continue. Good catch.

Overall the structure looks solid — the scatter helper extraction is clean, the goto-based frame release avoids the earlier use-after-free risk, and the queue setup validation is correct.


Full detail:

## Review: [PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo frames

### Note: no mbuf leak in `eth_af_packet_rx_scatter()` on allocation failure

At first glance this looks like a leak: when `rte_pktmbuf_alloc()` fails inside the `while` loop, the function returns `-1` and the caller calls `rte_pktmbuf_free(mbuf)` on the head mbuf only, while earlier iterations have already linked segments via `m->next`. It is not a leak: `rte_pktmbuf_free()` walks the chain using `mbuf->nb_segs`, and because `nb_segs` is incremented in step with each `->next` link, the caller's single `rte_pktmbuf_free(mbuf)` frees the entire chain. No change needed. (Analysis kept here so you can verify the reasoning.)

### Error: `pkt_len` declared as `uint16_t` but `tp_snaplen` is `uint32_t` — silent truncation

`ppd->tp_snaplen` is `__u32` in the kernel's `tpacket2_hdr`. The patch declares `pkt_len` as `uint16_t`:

```c
uint16_t pkt_len;
...
pkt_len = ppd->tp_snaplen;
```

For jumbo frames up to 9KB this works, but `tp_snaplen` is a 32-bit field: any capture length above 65535 is silently truncated on assignment. More practically, the scatter helper also takes `uint16_t data_len`, capping the maximum receivable packet at 64KB.

Since this patch specifically targets jumbo frames (9KB), the practical risk is low, but the type should match the source. Consider using `uint32_t` for `pkt_len` and the `data_len` parameter of `eth_af_packet_rx_scatter()`, or at minimum add a bounds check / comment explaining why `uint16_t` is sufficient.

### Note: `pkt_len` is set before VLAN reinsertion — no issue

In the patched single-mbuf path, the assignment to `rte_pktmbuf_pkt_len(mbuf)` happens immediately before the VLAN/timestamp block:

```c
if (pkt_len <= rte_pktmbuf_tailroom(mbuf)) {
    memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, pkt_len);
    rte_pktmbuf_data_len(mbuf) = pkt_len;
} else if (...) { ... }

rte_pktmbuf_pkt_len(mbuf) = pkt_len;

/* check for vlan info */
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
    ...
    if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
```

`rte_vlan_insert()` reads `mbuf->pkt_len` to determine packet length, and by the time it runs the length has already been set, so the ordering is correct. No change needed.

### Warning: `rx_nombuf` not incremented on scatter allocation failure

In the original code, `rx_nombuf` is incremented when `rte_pktmbuf_alloc()` fails for the head mbuf. In the scatter path, when `rte_pktmbuf_alloc()` fails for a *chained* segment, only `rx_dropped_pkts` is incremented. This is arguably a no-mbuf condition and should also increment `rx_nombuf` so that `rte_eth_stats.rx_nombuf` accurately reflects memory pressure. The current code makes it impossible to distinguish "dropped because too large and no scatter" from "dropped because mbuf pool exhausted during scatter."

Suggested fix: increment `pkt_q->rx_nombuf` instead of (or in addition to) `rx_dropped_pkts` when scatter allocation fails.

### Warning: `rte_pktmbuf_prepend()` return value unchecked

In `eth_af_packet_rx_scatter()`:

```c
rte_pktmbuf_prepend(m, rte_pktmbuf_headroom(m));
m->pkt_len = 0;
m->data_len = 0;
```

`rte_pktmbuf_prepend()` can return `NULL` if the headroom requested exceeds available headroom (shouldn't happen for a freshly allocated mbuf, but it's defensive). More importantly, after `rte_pktmbuf_prepend()`, `data_len` is increased by the headroom amount, but then immediately overwritten to `0`. The `pkt_len` is similarly set to `0`. This sequence works but is fragile — a cleaner approach would be to directly manipulate `m->data_off = 0` to reclaim the headroom, which is what this is effectively doing. Consider:

```c
m->data_off = 0;
```

This is simpler, less error-prone, and avoids the intermediate state where `data_len` contains a non-zero value that is immediately discarded.

### Warning: `scatter_enabled` read from `rxmode.offloads` in `rx_queue_setup`

Reading `dev->data->dev_conf.rxmode` long after configure can go stale, but `rx_queue_setup` runs between `dev_configure` and `dev_start`, so checking `dev->data->dev_conf.rxmode.offloads` there is acceptable and matches common PMD practice. One caveat: if `mtu_set` is called after queue setup, the scatter state is not re-evaluated — the AF_PACKET PMD's `mtu_set` does not re-select Rx functions or re-check scatter. That is a pre-existing gap, not something introduced by this patch.

### Info: `bufs[i]` changed to `bufs[num_rx]` — correct fix for existing bug

The original code used `bufs[i]` which would leave gaps in the output array when packets are dropped (via `continue`). The change to `bufs[num_rx]` is correct and actually fixes a pre-existing bug in the drop path. Good catch in this patch.

### Info: Consider adding `rx_nombuf` tracking for the scatter failure case

As noted above, distinguishing "no mbuf" from "oversized drop" in statistics would be valuable for debugging. Minor suggestion for a follow-up.

