[PATCH v2 1/1] examples/l2fwd-jobstats: add delay to show stats

Stephen Hemminger stephen at networkplumber.org
Mon Mar 2 17:23:28 CET 2026


On Mon, 29 Jul 2024 11:50:37 +0530
Rakesh Kudurumalla <rkudurumalla at marvell.com> wrote:

> This patch addresses the issue by introducing a delay
> before acquiring the lock in the loop. This delay allows for better
> availability of the lock, ensuring that show_lcore_stats() can
> periodically update the statistics even when forwarding jobs are running.
> 
> Fixes: 204896f8d66c ("examples/l2fwd-jobstats: add new example")
> Cc: stable at dpdk.org
> 
> Signed-off-by: Rakesh Kudurumalla <rkudurumalla at marvell.com>

A more long-winded AI analysis of why this example is an architectural mess.


---

## 1. The Main Loop is Structurally Convoluted

The `l2fwd_main_loop()` is the core problem. It uses three nested loops with a spinlock held across the entire active processing path:

```c
for (;;) {                          // outer: infinite
    rte_spinlock_lock(&qconf->lock);
    do {                            // middle: context loop
        // idle job inner loop
        do {                        // inner: busy-wait polling timers
            ...
        } while (!need_manage);
        rte_timer_manage();
    } while (stats_read_pending == 0);
    rte_spinlock_unlock(&qconf->lock);
    rte_pause();
}
```

**Problems:**

The **spinlock is held during all packet processing**. It's only released briefly when the stats reader sets `stats_read_pending`, which means `show_lcore_stats()` on the main thread must wait for the datapath to notice the flag *and* complete a full context cycle before it can acquire the lock. This creates an unpredictable latency spike for stats collection and makes the lock contention window essentially unbounded during normal forwarding.

The **inner idle loop is a raw busy-poll** comparing `timer.expire < now` for every RX timer plus the flush timer on each iteration. With 16 ports per lcore (MAX_RX_QUEUE_PER_LCORE), that's 17 comparisons per spin. There's no `rte_pause()` inside the idle loop — only outside the outer spinlock unlock. The recent patch you may have seen (replacing `rte_pause()` with `rte_delay_us(10)` after unlock) treats the symptom, not the disease.

The **triple-loop structure makes the control flow genuinely hard to reason about**. The middle `do...while` tests `stats_read_pending` which is set by a remote lcore, but the inner loop checks it too (for `need_manage`). The idle job abort-vs-finish logic (`repeats != 1`) is a subtle optimization that's easy to misread — if the idle loop body executes exactly once, the job is aborted rather than finished, meaning the timer was already expired on entry.

## 2. The Flush Job Has an Inverted Condition (Possible Bug)

In `l2fwd_flush_job()`:

```c
if (qconf->next_flush_time[portid] <= now)
    continue;
```

This **skips flushing when the time has expired**, which is backwards. You'd expect to flush when `now >= next_flush_time`, i.e. skip when `now < next_flush_time`. This means TX buffers only get flushed when they *haven't* expired yet — the opposite of the intended drain behavior. This could cause packets to sit in TX buffers far longer than the intended 100μs drain interval, contributing directly to the "slow" feeling at low traffic rates.

Additionally, `lcore_id` and `qconf` are fetched twice at the top of this function for no reason.
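A minimal sketch of the corrected predicate, pulled out into a standalone helper so it can be checked in isolation (the `should_flush` name and freestanding form are illustrative, not the example's code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the drain check in l2fwd_flush_job():
 * skip only while the deadline is still in the future, i.e. the loop
 * should read `if (now < qconf->next_flush_time[portid]) continue;`
 * rather than the inverted `<=` test quoted above. */
static bool should_flush(uint64_t next_flush_time, uint64_t now)
{
    return now >= next_flush_time; /* drain once the interval expires */
}
```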

## 3. MAC Address Manipulation via Aliasing Violation

In `l2fwd_simple_forward()`:

```c
tmp = &eth->dst_addr.addr_bytes[0];
*((uint64_t *)tmp) = 0x000000000002 + ((uint64_t)dst_port << 40);
```

This writes 8 bytes into a 6-byte MAC address through a `uint64_t *` cast, which is both a strict-aliasing violation and an out-of-bounds write: the 2 excess bytes land in the adjacent source MAC field. The intent is to write `02:00:00:00:00:xx` as the destination MAC, and it "works" on little-endian x86 only because the clobbered `src_addr` bytes are immediately overwritten by the `rte_ether_addr_copy()` below. But this is fragile, non-portable, and undefined behavior. On a big-endian architecture it produces a completely wrong MAC.
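A portable fix writes the six bytes through the struct field instead of punning through a wider type. A minimal sketch, with `ether_addr_model` standing in for DPDK's `struct rte_ether_addr` so the snippet is self-contained:

```c
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN 6

/* Minimal stand-in for struct rte_ether_addr -- a real fix would use
 * <rte_ether.h> directly. */
struct ether_addr_model {
    uint8_t addr_bytes[ETHER_ADDR_LEN];
};

/* Build the 02:00:00:00:00:<port> destination MAC byte by byte:
 * no aliasing cast, no out-of-bounds store, endian-independent. */
static void set_dst_mac(struct ether_addr_model *dst, uint16_t dst_port)
{
    static const uint8_t base[ETHER_ADDR_LEN] = { 0x02, 0, 0, 0, 0, 0 };

    memcpy(dst->addr_bytes, base, ETHER_ADDR_LEN);
    dst->addr_bytes[5] = (uint8_t)dst_port;
}
```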

## 4. Global Mutable State Everywhere

The application uses an excessive amount of file-scope globals:

- `l2fwd_ports_eth_addr[RTE_MAX_ETHPORTS]`
- `l2fwd_enabled_port_mask`
- `l2fwd_dst_ports[RTE_MAX_ETHPORTS]`
- `lcore_queue_conf[RTE_MAX_LCORE]`
- `tx_buffer[RTE_MAX_ETHPORTS]`
- `port_statistics[RTE_MAX_ETHPORTS]`
- `l2fwd_pktmbuf_pool`
- `hz`, `drain_tsc`, `timer_period`

These are all `RTE_MAX_ETHPORTS`/`RTE_MAX_LCORE` sized arrays, statically allocated regardless of actual usage. `port_statistics` is accessed from both the datapath lcores (writes) and the stats display callback (reads) with **no synchronization** — the stats numbers can tear on platforms without naturally atomic 64-bit loads. The `memset(&port_statistics, 0, ...)` inside the per-port init loop also clears all port stats on each port init, not just the current port's.
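One way to make the counters tear-free is C11 relaxed atomics on the hot path; a minimal sketch, with `port_stats_model` as an illustrative stand-in for the example's `l2fwd_port_statistics` struct:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative per-port stats with relaxed atomics: the datapath
 * increments are still cheap, and the display thread's loads can no
 * longer tear on platforms without atomic 64-bit plain loads. */
struct port_stats_model {
    atomic_uint_fast64_t rx;
    atomic_uint_fast64_t tx;
    atomic_uint_fast64_t dropped;
};

static void stats_add_rx(struct port_stats_model *s, uint64_t n)
{
    atomic_fetch_add_explicit(&s->rx, n, memory_order_relaxed);
}

static uint64_t stats_read_rx(struct port_stats_model *s)
{
    return atomic_load_explicit(&s->rx, memory_order_relaxed);
}
```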

## 5. The Timer-Driven Architecture Adds Complexity Without Proportional Benefit

The original l2fwd example uses a simple poll loop: check TSC, drain if needed, burst RX, forward. This jobstats variant replaces that with:

- `rte_timer` for each RX port + flush timer per lcore
- `rte_jobstats` wrappers around each timer callback
- A custom `l2fwd_job_update_cb` that adjusts polling frequency via a hysteresis algorithm with asymmetric step sizes (`UPDATE_STEP_UP = 1`, `UPDATE_STEP_DOWN = 32`)

The adaptive polling idea (poll less when idle, poll more when busy) is sound in theory, but the implementation has issues. The asymmetric steps mean the polling period grows by only 1 TSC tick per update when traffic is light but shrinks by 32 ticks per update when busy, so backing off during an idle stretch takes roughly 32x as many updates as snapping back when traffic returns — on bursty workloads the backoff barely engages before it is undone. The target of `MAX_PKT_BURST` (32 packets) as the "optimal" value is hard-coded with no way to tune it.

The `rte_timer_manage()` call itself is not free — it walks the timer skip list on each invocation, adding overhead that a direct TSC comparison in plain l2fwd avoids.
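For contrast, the plain-l2fwd approach amounts to one wraparound-safe subtraction and compare per loop iteration. A minimal sketch, with `drain_due()` as an illustrative helper and the caller passing in `rte_rdtsc()` values:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative drain check in the style of plain l2fwd: unsigned
 * subtraction makes the comparison safe across TSC wraparound, and
 * there is no timer list to walk. */
static bool drain_due(uint64_t now, uint64_t *prev_tsc, uint64_t drain_tsc)
{
    if (now - *prev_tsc >= drain_tsc) {
        *prev_tsc = now;   /* start the next drain interval */
        return true;
    }
    return false;
}
```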

## 6. The Double RX Burst is Questionable

```c
total_nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst, MAX_PKT_BURST);
// ... forward all ...
if (total_nb_rx == MAX_PKT_BURST) {
    const uint16_t nb_rx = rte_eth_rx_burst(portid, 0, pkts_burst, MAX_PKT_BURST);
    // ... forward these too ...
}
```

The comment says this "allows rte_jobstats logic to see if this function must be called more frequently." The total is fed to `rte_jobstats_finish()` which compares against the target. But this means `total_nb_rx` can be up to 64 while the target is 32, and the jobstats update callback sees `result > target` and decreases the period. The asymmetric step sizes make this interaction even more unpredictable. This also means the prefetch pattern is suboptimal — the first batch is prefetched and forwarded sequentially (prefetch one, forward one), which doesn't give the prefetch enough lead time to be useful.
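The conventional fix for the prefetch lead-time problem is the software-pipelined pattern most DPDK examples use: prefetch a few packets ahead while forwarding the current one. A minimal sketch with generic pointers and an injected `process` callback standing in for `l2fwd_simple_forward()` (the offset of 3 is illustrative):

```c
/* Prefetch PREFETCH_OFFSET packets ahead of the one being processed,
 * so each prefetch has several packets' worth of work to complete in. */
#define PREFETCH_OFFSET 3

/* Demo callback used by the self-test: counts processed packets. */
static unsigned processed_count;
static void count_packet(void *pkt) { (void)pkt; processed_count++; }

static void forward_burst(void **pkts, unsigned nb_rx,
                          void (*process)(void *))
{
    unsigned i;

    /* Prime the pipeline with the first few prefetches. */
    for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++)
        __builtin_prefetch(pkts[i]);

    /* Process packet i while prefetching packet i + PREFETCH_OFFSET. */
    for (i = 0; i < nb_rx; i++) {
        if (i + PREFETCH_OFFSET < nb_rx)
            __builtin_prefetch(pkts[i + PREFETCH_OFFSET]);
        process(pkts[i]);
    }
}
```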

## 7. Argument Parsing Edge Cases

`l2fwd_parse_portmask()` returns an `unsigned long` squeezed through `int` (with 0 meaning error), so a legitimate mask of 0 is indistinguishable from a parse error. On 64-bit systems where `unsigned long` is 64 bits but `l2fwd_enabled_port_mask` is `uint32_t`, masks above 0xFFFFFFFF silently truncate. And the port-mask test `(1 << portid)` used throughout the code is undefined behavior for `portid >= 31`, since the `1` is a signed `int`.
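A sketch of a parse that keeps the value and the error channel separate (function name and signature are illustrative):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Return 0 on success with *mask set, -1 on error. A mask of 0 is a
 * valid parse result here; whether to reject it is the caller's call. */
static int parse_portmask(const char *s, uint32_t *mask)
{
    char *end = NULL;
    unsigned long v;

    errno = 0;
    v = strtoul(s, &end, 16);
    if (errno != 0 || end == s || *end != '\0' || v > UINT32_MAX)
        return -1;   /* not hex, trailing garbage, or out of range */

    *mask = (uint32_t)v;
    return 0;
}
```

With this shape, membership tests can also avoid the shift UB by using `(mask & (UINT32_C(1) << portid))` after bounds-checking `portid` against 32.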

## 8. No Graceful Shutdown

There's no signal handler. The only way to stop the application is to kill it, leaving ports in an undefined state (promiscuous mode enabled, potentially with traffic still queued in TX buffers). The `rte_eal_cleanup()` at the end is unreachable during normal operation since `l2fwd_main_loop()` never returns.
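Plain l2fwd already demonstrates the standard pattern; a minimal sketch of the same idea (the `force_quit` flag name matches what other DPDK examples use):

```c
#include <signal.h>

/* Flag checked by the worker loops; sig_atomic_t is safe to set from
 * a signal handler. */
static volatile sig_atomic_t force_quit;

static void signal_handler(int signum)
{
    if (signum == SIGINT || signum == SIGTERM)
        force_quit = 1;
}

static void install_signal_handlers(void)
{
    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);
}
```

The worker loop then becomes `while (!force_quit) { ... }`, and after `rte_eal_mp_wait_lcore()` the main thread can stop and close the ports before reaching `rte_eal_cleanup()`.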

## Summary Recommendations

If you're looking at cleaning this up or using it as a basis for review guidelines:

The flush condition inversion should be treated as a bug fix. The main loop should be restructured to separate the stats synchronization concern from the packet processing path — a lock-free stats snapshot approach (double-buffering or seqlock) would eliminate the spinlock entirely. The MAC address write needs to use `rte_ether_addr` properly. The adaptive polling logic needs tunable parameters and a more balanced step function. And ideally, the example should be reconciled with plain l2fwd to show clearly what jobstats *adds* rather than being a near-complete fork with structural divergence.
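The seqlock idea can be sketched in a few lines. This is an illustrative single-writer model using seq_cst atomics for simplicity; a real cleanup should prefer `rte_seqlock.h`, which newer DPDK releases provide:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Writer bumps the sequence to odd before touching the data and back
 * to even after; the reader retries until it sees an even, unchanged
 * sequence bracketing its copy. */
struct stats_seq {
    atomic_uint seq;
    uint64_t rx, tx;
};

static void stats_write(struct stats_seq *s, uint64_t rx, uint64_t tx)
{
    atomic_fetch_add(&s->seq, 1);   /* odd: update in progress */
    s->rx = rx;
    s->tx = tx;
    atomic_fetch_add(&s->seq, 1);   /* even again: snapshot stable */
}

static void stats_read(struct stats_seq *s, uint64_t *rx, uint64_t *tx)
{
    unsigned before, after;

    do {
        before = atomic_load(&s->seq);
        *rx = s->rx;
        *tx = s->tx;
        after = atomic_load(&s->seq);
    } while ((before & 1) || before != after);
}
```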

