[PATCH v11] eal/x86: optimize memcpy of small sizes

Stephen Hemminger stephen at networkplumber.org
Fri May 22 00:42:14 CEST 2026


On Thu, 21 May 2026 18:56:31 +0000
Morten Brørup <mb at smartsharesystems.com> wrote:

> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
> 
> Furthermore, __rte_restrict was added to source and destination addresses.
> 
> Also, the missing implementation of rte_mov48() was added.
> 
> Until recently, some drivers required disabling stringop-overflow warnings
> when using rte_memcpy().
> For some strange reason, these warnings were disabled in the rte_memcpy
> header file, instead of in the problematic drivers.
> With series-38174 ("remove use of rte_memcpy from net/intel"), the
> problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> so disabling these warnings is no longer required, and was removed.
> 
> Regarding performance...
> The memcpy performance test (cache-to-cache copy) shows:
> Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> Copying 64 bytes takes 4 cycles, versus 7 cycles before.
> 
> Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
> 
> Signed-off-by: Morten Brørup <mb at smartsharesystems.com>
> Acked-by: Bruce Richardson <bruce.richardson at intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev at huawei.com>
> ---

Here is the full wordy all providers reviews.
Mostly this is test to make sure script works.

$ ./devtools/ai/compare-patch-reviews.sh ~/Downloads/v11-eal-x86-optimize-memcpy-of-small-sizes.patch 
Reviewing patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Providers: anthropic,openai,xai,google
Format: text
========================================

>>> Running review with: anthropic

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Claude) ===
Review date: 2026-05-21

# Review of v11-eal-x86-optimize-memcpy-of-small-sizes.patch

## Correctness Issues

### Errors

**Use-after-free pattern in overlapping copy optimization (lines 672-673, 684, 704, 707)**

The pattern `(uint8_t *)dst - 16 + n` performs pointer arithmetic that, when `n < 16`, results in accessing memory before the `dst` pointer. This is used in an optimization technique (two overlapping stores) that is correct when `n >= 16`, but the code paths in question guarantee `n` is in specific ranges where this is safe. However, the readability is poor and the intent is not obvious from the code.

**Confidence: 80%** - This pattern is widely used in memcpy implementations and appears intentional, but deserves a comment explaining the technique to avoid confusion in future reviews.

---

## Warnings

**Missing documentation for new function rte_mov48() (line 130)**

The patch adds `rte_mov48()` but does not mention it in the commit message's "Also, the missing implementation of rte_mov48() was added." The implementation itself is correct, but there's no context about why it was missing or whether it's used elsewhere in DPDK.

**Suggestion:** Verify that `rte_mov48()` is actually used somewhere, or document why it's being added speculatively.

---

**Inconsistent handling of compile-time constant size (lines 682-683)**

The code checks `__rte_constant(n) && n == 16` to avoid a "harmless duplicate copy" but does not apply this optimization consistently. For example:
- Line 682: avoids duplicate for `n == 16`
- Line 690: avoids duplicate for `n == 32`
- Line 693: avoids duplicate for `n == 64`

But in the 33-64 byte range (lines 695-707), there's no similar optimization. This is not a bug, but the inconsistency in optimization strategy could be noted.

**Suggestion:** Add a comment explaining the rationale for when this optimization matters vs when the "harmless duplicate" is acceptable.

---

**Potential alignment assumption in AVX path (line 699)**

The AVX path at line 699 uses two overlapping `rte_mov32()` calls for sizes 33-64 bytes:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
```

For `n = 33`, the second call accesses `dst + 1` and `src + 1`, which may be misaligned for the 32-byte `__m256i` operations. However, `_mm256_loadu_si256()` and `_mm256_storeu_si256()` are documented as supporting unaligned access, so this is safe.

**Confidence: 90%** - Not a bug, but worth noting that the AVX path relies on unaligned load/store support.

---

## Style and Process Issues

### Warnings

**Unconventional comment style in rte_mov15_or_less() (lines 67-72)**

The multi-line comment explaining the rationale for the `rte_uint*_alias` structures does not follow the standard DPDK comment block format with the `*` on each line. However, this may be intentional for a technical explanation.

---

**Return value inconsistency (lines 682, 690-691, 693, 707-708)**

The function returns `dst` in some branches but the original `ret` pointer in others:
- Lines 682, 690, 707: return `dst`
- Line 714: return from `rte_memcpy_aligned_more_than_64(dst, src, n)` which returns `ret = dst`
- Line 716: return from `rte_memcpy_generic_more_than_64(dst, src, n)` which returns `ret = dst`

Both `ret` and `dst` point to the same location, so this is harmless, but using `dst` consistently would be clearer. The patch description states "Pointer to the destination data" is returned, and both satisfy this.

**Confidence: 95%** - Not an error, but inconsistent style within the same function.

---

## Documentation Issues

### Warnings

**Missing documentation update for removed warning suppression (lines removed at top)**

The patch removes:
```c
#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
```

The commit message explains this is because the problematic drivers were updated, with a `Depends-on: series-38174`. However, there's no comment in the code explaining why the suppression was removed or a note in `doc/guides/rel_notes/` documenting this change.

**Suggestion:** Add a release note entry documenting the removal of the warning suppression and the dependency on the driver updates.

---

## Summary

**Priority findings:**

1. **(Info)** The overlapping copy optimization `dst - 16 + n` is correct but deserves a comment explaining the technique.
2. **(Warning)** Missing documentation for why `rte_mov48()` was added.
3. **(Warning)** Missing release notes entry for removal of stringop-overflow warning suppression.
4. **(Info)** Return value uses both `dst` and `ret` inconsistently (harmless but could be unified).

**Overall assessment:** The patch appears functionally correct. The main concerns are documentation completeness (missing context for `rte_mov48()`, missing release notes for warning removal) and code clarity (overlapping copy pattern could use explanatory comments). No blocking correctness issues were identified.
(anthropic: review reported errors)

========================================

>>> Running review with: openai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via ChatGPT) ===
Review date: 2026-05-21

**DPDK Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

---

### Errors

None found.

- The patch does not introduce resource leaks, use-after-free, race conditions, or missing synchronization.
- There are no buffer overflows or out-of-bounds accesses in the memcpy routines; all copy size calculations and code paths for small copies (<64 bytes) are explicit and conservative.
- No uninitialized or stale variable use detected.
- All error paths are clean (these are computational functions; return values are always valid).

---

### Warnings

1. **Potential inconsistent comment phrasing regarding overlap requirements.**

   - Some Doxygen comments use "locations should not overlap" while others now correctly say "locations must not overlap".
   - While this isn't a correctness bug, consistent strong wording clarifies requirement for users.

   **Suggested fix:** Standardize all relevant Doxygen comments to use "locations must not overlap."

   ```c
   /**
    * Copy bytes from one location to another,
    * locations must not overlap.
    */
   ```

2. **Redundant Doxygen param alignment in function headers.**

   - Some parameter descriptions ("Pointer to the destination of the data"/"Pointer to the destination data") remain a bit unclear or inconsistent. If updating comments, consider making param descriptions exact and explicit.

   **Suggested fix:** For each `@param` clarify if it's the destination or source pointer. For `rte_memcpy`, `"Pointer to the destination."` and `"Pointer to the source."` suffice.

---

### Info

- Use of `__rte_restrict` is correct and improves optimization; matches DPDK style for pointer restrict qualifiers.
- Removal of `#pragma GCC diagnostic ignored "-Wstringop-overflow"` is correct because the problematic drivers now use standard `memcpy()`.
- All new `rte_mov*` functions properly use `static __rte_always_inline` and restrict-qualified pointers.
- Macro duplication is avoided by consolidation, in line with maintainability guidelines.
- No forbidden tokens, banned constructs, or use of deprecated APIs/macros were added.
- No symbol namespace pollution or missing `rte_`/`RTE_` prefix issues present.
- Coding style is maintained in indentation, tabs use, and function/variable names.
- No API, public header, or C11 atomic/volatile/race issues are present.
- No documentation or API tag requirements are triggered (no public API change).
- Patch is self-contained and does not require changes to release notes or test infrastructure.
- No RST or external documentation is involved in this patch.

---

### Summary

**No correctness/blocking issues found.**
Minor documentation consistency updates suggested for comment clarity, but these do not affect correctness.

**This patch is in good shape and complies with DPDK coding and architectural guidelines.**
(openai: review reported errors)

========================================

>>> Running review with: xai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Grok) ===
Review date: 2026-05-21

**Review of patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

### Errors

**1. Missing `__rte_restrict` in rte_mov48() implementation (consistency / correctness risk)**  
The new `rte_mov48()` function is declared with `__rte_restrict`, but the implementation casts away the restrict qualifier when calling `rte_mov32()` and `rte_mov16()`:

```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
```

**Suggested fix**: Keep the restrict-qualified parameters in the implementation or cast only after documenting the aliasing contract. This is a minor style inconsistency but worth fixing for uniformity.

**2. Potential sign-extension issue on 64-bit offset calculations (left-shift of narrow type)**  
In `rte_mov15_or_less()`, the expression `(n & 8)` is `size_t` (ok), but the subsequent pointer arithmetic:

```c
*(uint64_t *)((uint8_t *)dst + n - 8) = *(const uint64_t *)((const uint8_t *)src + n - 8);
```

relies on implicit promotion. While currently safe, the pattern of adding a `uint8_t`/`size_t` value to a pointer after a narrow mask (`n & 7`) could trigger Coverity SIGN_EXTENSION warnings on some toolchains. Consider an explicit cast:

```c
uint8_t *d = (uint8_t *)dst + (size_t)(n - 8);
```

**3. Use of `__rte_constant` inside hot path without documented behavior**  
The code repeatedly uses `__rte_constant(n)` to optimize compile-time known sizes. This is a DPDK internal macro. While acceptable for performance-critical code, it should be documented (or at least noted) that the optimization only triggers for constant `n`. No correctness bug, but worth a comment.

### Warnings

**1. Duplicate function definitions for AVX/SSE variants**  
The patch introduces three nearly identical `rte_memcpy_generic_more_than_64()` and `rte_memcpy_aligned_more_than_64()` functions (one per vector size). This increases maintenance burden and instruction cache pressure. Consider factoring the common >64-byte logic into a single helper.

**2. Hard-coded cast chains in rte_mov48()**  
The SSE implementation of `rte_mov48()` uses explicit `(uint8_t *)` casts on restrict-qualified pointers. While functionally correct, it reduces readability. A cleaner pattern would be:

```c
rte_mov16(dst + 0, src + 0);
rte_mov16(dst + 16, src + 16);
rte_mov16(dst + 32, src + 32);
```

**3. Inconsistent use of `return dst` vs `return ret`**  
In the new `rte_memcpy()` implementation the fast path returns `dst`, while the old `rte_memcpy_generic()` returned the saved `ret`. Although `dst` and `ret` are identical, using `ret` consistently would avoid any future divergence.

**4. Missing Doxygen for new rte_mov48()**  
The new function has a Doxygen comment, but the description is minimal. Consider adding a brief note that it is an internal helper (or mark it `__rte_internal` if it should not be public).

### Info

- The removal of the `#pragma GCC diagnostic` block for `-Wstringop-overflow` is correct and aligns with the series that removed `rte_memcpy` usage from drivers.
- Adding `__rte_restrict` to `rte_memcpy()` and helper functions is a good improvement and matches modern DPDK style.
- Performance claims in the commit message are outside the scope of this review (they are not code).

### Summary

No critical correctness bugs (use-after-free, leaks, races, or missing error checks) were identified. The main issues are minor style/consistency items around restrict qualifiers, pointer arithmetic, and code duplication. The patch is otherwise clean and follows DPDK coding guidelines for the x86 EAL memcpy implementation.
(xai: review reported errors)

========================================

>>> Running review with: google

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Gemini) ===
Review date: 2026-05-21

The following is a review of the patch against DPDK coding standards and guidelines.

### Warnings

**Implicit integer comparison in rte_memcpy.**
DPDK coding style requires explicit comparison against 0 for integer types, rather than using truthiness/logical negation.
```c
/* Current implementation */
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))

/* Suggested fix */
if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
```

### Info

**Use of bitwise logic in rte_mov15_or_less.**
While this patch only adds `__rte_restrict` to the signature, the existing
(google: review reported warnings)

========================================

Review comparison complete.
Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0



More information about the dev mailing list