[PATCH v5 6/6] vhost: optimize memcpy routines when cc memcpy is used
Mattias Rönnblom
hofors at lysator.liu.se
Mon Jul 29 21:27:01 CEST 2024
On 2024-07-29 13:00, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
>> Sent: Wednesday, 24 July 2024 09.54
>
> Which packet mix was used for your tests? Synthetic IMIX, or some live data?
>
I used the same test as was being done when the performance regression
was demonstrated (i.e., 2x testpmd with fixed packet size).
>> +/* The code generated by GCC (and to a lesser extent, clang) with just
>> + * a straight memcpy() to copy packets is less than optimal on Intel
>> + * P-cores, for small packets. Thus the need of this specialized
>> + * memcpy() in builds where use_cc_memcpy is set to true.
>> + */
>> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
>> +static __rte_always_inline void
>> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
>> +{
>> + void *dst = __builtin_assume_aligned(in_dst, 16);
>> + const void *src = __builtin_assume_aligned(in_src, 16);
>> +
>> + if (len <= 256) {
>> + size_t left;
>> +
>> + for (left = len; left >= 32; left -= 32) {
>> + memcpy(dst, src, 32);
>> + dst = RTE_PTR_ADD(dst, 32);
>> + src = RTE_PTR_ADD(src, 32);
>> + }
>> +
>> + memcpy(dst, src, left);
>> + } else
>
> Although the packets within a burst often have similar size, I'm not sure you can rely on the dynamic branch predictor here.
>
I agree that the pktcpy() routine will likely often suffer a
size-related branch mispredict with real packet size mix. A benchmark
with a real packet mix would be much better than the tests I've run.
This needs to be compared, of course, with the overhead imposed by
conditionals included in other pktcpy() implementations.
> Looking at the ethdev packet size counters at an ISP (at the core of their Layer 3 network), 71 % are 256 byte or larger [1].
>
> For static branch prediction, I would consider > 256 more likely and swap the two branches, i.e. compare (len > 256) instead of (len <= 256).
>
OK, I'll add likely() instead, to make it more explicit.
> But again: I don't know how the dynamic branch predictor behaves here. Perhaps my suggested change makes no difference.
>
I think it will, but it will be tiny. From what I understand, even when
the branch prediction guessed correctly, one receive a slight benefit if
the branch is not taken.
>> + memcpy(dst, src, len);
>> +}
>
> With or without suggested change,
> Acked-by: Morten Brørup <mb at smartsharesystems.com>
>
>
> [1]: Details (incl. one VLAN tag)
> tx_size_64_packets 1,1 %
> tx_size_65_to_127_packets 25,7 %
> tx_size_128_to_255_packets 2,6 %
> tx_size_256_to_511_packets 1,4 %
> tx_size_512_to_1023_packets 1,7 %
> tx_size_1024_to_1522_packets 67,6 %
>
More information about the dev
mailing list