[PATCH v5 6/6] vhost: optimize memcpy routines when cc memcpy is used

Mattias Rönnblom hofors at lysator.liu.se
Mon Jul 29 21:27:01 CEST 2024


On 2024-07-29 13:00, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
>> Sent: Wednesday, 24 July 2024 09.54
> 
> Which packet mix was used for your tests? Synthetic IMIX, or some live data?
> 

I used the same test as was being done when the performance regression 
was demonstrated (i.e., 2x testpmd with fixed packet size).

>> +/* The code generated by GCC (and to a lesser extent, clang) with just
>> + * a straight memcpy() to copy packets is less than optimal on Intel
>> + * P-cores, for small packets. Thus the need of this specialized
>> + * memcpy() in builds where use_cc_memcpy is set to true.
>> + */
>> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
>> +static __rte_always_inline void
>> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
>> +{
>> +	void *dst = __builtin_assume_aligned(in_dst, 16);
>> +	const void *src = __builtin_assume_aligned(in_src, 16);
>> +
>> +	if (len <= 256) {
>> +		size_t left;
>> +
>> +		for (left = len; left >= 32; left -= 32) {
>> +			memcpy(dst, src, 32);
>> +			dst = RTE_PTR_ADD(dst, 32);
>> +			src = RTE_PTR_ADD(src, 32);
>> +		}
>> +
>> +		memcpy(dst, src, left);
>> +	} else
> 
> Although the packets within a burst often have similar size, I'm not sure you can rely on the dynamic branch predictor here.
> 

I agree that the pktcpy() routine will likely often suffer a 
size-related branch mispredict with real packet size mix. A benchmark 
with a real packet mix would be much better than the tests I've run.

This needs to be compared, of course, with the overhead imposed by 
conditionals included in other pktcpy() implementations.

> Looking at the ethdev packet size counters at an ISP (at the core of their Layer 3 network), 71 % are 256 byte or larger [1].
> 
> For static branch prediction, I would consider > 256 more likely and swap the two branches, i.e. compare (len > 256) instead of (len <= 256).
> 

OK, I'll add likely() instead, to make it more explicit.

> But again: I don't know how the dynamic branch predictor behaves here. Perhaps my suggested change makes no difference.
> 

I think it will, but it will be tiny. From what I understand, even when 
the branch prediction guessed correctly, one receive a slight benefit if 
the branch is not taken.

>> +		memcpy(dst, src, len);
>> +}
> 
> With or without suggested change,
> Acked-by: Morten Brørup <mb at smartsharesystems.com>
> 
> 
> [1]: Details (incl. one VLAN tag)
> tx_size_64_packets            1,1 %
> tx_size_65_to_127_packets    25,7 %
> tx_size_128_to_255_packets    2,6 %
> tx_size_256_to_511_packets    1,4 %
> tx_size_512_to_1023_packets   1,7 %
> tx_size_1024_to_1522_packets 67,6 %
> 


More information about the dev mailing list