[PATCH v5 6/6] vhost: optimize memcpy routines when cc memcpy is used

Morten Brørup mb at smartsharesystems.com
Mon Jul 29 21:56:27 CEST 2024


> From: Mattias Rönnblom [mailto:hofors at lysator.liu.se]
> Sent: Monday, 29 July 2024 21.27
> 
> On 2024-07-29 13:00, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom at ericsson.com]
> >> Sent: Wednesday, 24 July 2024 09.54
> >
> > Which packet mix was used for your tests? Synthetic IMIX, or some live data?
> >
> 
> I used the same test as was being done when the performance regression
> was demonstrated (i.e., 2x testpmd with fixed packet size).
> 
> >> +/* The code generated by GCC (and to a lesser extent, clang) with just
> >> + * a straight memcpy() to copy packets is less than optimal on Intel
> >> + * P-cores, for small packets. Thus the need for this specialized
> >> + * memcpy() in builds where use_cc_memcpy is set to true.
> >> + */
> >> +#if defined(RTE_USE_CC_MEMCPY) && defined(RTE_ARCH_X86_64)
> >> +static __rte_always_inline void
> >> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> >> +{
> >> +	void *dst = __builtin_assume_aligned(in_dst, 16);
> >> +	const void *src = __builtin_assume_aligned(in_src, 16);
> >> +
> >> +	if (len <= 256) {
> >> +		size_t left;
> >> +
> >> +		for (left = len; left >= 32; left -= 32) {
> >> +			memcpy(dst, src, 32);
> >> +			dst = RTE_PTR_ADD(dst, 32);
> >> +			src = RTE_PTR_ADD(src, 32);
> >> +		}
> >> +
> >> +		memcpy(dst, src, left);
> >> +	} else
> >
> > Although the packets within a burst often have similar size, I'm not sure you can rely on the dynamic branch predictor here.
> >
> 
> I agree that the pktcpy() routine will likely often suffer a
> size-related branch mispredict with a real packet size mix. A benchmark
> with a real packet mix would be much better than the tests I've run.
> 
> This needs to be compared, of course, with the overhead imposed by
> conditionals included in other pktcpy() implementations.

If testing with a fixed packet size, only one of the branches is taken - always!
And thus the branch predictor will always predict it correctly - in the test.

So, if this code performs better than a plain memcpy(), I can conclude that you are testing with a packet size <= 256.

> 
> > Looking at the ethdev packet size counters at an ISP (at the core of their Layer 3 network), 71 % are 256 bytes or larger [1].
> >
> > For static branch prediction, I would consider > 256 more likely and swap the two branches, i.e. compare (len > 256) instead of (len <= 256).
> >
> 
> OK, I'll add likely() instead, to make it more explicit.
> 
> > But again: I don't know how the dynamic branch predictor behaves here. Perhaps my suggested change makes no difference.
> >
> 
> I think it will, but the difference will be tiny. From what I understand,
> even when the branch predictor guesses correctly, there is a slight
> benefit if the branch is not taken.
> 
> >> +		memcpy(dst, src, len);
> >> +}
> >
> > With or without suggested change,
> > Acked-by: Morten Brørup <mb at smartsharesystems.com>
> >
> >
> > [1]: Details (incl. one VLAN tag)
> > tx_size_64_packets            1.1 %
> > tx_size_65_to_127_packets    25.7 %
> > tx_size_128_to_255_packets    2.6 %
> > tx_size_256_to_511_packets    1.4 %
> > tx_size_512_to_1023_packets   1.7 %
> > tx_size_1024_to_1522_packets 67.6 %
> >
