[dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling

Jan Viktorin viktorin at rehivetech.com
Thu Jan 12 16:02:56 CET 2017


On Thu, 12 Jan 2017 10:30:58 +0800
Yuanhan Liu <yuanhan.liu at linux.intel.com> wrote:

> On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > 2017-01-11 12:27, Yuanhan Liu:  
> > > The fact that the virtio net header is initialized to zero in the PMD driver
> > > init stage means that these costly writes are unnecessary and could
> > > be avoided:
> > > 
> > >     if (hdr->csum_start != 0)
> > >         hdr->csum_start = 0;
> > > 
> > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > performance drop introduced by TSO enabling is recovered: it could
> > > be up to 20% in micro benchmarking.  
> > 
> > This patch is adding a condition to assignments.
> > We need a benchmark on other architectures like ARM. Please anyone?  
> 
> I think the cost of the condition should be far lower than the penalty
> introduced by the cache issue, so I don't see it performing badly on
> other platforms.
> 
> But, of course, testing is always welcome!
> 
> 	--yliu

Hello,

We've done a synthetic measurement. The principle, briefly:

== Without condition check ==

start = gettimeofday();

for (i = 0; i < 1024*1024*128; ++i) {
	hdr->csum_start = 0;
	hdr->csum_offset = 0;
	hdr->flags = 0;
}

end = gettimeofday();


== With condition check ==

start = gettimeofday();

for (i = 0; i < 1024*1024*128; ++i) {
	ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
	ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
	ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
}

end = gettimeofday();


== Results ==

Computed as total time of all threads:

for i = 1..THREAD_COUNT:
	result += end[i] - start[i]
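
For anyone who wants to repeat this on another platform, here is a minimal sketch of such a harness. It is not the exact program we ran. Assumptions: `struct fake_hdr`, `run_bench()` and `MAX_THREADS` are hypothetical names, the struct only stands in for the three virtio net header fields used above, and each thread gets its own header packed next to the others so that the threads contend for the same cache lines. The volatile pointer is there only to keep the compiler from dropping the loops.

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/time.h>

/* Macro as quoted from the patch. */
#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
	if ((var) != (val))			\
		(var) = (val);			\
} while (0)

#define MAX_THREADS 16

/* Hypothetical stand-in for the header fields touched in the loops. */
struct fake_hdr {
	uint8_t  flags;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Assumption: one header per thread, packed together (zero-initialized). */
static struct fake_hdr hdrs[MAX_THREADS];

struct worker {
	volatile struct fake_hdr *hdr;
	unsigned long iters;
	int with_check;
	double elapsed_ms;
};

static double tv_diff_ms(const struct timeval *start, const struct timeval *end)
{
	return (end->tv_sec - start->tv_sec) * 1000.0 +
	       (end->tv_usec - start->tv_usec) / 1000.0;
}

static void *worker_main(void *arg)
{
	struct worker *w = arg;
	volatile struct fake_hdr *hdr = w->hdr;
	struct timeval start, end;
	unsigned long i;

	gettimeofday(&start, NULL);
	if (w->with_check) {
		for (i = 0; i < w->iters; ++i) {
			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
		}
	} else {
		for (i = 0; i < w->iters; ++i) {
			hdr->csum_start = 0;
			hdr->csum_offset = 0;
			hdr->flags = 0;
		}
	}
	gettimeofday(&end, NULL);
	w->elapsed_ms = tv_diff_ms(&start, &end);
	return NULL;
}

/* Returns the sum of the per-thread times in ms, or -1.0 on error. */
static double run_bench(int nthreads, unsigned long iters, int with_check)
{
	pthread_t tid[MAX_THREADS];
	struct worker w[MAX_THREADS];
	double total = 0.0;
	int i, started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return -1.0;
	for (i = 0; i < nthreads; ++i) {
		w[i].hdr = &hdrs[i];
		w[i].iters = iters;
		w[i].with_check = with_check;
		w[i].elapsed_ms = 0.0;
		if (pthread_create(&tid[i], NULL, worker_main, &w[i]) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; ++i) {
		pthread_join(tid[i], NULL);
		total += w[i].elapsed_ms;
	}
	return started == nthreads ? total : -1.0;
}
```

Build with -pthread and call, for example, run_bench(8, 1024UL*1024*128, 0) and run_bench(8, 1024UL*1024*128, 1) to compare both variants. Absolute numbers will depend on the compiler, the optimization level and how the headers end up laid out in memory.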

cpu                threads  without-check (ms)  with-check (ms)
Xeon E5-2670             1                 516              529
Xeon E5-2670             2                1155              953
Xeon E5-2670             8                8947             5044
Xeon E5-2670            16               23335            16836
Zynq-7020 (armv7)        1                6735             7205
Zynq-7020 (armv7)        2               13753            14418
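The advantage on Intel becomes evident as the number of threads
increases: with a single thread the check costs about 2.5%
(529 vs. 516 ms), but with 8 threads the total time drops by
about 44% (5044 vs. 8947 ms).

However, on 32-bit ARM we might expect some performance drop:
the Zynq-7020 was about 5-7% slower with the check.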

Regards
Jan

> > 
> > 
> > [...]  
> > > +/* avoid write operation when necessary, to lessen cache issues */
> > > +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> > > +	if ((var) != (val))			\
> > > +		(var) = (val);			\
> > > +} while (0)  

