[dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path

Yuanhan Liu yuanhan.liu at linux.intel.com
Thu Feb 23 06:49:54 CET 2017


On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> 
> 
> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>This patch aligns the Virtio-net header on a cache-line boundary to
> >>optimize cache utilization, as it puts the Virtio-net header (which
> >>is always accessed) on the same cache line as the packet header.
> >>
> >>For example with an application that forwards packets at L2 level,
> >>a single cache-line will be accessed with this patch, instead of
> >>two before.
> >
> >I'm assuming you were testing pkt size <= (64 - hdr_size)?
> 
> No, I tested with 64 bytes packets only.

Oh, my bad, I overlooked it. When you said "a single cache line", I
thought you meant putting the virtio net hdr and the "whole" packet
data in a single cache line, which is not possible for a 64B packet.
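
To make the layout concrete, here is a rough sketch of the offsets
being discussed. The numbers assume 64B cache lines, a 128B mbuf
headroom and a 12B mergeable vnet header; the constants are
illustrative, not DPDK's macros, and this is not the actual patch:

    /*
     * Illustrative only: where the vnet hdr and the packet data land,
     * before and after the alignment change, assuming 64B cache lines,
     * 128B headroom and a 12B vnet hdr.
     */
    #include <stdio.h>

    #define CACHE_LINE  64
    #define HEADROOM    128
    #define VNET_HDR_SZ 12

    int main(void)
    {
        unsigned int hdr_before  = HEADROOM - VNET_HDR_SZ; /* 116 */
        unsigned int data_before = HEADROOM;               /* 128 */
        unsigned int hdr_after   = HEADROOM;               /* 128 */
        unsigned int data_after  = HEADROOM + VNET_HDR_SZ; /* 140 */

        /* cache line index touched by each field */
        printf("before: hdr in line %u, data in line %u\n",
               hdr_before / CACHE_LINE, data_before / CACHE_LINE);
        printf("after : hdr in line %u, data in line %u\n",
               hdr_after / CACHE_LINE, data_after / CACHE_LINE);
        return 0;
    }

So after the change the hdr and the first bytes of packet data share
one cache line, at the cost of the packet data itself no longer being
64B-aligned.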

> I ran some more tests this morning with different packet sizes,
> and also changed the mbuf size on the guest side to get multi-buffer
> packets:
> 
> +-------+--------+--------+-------------------------+
> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> +-------+--------+--------+-------------------------+
> |    64 |   2048 |  11.05 |                   11.78 |
> |   128 |   2048 |  10.66 |                   11.48 |
> |   256 |   2048 |  10.47 |                   11.21 |
> |   512 |   2048 |  10.22 |                   10.88 |
> |  1024 |   2048 |   7.65 |                    7.84 |
> |  1500 |   2048 |   6.25 |                    6.45 |
> |  2000 |   2048 |   5.31 |                    5.43 |
> |  2048 |   2048 |   5.32 |                    4.25 |
> |  1500 |    512 |   3.89 |                    3.98 |
> |  2048 |    512 |   1.96 |                    2.02 |
> +-------+--------+--------+-------------------------+

Could you share more info? Say, is it a PVP test? Is mergeable on?
What's the fwd mode?

> >>In case of multi-buffer packets, the next segments will be aligned on
> >>a cache-line boundary, instead of a cache-line boundary minus the size
> >>of the vnet header as before.
> >
> >The other thing is, this patch always makes the pkt data cache-unaligned
> >for the first packet, which makes Zhihong's optimization on memcpy
> >(for big packets) useless.
> >
> >    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> >    Author: Zhihong Wang <zhihong.wang at intel.com>
> >    Date:   Tue Dec 6 20:31:06 2016 -0500
> 
> I also ran some loopback tests with large packets, and I see a small gain
> with my patch (fwd io on both ends):
> 
> +-------+--------+--------+-------------------------+
> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> +-------+--------+--------+-------------------------+
> |  1500 |   2048 |   4.05 |                    4.14 |
> +-------+--------+--------+-------------------------+

Weird, that basically means Zhihong's patch doesn't work? Could you add
one more column here: what's the data when you roll back to the point
before Zhihong's commit?

> >
> >        Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
> >        Reviewed-by: Yuanhan Liu <yuanhan.liu at linux.intel.com>
> >        Tested-by: Lei Yao <lei.a.yao at intel.com>
> 
> Does this need to be cache-line aligned?

Nope, the required alignment differs between platforms: AVX512 needs
64B alignment, while AVX2 needs 32B alignment.

> I also tried to align the pkt on a 16-byte boundary, basically putting the
> header at HEADROOM + 4 bytes offset, but I didn't measure any gain on
> Haswell,

The fast rte_memcpy path (taken when dst & src are well aligned) on
Haswell (with AVX2) requires 32B alignment, so even a 16B boundary
still falls into the slow path. From this point of view, the extra pad
does not change anything, thus no gain is expected.

> and even a drop on SandyBridge.

That's weird: SandyBridge requires only 16B alignment, meaning the extra
pad should put it into the fast path of rte_memcpy, yet the performance
is worse.
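
For illustration, below is a minimal sketch of the kind of alignment
check being discussed. It is not the actual rte_memcpy code; the
64B/32B/16B thresholds and the offsets (128B headroom, 12B vnet hdr,
the extra 4B pad from the experiment above) are simply the values
mentioned in this thread, and only one side of the copy is considered
for simplicity:

    #include <stdint.h>
    #include <stdio.h>

    #if defined(__AVX512F__)
    #define COPY_ALIGN 64   /* AVX512: 64B alignment for the fast path */
    #elif defined(__AVX2__)
    #define COPY_ALIGN 32   /* Haswell / AVX2: 32B */
    #else
    #define COPY_ALIGN 16   /* SandyBridge / SSE: 16B */
    #endif

    /* Would a copy to/from this packet-data offset hit the fast path? */
    static int
    hits_fast_path(uintptr_t data_off)
    {
        return (data_off & (COPY_ALIGN - 1)) == 0;
    }

    int main(void)
    {
        /* vnet hdr at 128, data at 140 (this patch) */
        printf("data at 140: fast path = %d\n", hits_fast_path(140));
        /* vnet hdr at 132, data at 144 (the HEADROOM + 4 experiment) */
        printf("data at 144: fast path = %d\n", hits_fast_path(144));
        return 0;
    }

On an AVX2 build both offsets miss the 32B check, while with a 16B
requirement only the padded one passes, which matches the observations
above.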

	--yliu

> I understand your point regarding aligned memcpy, but I'm surprised I
> don't see its expected superiority with my benchmarks.
> Any thoughts?

