[dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path

Yuanhan Liu yuanhan.liu at linux.intel.com
Wed Feb 22 02:37:34 CET 2017


On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> This patch aligns the Virtio-net header on a cache-line boundary to
> optimize cache utilization, as it puts the Virtio-net header (which
> is always accessed) on the same cache line as the packet header.
> 
> For example, with an application that forwards packets at the L2 level,
> a single cache line will be accessed with this patch, instead of
> two before.

I'm assuming you were testing pkt size <= (64 - hdr_size)?
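
Just to make that concrete, here is a rough sketch of the offsets involved,
assuming a 64B cache line, RTE_PKTMBUF_HEADROOM = 128 and the 12-byte
mergeable vnet header (this is only an illustration of the layout, not the
actual PMD code):

    #include <stdio.h>
    #include <stddef.h>

    #define CACHE_LINE    64
    #define HEADROOM      128   /* RTE_PKTMBUF_HEADROOM */
    #define VNET_HDR_SIZE 12    /* sizeof(struct virtio_net_hdr_mrg_rxbuf) */

    int main(void)
    {
        /* Today: pkt data is cache-line aligned and the vnet header sits
         * just before it, i.e. on the previous cache line. */
        size_t data_off = HEADROOM;                  /* 128 */
        size_t hdr_off  = data_off - VNET_HDR_SIZE;  /* 116 */
        printf("before: hdr in line %zu, data in line %zu\n",
               hdr_off / CACHE_LINE, data_off / CACHE_LINE);   /* 1 vs 2 */

        /* With your patch: the header itself is cache-line aligned, so
         * the header and the first (64 - 12) = 52 bytes of data share
         * one line, but the data is no longer 64B aligned. */
        hdr_off  = HEADROOM - CACHE_LINE;            /* 64 */
        data_off = hdr_off + VNET_HDR_SIZE;          /* 76 */
        printf("after:  hdr in line %zu, data in line %zu, data %% 64 = %zu\n",
               hdr_off / CACHE_LINE, data_off / CACHE_LINE,
               data_off % CACHE_LINE);               /* 1, 1 and 12 */
        return 0;
    }

So the "one cache line instead of two" gain only holds while the bytes
being accessed stay within those first 52 bytes of data.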

> In the case of multi-buffer packets, the next segments will be aligned on
> a cache-line boundary, instead of a cache-line boundary minus the size of
> the vnet header as before.

The other thing is that this patch always makes the pkt data cache-unaligned
for the first packet, which makes Zhihong's optimization on memcpy
(for big packets) useless.

    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
    Author: Zhihong Wang <zhihong.wang at intel.com>
    Date:   Tue Dec 6 20:31:06 2016 -0500
    
        eal: optimize aligned memcpy on x86
    
        This patch optimizes rte_memcpy for well aligned cases, where both
        dst and src addr are aligned to maximum MOV width. It introduces a
        dedicated function called rte_memcpy_aligned to handle the aligned
        cases with simplified instruction stream. The existing rte_memcpy
        is renamed to rte_memcpy_generic. The selection between the two is
        done at the entry of rte_memcpy.
    
        The existing rte_memcpy is for generic cases: it handles unaligned
        copies and makes stores aligned, and it even makes loads aligned for
        microarchitectures like Ivy Bridge. However, alignment handling comes
        at a price: it adds extra load/store instructions, which can sometimes
        cause complications.
    
        Take DPDK Vhost memcpy with the Mergeable Rx Buffer feature as an
        example: the copy is aligned and remote, and there is a header write
        alongside it which is also remote. In this case the memcpy instruction
        stream should be simplified to reduce extra loads/stores, and thereby
        reduce the probability of pipeline stalls caused by a full load/store
        buffer, so that the actual memcpy instructions are issued and the H/W
        prefetcher goes to work as early as possible.
    
        This patch was tested on Ivy Bridge, Haswell and Skylake; it provides
        up to a 20% gain for Virtio/Vhost PVP traffic, with packet sizes ranging
        from 64 to 1500 bytes.
    
        The test can also be conducted without a NIC, by setting up loopback
        traffic between Virtio and Vhost. For example, modify the macro
        TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
        rebuild, and start testpmd in both host and guest, then run "start" on
        one side and "start tx_first 32" on the other.
    
        Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
        Reviewed-by: Yuanhan Liu <yuanhan.liu at linux.intel.com>
        Tested-by: Lei Yao <lei.a.yao at intel.com>
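
The reason is how the dispatch added by that commit works: rte_memcpy()
only takes the fast path when both addresses are aligned to the maximum
MOV width. Roughly (simplified from memory, not the exact DPDK code;
ALIGNMENT_MASK is the max MOV width minus one, e.g. 0x1F when built with
AVX2, and the two helpers below are just stand-ins for the real
implementations):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdalign.h>

    #define ALIGNMENT_MASK 0x1F   /* max MOV width - 1 (e.g. 0x1F for AVX2) */

    /* Stand-ins for the real aligned/generic copy routines. */
    static void *
    rte_memcpy_aligned(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);
    }

    static void *
    rte_memcpy_generic(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);
    }

    static inline void *
    rte_memcpy(void *dst, const void *src, size_t n)
    {
        /* Fast path only when BOTH dst and src are aligned to the
         * maximum MOV width. */
        if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
            return rte_memcpy_aligned(dst, src, n);

        return rte_memcpy_generic(dst, src, n);
    }

    int main(void)
    {
        alignas(64) char src[128] = "payload";
        alignas(64) char dst[128];

        /* Copying to (dst + 12) mimics pkt data placed right after a
         * cache-line-aligned 12-byte vnet header: the low bits of the
         * destination are non-zero, so the generic path is taken even
         * though the source is well aligned. */
        rte_memcpy(dst + 12, src, 64);
        return 0;
    }

With the vnet header put on the cache-line boundary, the pkt data ends up
at (cache-line boundary + hdr_size), so that alignment test fails and the
data copy is always dispatched to rte_memcpy_generic. That is what I meant
by the optimization becoming useless for big packets.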
    
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin at redhat.com>
> ---
> 
> Hi,
> 
> I'm sending this patch as an RFC because I get strange results on SandyBridge.
> 
> For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
> performance drop on SandyBridge (~-18%).
> When running the PVP benchmark on SandyBridge, however, I measure a +4%
> performance gain.
> 
> So I'd like to call for testing on this patch, especially PVP-like testing
> on newer architectures.
> 
> Regarding SandyBridge, I would be interested to know whether we should
> take the performance drop into account; for example, in the last release we
> merged a patch anyway even though it caused a performance drop on SB.

Sorry, would you remind me which patch it is?

	--yliu

