[dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path
Maxime Coquelin
maxime.coquelin at redhat.com
Wed Feb 22 10:36:36 CET 2017
On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>> This patch aligns the Virtio-net header on a cache-line boundary to
>> optimize cache utilization, as it puts the Virtio-net header (which
>> is always accessed) on the same cache line as the packet header.
>>
>> For example with an application that forwards packets at L2 level,
>> a single cache-line will be accessed with this patch, instead of
>> two before.
>
> I'm assuming you were testing pkt size <= (64 - hdr_size)?
No, I tested with 64-byte packets only.
I ran some more tests this morning with different packet sizes,
and also with a changed mbuf size on the guest side to get
multi-buffer packets:
+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
| 64 | 2048 | 11.05 | 11.78 |
| 128 | 2048 | 10.66 | 11.48 |
| 256 | 2048 | 10.47 | 11.21 |
| 512 | 2048 | 10.22 | 10.88 |
| 1024 | 2048 | 7.65 | 7.84 |
| 1500 | 2048 | 6.25 | 6.45 |
| 2000 | 2048 | 5.31 | 5.43 |
| 2048 | 2048 | 5.32 | 4.25 |
| 1500 | 512 | 3.89 | 3.98 |
| 2048 | 512 | 1.96 | 2.02 |
+-------+--------+--------+-------------------------+
Overall, we can see the patch is beneficial in almost every case.
The only drop is the 2048/2048 case, which is explained by the fact
that two buffers are then needed, as the vnet header plus the packet
no longer fits in 2048 bytes.
It could be fixed by aligning the vnet header on the preceding cache
line, inside the headroom.
>
>> In case of multi-buffer packets, next segments will be aligned on
>> a cache-line boundary, instead of on a cache-line boundary minus the
>> vnet header size as before.
>
> The another thing is, this patch always makes the pkt data cache
> unaligned for the first packet, which makes Zhihong's optimization
> on memcpy (for big packet) useless.
>
> commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> Author: Zhihong Wang <zhihong.wang at intel.com>
> Date: Tue Dec 6 20:31:06 2016 -0500
>
> eal: optimize aligned memcpy on x86
>
> This patch optimizes rte_memcpy for well aligned cases, where both
> dst and src addr are aligned to maximum MOV width. It introduces a
> dedicated function called rte_memcpy_aligned to handle the aligned
> cases with simplified instruction stream. The existing rte_memcpy
> is renamed as rte_memcpy_generic. The selection between them 2 is
> done at the entry of rte_memcpy.
>
> The existing rte_memcpy is for generic cases, it handles unaligned
> copies and make store aligned, it even makes load aligned for micro
> architectures like Ivy Bridge. However alignment handling comes at
> a price: It adds extra load/store instructions, which can cause
> complications sometime.
>
> DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
> The copy is aligned, and remote, and there is header write along
> which is also remote. In this case the memcpy instruction stream
> should be simplified, to reduce extra load/store, therefore reduce
> the probability of load/store buffer full caused pipeline stall, to
> let the actual memcpy instructions be issued and let H/W prefetcher
> goes to work as early as possible.
>
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
>
> The test can also be conducted without NIC, by setting loopback
> traffic between Virtio and Vhost. For example, modify the macro
> TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
> rebuild and start testpmd in both host and guest, then "start" on
> one side and "start tx_first 32" on the other.
I did run some loopback tests with large packets also, and I see a small
gain with my patch (fwd io on both ends):
+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
| 1500 | 2048 | 4.05 | 4.14 |
+-------+--------+--------+-------------------------+
>
> Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
> Reviewed-by: Yuanhan Liu <yuanhan.liu at linux.intel.com>
> Tested-by: Lei Yao <lei.a.yao at intel.com>
Does this need to be cache-line aligned?
I also tried aligning the packet on a 16-byte boundary, basically putting
the header at a HEADROOM + 4 bytes offset, but I didn't measure any gain
on Haswell, and even saw a drop on SandyBridge.
I understand your point regarding the aligned memcpy, but I'm surprised I
don't see its expected superiority in my benchmarks.
Any thoughts?
Cheers,
Maxime
>>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin at redhat.com>
>> ---
>>
>> Hi,
>>
>> I am sending this patch as an RFC because I get strange results on SandyBridge.
>>
>> For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
>> performance drop on SandyBridge (~-18%).
>> When running PVP benchmark on SandyBridge, I measure a +4% performance
>> gain though.
>>
>> So I'd like to call for testing on this patch, especially PVP-like testing
>> on newer architectures.
>>
>> Regarding SandyBridge, I would be interested to know whether we should
>> take the performance drop into account; for example, in the last release
>> we merged a patch anyway that caused a performance drop on SB.
>
> Sorry, would you remind me which patch it is?
>
> --yliu
>