[dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path
Maxime Coquelin
maxime.coquelin at redhat.com
Wed Feb 22 10:36:36 CET 2017
On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>> This patch aligns the Virtio-net header on a cache-line boundary to
>> optimize cache utilization, as it puts the Virtio-net header (which
>> is always accessed) on the same cache line as the packet header.
>>
>> For example with an application that forwards packets at L2 level,
>> a single cache-line will be accessed with this patch, instead of
>> two before.
>
> I'm assuming you were testing pkt size <= (64 - hdr_size)?
No, I tested with 64-byte packets only.
I ran some more tests this morning with different packet sizes,
and also with a changed mbuf size on the guest side to get
multi-buffer packets:
+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
| 64 | 2048 | 11.05 | 11.78 |
| 128 | 2048 | 10.66 | 11.48 |
| 256 | 2048 | 10.47 | 11.21 |
| 512 | 2048 | 10.22 | 10.88 |
| 1024 | 2048 | 7.65 | 7.84 |
| 1500 | 2048 | 6.25 | 6.45 |
| 2000 | 2048 | 5.31 | 5.43 |
| 2048 | 2048 | 5.32 | 4.25 |
| 1500 | 512 | 3.89 | 3.98 |
| 2048 | 512 | 1.96 | 2.02 |
+-------+--------+--------+-------------------------+
Overall, we can see the patch is beneficial in almost every case.
The only drop is the 2048/2048 case, which is explained by the fact
that two buffers are then needed, as the vnet header plus the packet
no longer fits in 2048 bytes.
It could be fixed by aligning the vnet header on the preceding cache
line, inside the headroom.
>
>> In case of multi-buffer packets, next segments will be aligned on
>> a cache-line boundary, instead of on a cache-line boundary minus the
>> vnet header size as before.
>
> The another thing is, this patch always makes the pkt data cache
> unaligned for the first packet, which makes Zhihong's optimization
> on memcpy (for big packet) useless.
>
> commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> Author: Zhihong Wang <zhihong.wang at intel.com>
> Date: Tue Dec 6 20:31:06 2016 -0500
>
> eal: optimize aligned memcpy on x86
>
> This patch optimizes rte_memcpy for well aligned cases, where both
> dst and src addr are aligned to maximum MOV width. It introduces a
> dedicated function called rte_memcpy_aligned to handle the aligned
> cases with simplified instruction stream. The existing rte_memcpy
> is renamed as rte_memcpy_generic. The selection between them 2 is
> done at the entry of rte_memcpy.
>
> The existing rte_memcpy is for generic cases, it handles unaligned
> copies and make store aligned, it even makes load aligned for micro
> architectures like Ivy Bridge. However alignment handling comes at
> a price: It adds extra load/store instructions, which can cause
> complications sometime.
>
> DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
> The copy is aligned, and remote, and there is header write along
> which is also remote. In this case the memcpy instruction stream
> should be simplified, to reduce extra load/store, therefore reduce
> the probability of load/store buffer full caused pipeline stall, to
> let the actual memcpy instructions be issued and let H/W prefetcher
> goes to work as early as possible.
>
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
>
> The test can also be conducted without NIC, by setting loopback
> traffic between Virtio and Vhost. For example, modify the macro
> TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
> rebuild and start testpmd in both host and guest, then "start" on
> one side and "start tx_first 32" on the other.
I did run some loopback tests with large packets also, and I see a small
gain with my patch (fwd io on both ends):
+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
| 1500 | 2048 | 4.05 | 4.14 |
+-------+--------+--------+-------------------------+
>
> Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
> Reviewed-by: Yuanhan Liu <yuanhan.liu at linux.intel.com>
> Tested-by: Lei Yao <lei.a.yao at intel.com>
Does this need to be cache-line aligned?
I also tried aligning the packet on a 16-byte boundary, basically putting
the header at a HEADROOM + 4 bytes offset, but I didn't measure any gain
on Haswell, and even saw a drop on SandyBridge.
I understand your point regarding the aligned memcpy, but I'm surprised I
don't see its expected superiority in my benchmarks.
Any thoughts?
Cheers,
Maxime
>>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin at redhat.com>
>> ---
>>
>> Hi,
>>
>> I am sending this patch as an RFC because I get strange results on SandyBridge.
>>
>> For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
>> performance drop on SandyBridge (~-18%).
>> When running PVP benchmark on SandyBridge, I measure a +4% performance
>> gain though.
>>
>> So I'd like to call for testing on this patch, especially PVP-like testing
>> on newer architectures.
>>
>> Regarding SandyBridge, I would be interested to know whether we should
>> take the performance drop into account; for example, in the last release
>> we merged a patch anyway that caused a performance drop on SB.
>
> Sorry, would you remind me which patch it is?
>
> --yliu
>