[dpdk-dev] [PATCH v2 0/2] Vhost: unitfy receive paths

Maxime Coquelin maxime.coquelin at redhat.com
Thu May 31 14:52:33 CEST 2018



On 05/31/2018 11:55 AM, Wang, Zhihong wrote:
> 
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
>> Sent: Tuesday, May 29, 2018 5:45 PM
>> To: dev at dpdk.org; Bie, Tiwei <tiwei.bie at intel.com>; Wang, Zhihong
>> <zhihong.wang at intel.com>
>> Cc: Maxime Coquelin <maxime.coquelin at redhat.com>
>> Subject: [PATCH v2 0/2] Vhost: unitfy receive paths
>>
>> Hi,
>>
>> This second version fixes the feature bit check in
>> rxvq_is_mergeable(), and remove "mergeable" from rx funcs
>> names. No difference is seen in the benchmarks
>>
>> This series is preliminary work to ease the integration of
>> packed ring layout support. But even without packed ring
>> layout, the result is positive.
>>
>> First patch unify both paths, and second one is a small
>> optimization to avoid copying batch_copy_nb_elems VQ field
>> to/from the stack.
>>
>> With the series applied, I get modest performance gain for
>> both mergeable and non-mergeable casesi (, and the gain of
>> about 300 LoC is non negligible maintenance-wise.
>>
>> Rx-mrg=off benchmarks:
>>
>> +------------+-------+-------------+-------------+----------+
>> |    Run     |  PVP  | Guest->Host | Host->Guest | Loopback |
>> +------------+-------+-------------+-------------+----------+
>> | v18.05-rc5 | 14.47 |       16.64 |       17.57 |    13.15 |
>> | + series   | 14.87 |       16.86 |       17.70 |    13.30 |
>> +------------+-------+-------------+-------------+----------+
>>
>> Rx-mrg=on benchmarks:
>>
>> +------------+------+-------------+-------------+----------+
>> |    Run     | PVP  | Guest->Host | Host->Guest | Loopback |
>> +------------+------+-------------+-------------+----------+
>> | v18.05-rc5 | 9.38 |       13.78 |       16.70 |    12.79 |
>> | + series   | 9.38 |       13.80 |       17.49 |    13.36 |
>> +------------+------+-------------+-------------+----------+
>>
>> Note: Even without my series, the guest->host benchmark with
>> mergeable buffers enabled looks suspicious as it should in
>> theory be alsmost identical as when Rx mergeable buffers are
>> disabled. To be investigated...
>>
>> Maxime Coquelin (2):
>>    vhost: unify Rx mergeable and non-mergeable paths
>>    vhost: improve batched copies performance
>>
>>   lib/librte_vhost/virtio_net.c | 376 +++++-------------------------------------
>>   1 file changed, 37 insertions(+), 339 deletions(-)
>>
> 
> Acked-by: Zhihong Wang <zhihong.wang at intel.com>
> 
> Thanks Maxime! This is really great to see. ;) We probably need the
> same improvement for Virtio-pmd.

Yes, probably. I'll have a look at it, or if you have time to look at
it, won't blame you! :)

> One comment on Virtio/Vhost performance analysis: No matter what type
> of traffic is used (PVP, or Txonly-Rxonly, Loopback...), we need to
> be clear on who we're testing, and give the other part excessive CPU
> resources, otherwise we'll be testing whoever the slowest.
> 
> Since this patch is for Vhost, I suggest to run N (e.g. N = 4) Virtio
> threads on N cores, and the corresponding N Vhost threads on a single
> core, to do performance comparison. Do you think this makes sense?

That's a valid point. I'll try this to get the bottleneck.
I'm in the process of setting up an automated test bench, it will help
running more and more test cases.

> For Guest -> Host, in my test I see Rx-mrg=on has negative impact on
> Virtio side, probably because Virtio touches something that's not
> touched when Rx-mrg=off.

I get it now.
When mrg=off, we use simple_tx version whereas we use the full one when
mrg is off:

static int
virtio_dev_configure(struct rte_eth_dev *dev)
{
...
	hw->use_simple_rx = 1;
	hw->use_simple_tx = 1;

#if defined RTE_ARCH_ARM64 || defined RTE_ARCH_ARM
	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
		hw->use_simple_rx = 0;
		hw->use_simple_tx = 0;
	}
#endif
	if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF)) {
		hw->use_simple_rx = 0;
		hw->use_simple_tx = 0;
	}

	if (rx_offloads & (DEV_RX_OFFLOAD_UDP_CKSUM |
			   DEV_RX_OFFLOAD_TCP_CKSUM))
		hw->use_simple_rx = 0;

	return 0;
}

I see two problems here:
1. There should be no reasons not to use simple_tx if mrg is on.
2. We should add test on whether rx and tx offloads have been
negotiated to not use simple versions if it has been.

Do you agree with that proposed changes?
I'll post a RFC for this.

Thanks,
Maxime

> Thanks
> -Zhihong
> 


More information about the dev mailing list