I tested KNI and compared it with virtio-user. The result was not what I expected:

KNI performs better (+30%) in a simple netperf test with TCP and with UDP at different sizes. I thought the two would have similar performance, but KNI turned out to be faster in my test. I am not sure why.

Note that in my test I did not enable checksum/GSO/… offloading or multi-queue, since we need to do VXLAN encapsulation in software. I am using OVS 2.8.1 and DPDK 17.05.2.
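For anyone who wants to repeat a similar sweep, here is a minimal sketch that builds the netperf command lines for one TCP run and several UDP message sizes. The peer address, the sizes and the run length are hypothetical placeholders, not the values used in the test above.

```python
def netperf_commands(host, udp_sizes=(64, 512, 1400), seconds=10):
    """Build netperf command lines: one TCP_STREAM run, then one
    UDP_STREAM run per message size. Sizes and duration are hypothetical."""
    base = ["netperf", "-H", host, "-l", str(seconds)]
    cmds = [base + ["-t", "TCP_STREAM"]]
    for size in udp_sizes:
        # Test-specific options follow "--"; -m sets the send message size.
        cmds.append(base + ["-t", "UDP_STREAM", "--", "-m", str(size)])
    return cmds


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address standing in for the netserver host.
    for cmd in netperf_commands("192.0.2.10"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run() once netserver is running on the peer.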

In addition, each queue pair on virtio-user creates one vhost thread. If we have many containers, it seems hard to manage the CPU usage. Is there any proposal or practice for limiting the CPU resources of the vhost kthreads?
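One possible stopgap, sketched below, is to find the vhost kthreads and restrict them to a fixed CPU set. It assumes the kthreads show up in /proc with names of the form "vhost-<owner pid>" and that the caller has permission to change their affinity; the function names are my own, not part of any tool.

```python
import os


def find_vhost_threads(proc_root="/proc"):
    """Return {pid: comm} for tasks whose name starts with 'vhost-'.
    Assumes vhost kthreads are named 'vhost-<owner pid>'."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # task exited while scanning, or no comm file
        if comm.startswith("vhost-"):
            found[int(entry)] = comm
    return found


def pin_vhost_threads(cpus, proc_root="/proc"):
    """Restrict every vhost kthread found to the given CPU set (needs root)."""
    threads = find_vhost_threads(proc_root)
    for pid in threads:
        os.sched_setaffinity(pid, cpus)
    return threads
```

Calling pin_vhost_threads({2, 3}) as root would confine all vhost kthreads to CPUs 2 and 3 (a hypothetical choice). This only bounds where they run; it does not cap how much CPU each one uses.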

[Wang Zhike] I once saw you mention that something like an mmap solution might be used. Is it still on your roadmap? I am not sure whether it is the same as the “vhost tx zero copy”.
Is there a forecast date for when the optimization will be done? Would a Linux kernel upstream module be updated, or a DPDK module? I just want to know which modules will be touched.

Yes, I was planning to do that, but found that it helps on the user->kernel path and is not so easy for the kernel->user path. It’s not the same as “vhost tx zero copy” (which has some restrictions, BTW). Packet mmap would share a block of memory between user and kernel space, so that we don’t need to copy (the effect is the same as with “vhost tx zero copy”). As for the date, it still lacks a detailed design and feasibility analysis.

1) Yes, we have done some initial tests internally, with testpmd as the vswitch instead of OVS-DPDK, and we compared against KNI for the exception path.
[Wang Zhike] Could you please indicate how to configure KNI mode? I would also like to compare it.

KNI is a vdev now. You can refer to this link: http://dpdk.org/doc/guides/nics/kni.html
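As a rough illustration of what that looks like with testpmd, the sketch below assembles a command line with KNI vdevs. It assumes the rte_kni kernel module is already loaded and that the devices are named net_kni0, net_kni1, ... as in the guide; the helper name, lcore list and memory channel count are placeholders.

```python
def testpmd_kni_args(num_kni=2, lcores="0-3", mem_channels=4):
    """Build a testpmd argument list with one --vdev per KNI port.
    Lcore list and memory channel count are placeholder values."""
    args = ["testpmd", "-l", lcores, "-n", str(mem_channels)]
    for i in range(num_kni):
        # Each KNI port is requested as a virtual device named net_kni<i>.
        args.append("--vdev=net_kni%d" % i)
    # Everything after "--" goes to testpmd itself; -i starts interactive mode.
    args += ["--", "-i"]
    return args


if __name__ == "__main__":
    print(" ".join(testpmd_kni_args()))
```

Please check the guide for the per-device options supported by your DPDK version.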

2) We also see a similar asymmetric result. On the user->kernel path, the vhost thread not only copies data from mbuf to skb, but might also go up into the TCP stack (you can check using perf).
[Wang Zhike] Yes, indeed. On the user->kernel path, the TCP/IP work is done by the vhost thread, while on the kernel->user path the TCP/IP work is done by the app (netperf in my case) in the syscall.

To move TCP/IP rx processing into the app thread, we might actually be able to avoid that with a small change to the tap driver. Currently we use netif_rx/netif_receive_skb() for rx in tap, which can result in going up the TCP/IP stack in the vhost kthread. Instead, we could backlog the packets onto another CPU (the application thread's CPU?).


