[dpdk-dev] Enhance KNI DPDK-app-side to be Multi-Producer/Consumer
marc.sune at bisdn.de
Sat Nov 15 00:44:37 CET 2014
On 14/11/14 22:05, Sanford, Robert wrote:
> Hello Thomas,
> I want to discuss a small enhancement to KNI that we developed. We wanted
> to send/receive mbufs between one KNI device and multiple cores, but we
> wanted to keep the changes simple. So, here were our requirements:
> 1. Don't use heavy synchronization when reading/writing the FIFOs in
> shared memory.
> 2. Don't make any one core (data or control plane) perform too much work
> shuffling mbufs to/from the FIFOs.
> 3. Don't use an intermediate RTE ring to drop off mbufs when another core
> is already accessing the same KNI FIFO.
> 4. Since (for our private use case) we don't need MP/MC on the kernel
> side, don't change the kernel KNI code at all.
> 5. Don't change the request/reply logic. It stays single-threaded on both
I've done also a quick look during this last days, but still I need to
allocate some time to look it carefully. I haven't done anything yet
though, so I would be very much interested to see your patch.
With the current KNI implementation, and even with the very small patch
([PATCH] Adding RTE_KNI_PREEMPT configuration option) that I sent to the
list (btw, any volunteer to review it? ), we are getting with a single
core KNI a very very high delay, comparing direct kernel (std. driver)
with IGB/IXGB-> KNI->KERNEL->KNI->IGB/IXGB (I was expecting a slightly
increased delay, but not of a factor of x100 or more as I get). In
addition the jitter is really horrendous (buffering between rte_kni and
kernel module?). Btw, we are using a single Kthread (kni_single).
We cannot confirm yet what I am about to say , but having a sequence of
KNIs (PHY>KNI1>KNI2->...->PHY), we have observed a huge decrease in
performance (even with the NO_PREEMPT patch) and, what worries me more,
I've seen reordering of packets.
Of course, we have first to assure that these issues comented before are
not a problem of our DPDK-based switch, but the code is similar to l3fwd
(it is partially based on it). Once I find some time, I will try to
modify the l2fwd example to try this out, and isolate the problem.
One last thing we have obeserved in our application is the huge amount
of mbufs that KNI interfaces require. But this could be a product of the
buffering either in the rte_kni/kernel module or the kernel itself to
treat the incoming skbs, or a combination. I also need to further
> Here is what we came up with:
> 1. Borrow heavily from the librte_ring implementation.
> 2. In librte_kni structure rte_kni, supplement each rte_kni_fifo (tx, rx,
> alloc, and free q) with another private, corresponding structure that
> contains a head, tail, mask, and size field.
> 3. Create kni_fifo_put_mp function with merged logic of kni_fifo_put and
> __rte_ring_mp_do_enqueue. After we update the tail index (which is private
> to the DPDK-app), we update the FIFO write index (shared between app and
> 4. Create kni_fifo_get_mc function with merged logic of kni_fifo_get and
> __rte_ring_mc_do_dequeue. After we update the tail index, update the FIFO
> read index.
> 5. In rte_kni_tx_burst and kni_alloc_mbufs, call kni_fifo_put_mp instead
> of kni_fifo_put.
> 6. In rte_kni_rx_burst and kni_free_bufs, call kni_fifo_get_mc instead of
> We believe this is a common use case, and thus would like to contribute it
> to dpdk.org.
> 1. Are you interested for us to send the changes as an RFC?
I am, personally.
> 2. Do you agree with this approach, or would it be better, say, to rewrite
> both sides of the interface to be more like librte_ring?
I was thinking about something like this.
But one of the original authors can perhaps shed some light here.
> 3. Perhaps we could improve upon kni_allocate_mbufs, as it blindly
> attempts to allocate and enqueue a constant number of mbufs. We have not
> really focused on the kernel ==> app data path, because we were only
> interested in app ==> kernel.
Seen that and agree. Looks like it is a waste of cycles, at first
glance. There is a small patch:
[PATCH] kni: optimizing the rte_kni_rx_burst
That seems to partially address this issue, but as far as I have seen,
it is only addressing the case where no pkts are being send, but not
cases where for instance
@Thomas: I rebased this patch to master HEAD, and I (quickly) tried this
patch, but I don't have results. AFAIU from the patch, it only optimizes
the case where no buffers are being sent in that KNI, so actually the
benchmarking setup seems to be slightly more complicate. In any case,
perhaps waiting for Robert's contribution and merge both would make sense.
> Robert Sanford
More information about the dev