[RFC PATCH v1] net/i40e: put mempool cache out of API

Konstantin Ananyev konstantin.v.ananyev at yandex.ru
Sun Jul 3 14:20:13 CEST 2022


> Referring to "i40e_tx_free_bufs_avx512", this patch puts the mempool cache
> out of the API so that buffers are freed directly. There are two changes
> compared with the previous version:
> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> 2. put the cache out of the "mempool_bulk" API and copy buffers into it directly
> 
> Performance test with the l3fwd NEON path:
> 		with this patch
> n1sdp:		no performance change
> ampere-altra:	+4.0%
> 


Thanks for the RFC, I appreciate your effort.
So, as I understand it, bypassing mempool put/get itself
gives about a 7-10% speedup for RX/TX on ARM platforms,
correct?

About the direct-rearm RX approach you propose:
after giving it another thought, it is probably possible to
re-arrange it in a way that avoids the related drawbacks.
The basic idea is as follows:

1. Make the RXQ sw_ring visible to and accessible by 'attached' TX queues.
    Also decouple the sw_ring from the RXQ itself, i.e.:
    when the RXQ is stopped or even destroyed, the related sw_ring may
    still exist (a ref-counter or RCU would probably be sufficient here).
    All this means we need a common layout/API for rxq_sw_ring, and
    PMDs that want to support direct-rearming will have to use/obey it.

2. Make the 'direct' rearming of the RXQ sw_ring driven by the TXQ itself, i.e.:
    in txq_free_bufs(), try to store released mbufs directly inside the
    attached sw_ring. If there is no attached sw_ring, or not enough
    free space in it, continue with mempool_put() as usual.
    Note that the actual arming of HW RXDs still remains the
    responsibility of the RX code path:
    rxq_rearm(rxq) {
      ...
      - check whether there are N already-filled entries inside rxq_sw_ring;
        if not, populate them from the mempool (usual mempool_get()).
      - arm the related RXDs and mark these sw_ring entries as managed by HW.
      ...
    }
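Points 1 and 2 could be sketched roughly like this. It is a minimal, single-threaded C sketch, not i40e or DPDK code: struct mbuf, struct pool, struct shared_sw_ring, ring_create(), ring_get(), ring_put() and rxq_receive() are hypothetical stand-ins (no real rte_mbuf/rte_mempool is used), and all synchronization between RX and TX lcores is deliberately left out. The ring is ref-counted so it can outlive the RXQ, TX parks freed mbufs in it with a mempool fallback, and RX arms from already-filled entries, topping up from the mempool only when too few are cached.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct rte_mbuf and struct rte_mempool. */
struct mbuf { int id; };
#define POOL_CAP 1024
struct pool { struct mbuf *objs[POOL_CAP]; uint32_t count; };  /* toy LIFO pool */

static void pool_put(struct pool *p, struct mbuf **m, uint32_t n)
{
    assert(p->count + n <= POOL_CAP);
    for (uint32_t i = 0; i < n; i++)
        p->objs[p->count++] = m[i];
}

/* All-or-nothing, returning 0 on success like rte_mempool_get_bulk(). */
static int pool_get(struct pool *p, struct mbuf **m, uint32_t n)
{
    if (p->count < n)
        return -1;
    for (uint32_t i = 0; i < n; i++)
        m[i] = p->objs[--p->count];
    return 0;
}

/* Hypothetical common sw_ring layout, owned by no single queue.
 * Free-running counters split it into [rx, armed) owned by HW,
 * [armed, filled) = private mbuf cache, [filled, rx + size) = empty. */
struct shared_sw_ring {
    atomic_uint refcnt;            /* one reference per RXQ/TXQ user */
    uint32_t size, mask;           /* size is a power of two */
    uint32_t rx, armed, filled;
    struct mbuf *slots[];
};

static struct shared_sw_ring *ring_create(uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    struct shared_sw_ring *r = calloc(1, sizeof(*r) + size * sizeof(r->slots[0]));
    if (r == NULL)
        return NULL;
    atomic_init(&r->refcnt, 1);    /* the creating RXQ's reference */
    r->size = size;
    r->mask = size - 1;
    return r;
}

static void ring_get(struct shared_sw_ring *r) { atomic_fetch_add(&r->refcnt, 1); }

/* The last user (RXQ or TXQ) returns every held mbuf and frees the ring,
 * so destroying the RXQ cannot leave a TXQ with a dangling pointer. */
static void ring_put(struct shared_sw_ring *r, struct pool *p)
{
    if (atomic_fetch_sub(&r->refcnt, 1) != 1)
        return;
    for (uint32_t i = r->rx; i != r->filled; i++)
        pool_put(p, &r->slots[i & r->mask], 1);
    free(r);
}

/* TX side: park freed mbufs in the attached ring (if any); the overflow
 * goes to the mempool as usual. Returns how many were recycled directly. */
static uint32_t txq_free_bufs(struct shared_sw_ring *r, struct pool *p,
                              struct mbuf **m, uint32_t n)
{
    uint32_t k = 0;
    if (r != NULL) {
        uint32_t room = r->size - (r->filled - r->rx);
        k = n < room ? n : room;
        for (uint32_t i = 0; i < k; i++)
            r->slots[(r->filled + i) & r->mask] = m[i];
        r->filled += k;
    }
    pool_put(p, m + k, n - k);
    return k;
}

/* RX side: ensure n cached mbufs (topping up from the mempool), then arm them. */
static uint32_t rxq_rearm(struct shared_sw_ring *r, struct pool *p, uint32_t n)
{
    while (r->filled - r->armed < n) {
        if (r->filled - r->rx == r->size)
            return 0;              /* no empty slot left */
        if (pool_get(p, &r->slots[r->filled & r->mask], 1) != 0)
            return 0;              /* mempool exhausted */
        r->filled++;
    }
    r->armed += n;                 /* a real PMD writes n RXDs here */
    return n;
}

/* RX side: HW completed up to n armed RXDs; pass those mbufs up. */
static uint32_t rxq_receive(struct shared_sw_ring *r, struct mbuf **out, uint32_t n)
{
    uint32_t done = r->armed - r->rx, k = n < done ? n : done;
    for (uint32_t i = 0; i < k; i++)
        out[i] = r->slots[(r->rx + i) & r->mask];
    r->rx += k;
    return k;
}
```

With three free-running counters, TX only needs rx and filled to compute free space, while armed is private to RX. A real implementation would still need atomics or barriers on rx/filled, plus a lock or CAS if RX and TX may both fill the ring at the same time.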


So rxq_sw_ring will serve two purposes:
- track mbufs that are managed by HW (that is what it does now)
- act as a private (per-RXQ) mbuf cache

Now, if the TXQ is stopped while the RXQ is running,
no extra synchronization is required: the RXQ would just use
mempool_get() to rearm its sw_ring itself.

If the RXQ is stopped while the TXQ is still running,
the TXQ can continue to populate the related sw_ring until it is full,
and then fall back to mempool_put() as usual.
Of course, this means that a user who wants this feature should
probably budget some extra mbufs for such a case; alternatively,
rxq_sw_ring could have an enable/disable flag to mitigate this situation.
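A minimal sketch of that trade-off, assuming a hypothetical per-ring enable flag and counters (struct rx_cache_state and txq_park_quota() are illustrative names, not existing DPDK API). With the flag left on, a stopped RXQ can absorb up to 'capacity' mbufs before the TXQ falls back to the mempool, which is roughly the extra amount a user would have to budget per attached ring. Clearing the flag stops further parking but does not by itself release mbufs already parked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one RXQ sw_ring as seen by a TXQ. */
struct rx_cache_state {
    atomic_bool enabled;   /* e.g. cleared by the RX side when the RXQ stops */
    uint32_t capacity;     /* sw_ring slots available for parked mbufs */
    uint32_t parked;       /* mbufs currently parked by TXQs */
};

/* How many of n freed mbufs the TXQ may park; the rest go to mempool_put(). */
static uint32_t txq_park_quota(struct rx_cache_state *s, uint32_t n)
{
    if (!atomic_load_explicit(&s->enabled, memory_order_acquire))
        return 0;
    uint32_t room = s->capacity - s->parked;
    uint32_t k = n < room ? n : room;
    s->parked += k;
    return k;
}
```

For example, with capacity 512 and the flag still on, a stopped RXQ strands at most 512 mbufs; with the flag cleared, every freed mbuf goes straight back to the mempool.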

Another benefit: such an approach makes it possible to use
several TXQs (even from different devices) to rearm the same RXQ.
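Several TXQs rearming one RXQ means several producers writing into the same sw_ring, so the fill position needs protection. A minimal sketch, assuming a hypothetical struct shared_ring guarded by a C11 atomic_flag spinlock (chosen here for brevity; a lock-free reservation scheme would be an equally plausible choice), with the RXQ publishing its consumer index through an atomic. The RX side would have to take the same lock whenever it reads 'filled' or tops the ring up from the mempool.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf. */
struct mbuf { int id; };

#define RING_MAX 256
struct shared_ring {
    atomic_flag lock;              /* serializes all producers */
    uint32_t size, mask;           /* size is a power of two <= RING_MAX */
    uint32_t filled;               /* producer index, protected by lock */
    _Atomic uint32_t rx;           /* consumer index, published by the RXQ */
    struct mbuf *slots[RING_MAX];
};

static void ring_init(struct shared_ring *r, uint32_t size)
{
    assert(size != 0 && size <= RING_MAX && (size & (size - 1)) == 0);
    atomic_flag_clear(&r->lock);
    atomic_init(&r->rx, 0);
    r->size = size;
    r->mask = size - 1;
    r->filled = 0;
}

static void ring_lock(struct shared_ring *r)
{
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ;                          /* spin until the holder releases */
}

static void ring_unlock(struct shared_ring *r)
{
    atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Called from any attached TXQ; returns how many of the n mbufs were
 * stored, and the caller hands m[k..n) to the mempool. */
static uint32_t txq_deposit(struct shared_ring *r, struct mbuf **m, uint32_t n)
{
    ring_lock(r);
    uint32_t rx = atomic_load_explicit(&r->rx, memory_order_acquire);
    uint32_t room = r->size - (r->filled - rx);
    uint32_t k = n < room ? n : room;
    for (uint32_t i = 0; i < k; i++)
        r->slots[(r->filled + i) & r->mask] = m[i];
    r->filled += k;
    ring_unlock(r);
    return k;
}
```

The lock is taken once per burst rather than per mbuf, so its cost is amortized over the batch; whether that is acceptable on the hot TX path is exactly the kind of extra complexity weighed below.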

I have to say that I am still not sure a 10% RX/TX improvement is
worth bypassing the mempool completely and introducing all this extra
complexity into the RX/TX path.
But if we still decide to go ahead with direct-rearming, I think this
re-arrangement should help keep things clear and avoid introducing
new limitations into existing functionality.

WDYT?

Konstantin
