[dpdk-dev] rte_mbuf.next in 2nd cacheline
    Ananyev, Konstantin 
    konstantin.ananyev at intel.com
       
    Wed Jun 17 20:50:54 CEST 2015
    
    
  
Hi Dave,
> From: Dave Barach (dbarach) [mailto:dbarach at cisco.com]
> Sent: Wednesday, June 17, 2015 4:46 PM
> To: Venkatesan, Venky; Richardson, Bruce; olivier.matz at 6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Dear Venky,
> 
> The first thing I noticed - on a specific piece of hardware, yadda yadda yadda - was that the i40e driver speed-path spent an ungodly
> amount of time stalled in i40e_rx_alloc_bufs(.) in rte_mbuf_refcnt_set. Mumble, missing prefetch. I added a stride-of-1 prefetch,
> which made the first stall go away. See below.
> 
> Next thing I noticed: a stall setting mb->next = 0, in the non-scattered RX case. So I added the (commented-out) rte_prefetch0
> (&pfmb->next). At that point, I decided to move the buffer metadata around to eliminate the second prefetch.
> 
> Taken together, these changes made a 10% PPS difference, again in a specific use case.
That seems like a valid point, but why moving next to the first cache-line is considered as the only possible option?
As Bruce suggested before, we can try to get rid of touching next in non-scattered RX routines.
That I think should provide similar performance improvement for i40e/ixgbe fast-path scalar RX code. 
Or you are talking about hit in your app layer code, when you start chain mbufs together?
> 
> Damjan was going to produce a cleaned up version of the rte_mbuf.h diffs, of the form:
> 
> struct rte_mbuf {
> #ifdef CONFIG_OFFLOAD_IN_FIRST_CACHE_LINE
>   Offload-in-first-cache-line;
>   Next-in-second-cache-line;
> #else
>   Offload-in-second-cache-line;
>   Next-in-first-cache-line;r
> #endif
> };
> 
> .along with a parallel change in the kernel module version.
As first, with the proposed below changes ixgbe vector RX routine would be broken.
Of course, it could be fixed by putting even more conditional compilation around -
enable vector RX, only when OFFLOAD_IN_FIRST_CACHE_LINE enabled, etc.
Second, how long it would take before someone else would like to introduce another mbuf fields swap?
All for perfectly good reason of course.
Let say, swap 'hash' and 'segn' (to keep vector RX working),
or 'next' and 'userdata', or put tx_offload into the first line (traffic generators).
I think if we'll go that way  (allow mbuf fields swapping at build-time) we'll end up with
totally unmaintainable code.
Not to mention problems with ABI compatibility) (shared libraries, multiple processes, KNI).
So, I think we better stick with one mbuf format.
If it's absolutely necessary to move next into the first cache, I think it has to be done for all configs.
Though, from what you described with i40e_rx_alloc_bufs() -
that looks like a flaw in particular implementation, that might be fixed without changing mbuf format.
> 
> I'm out of LSU bandwidth / prefetch slots in a number of apps - mostly L2 apps - which will never use h/w offload results. I can see
> your point about not turning dpdk buffer metadata into a garbage can. On the other hand, the problems we face often involve MTU
> => scattered RX, with a mix of small and large packets. As in: enough PPS to care about extra cache-line fills.
> 
> Hope this explains it a bit. We can obviously make these changes in our own repos, but I can't imagine that we're the only folks who
> might not use h/w offload results.
> 
> Thanks. Dave
> 
> 
> diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
> index 9c7be6f..1200361 100644
> --- a/lib/librte_pmd_i40e/i40e_rxtx.c
> +++ b/lib/librte_pmd_i40e/i40e_rxtx.c
> @@ -779,9 +779,15 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> 
>       rxdp = &rxq->rx_ring[alloc_idx];
>       for (i = 0; i < rxq->rx_free_thresh; i++) {
> +                if (i < (rxq->rx_free_thresh - 1)) {
> +                        struct rte_mbuf *pfmb;
> +                        pfmb = rxep[i+1].mbuf;
> +                        rte_prefetch0 (pfmb);
> +                        // rte_prefetch0 (&pfmb->next);
Wonder does your compiler unroll that loop?
If not, wonder, would manually unrolling it (by 4 or so) help?
Konstantin  
> +                }
>             mb = rxep[i].mbuf;
>             rte_mbuf_refcnt_set(mb, 1);
> -           mb->next = NULL;
> +           mb->next = NULL; /* $$$ in second cacheline */
>             mb->data_off = RTE_PKTMBUF_HEADROOM;
>             mb->nb_segs = 1;
>             mb->port = rxq->port_id;
> 
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 55749bc..efd7f4e 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -288,6 +288,12 @@ struct rte_mbuf {
>       uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
>       uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
>       uint16_t reserved;
> +     uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> +
> +     struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +
> +     /* second cache line - fields only used in slow path or on TX */
> +     MARKER cacheline1 __rte_cache_aligned;
>       union {
>             uint32_t rss;     /**< RSS hash result if RSS enabled */
>             struct {
> @@ -307,18 +313,12 @@ struct rte_mbuf {
>             uint32_t usr;       /**< User defined tags. See rte_distributor_process() */
>       } hash;                   /**< hash information */
> 
> -     uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> -
> -     /* second cache line - fields only used in slow path or on TX */
> -     MARKER cacheline1 __rte_cache_aligned;
> -
>       union {
>             void *userdata;   /**< Can be used for external metadata */
>             uint64_t udata64; /**< Allow 8-byte userdata on 32-bit */
>       };
> 
>       struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> -     struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> 
>       /* fields to support TX offloads */
>       union {
> 
> 
> 
> 
> Thanks. Dave
> 
> From: Venkatesan, Venky [mailto:venky.venkatesan at intel.com]
> Sent: Wednesday, June 17, 2015 11:03 AM
> To: Richardson, Bruce; Dave Barach (dbarach); olivier.matz at 6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Dave,
> 
> Is there a patch that I can look at? From your description what you seem to indicate is that the mbuf structure becomes variant
> between applications - that to me is likely a non-starter. If rte_mbuf is variant between implementations, then  we really have no
> commonality that all VNFs can rely on.
> 
> I may be getting a bit ahead of things here but letting different apps implement different mbuf layouts means, to achieve any level of
> commonality for upper level apps we need to go to accessor functions (not inline macros) to get various components - pretty much
> like ODP does; all that does is add function call overhead per access and that is completely non-performant. We toyed with that about
> 5 years ago and tossed it out.
> 
> Bruce,
> 
> To some extent I need to get an understanding of why this performance drop is actually happening. Since SNB we have had a paired
> cache line prefetcher that brings in the cache line pair. I think the reason that we have the perf issue is that half the mbufs actually
> begin on an odd cache line boundary - i.e. those are the ones that will suffer a cache miss on rte_mbuf.next. Could you verify that this
> is the case, and see what happens if we address all mbufs beginning on an even cache line? That runs into another potential issue with
> 4K aliasing, but I may have a way to avoid that.
> 
> Regards,
> -Venky
> 
> 
> From: Richardson, Bruce
> Sent: Wednesday, June 17, 2015 7:50 AM
> To: Dave Barach (dbarach); olivier.matz at 6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion); Venkatesan, Venky
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Hi,
> 
> it's something we can look at - especially once we see the proposed patch.
> 
> However, my question is, if an app is cycle constrained such that the extra cache-line reference for jumbo frames cause a problem, is
> that app not likely to really suffer performance issues when run with smaller e.g. 256 byte packet sizes?  And in the inverse case, if an
> app can deal with small packets, is it not likely able to take the extra hit for the jumbo frames? This was our thinking when doing the
> cache-line split, but perhaps the logic doesn't hold up in the cases you refer to.
> 
> Regards,
> /Bruce
> 
> From: Dave Barach (dbarach) [mailto:dbarach at cisco.com]
> Sent: Wednesday, June 17, 2015 3:37 PM
> To: Richardson, Bruce; olivier.matz at 6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> Dear Bruce,
> 
> We've implemented a number of use cases which can't or at least don't currently use hardware offload data. Examples: L2 bridging,
> integrated routing and bridging, various flavors of tunneling, ipv6 MAP, ipv6 segment routing, and so on.
> 
> You're not considering the cost of additional pressure on the load-store unit, memory system, caused when mb->next must be
> prefetched, set, and cleared. It doesn't matter that the packets are large, the cost still exists. As we transition to 40 and 100g NICs,
> large-packet PPS-per-core becomes more of an issue.
> 
> Damjan has offered to make the layout of the buffer metadata configurable, so that folks can "have it their way." Given that it's a ~10
> line change - with minimal comedic potential - it seems like a reasonable way to go.
> 
> Thoughts?
> 
> Thanks. Dave Barach
> Cisco Fellow
> 
> From: Damjan Marion (damarion)
> Sent: Wednesday, June 17, 2015 10:17 AM
> To: Dave Barach (dbarach)
> Subject: Fwd: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> 
> 
> 
> Begin forwarded message:
> 
> From: Bruce Richardson <bruce.richardson at intel.com>
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> Date: 17 Jun 2015 16:06:48 CEST
> To: "Damjan Marion (damarion)" <damarion at cisco.com>
> Cc: Olivier MATZ <olivier.matz at 6wind.com>, "Ananyev, Konstantin" <konstantin.ananyev at intel.com>, "dev at dpdk.org"
> <dev at dpdk.org>
> 
> On Wed, Jun 17, 2015 at 01:55:57PM +0000, Damjan Marion (damarion) wrote:
> 
> On 15 Jun 2015, at 16:12, Bruce Richardson <bruce.richardson at intel.com> wrote:
> 
> The next pointers always start out as NULL when the mbuf pool is created. The
> only time it is set to non-NULL is when we have chained mbufs. If we never have
> any chained mbufs, we never need to touch the next field, or even read it - since
> we have the num-segments count in the first cache line. If we do have a multi-segment
> mbuf, it's likely to be a big packet, so we have more processing time available
> and we can then take the hit of setting the next pointer.
> 
> There are applications which are not using rx offload, but they deal with chained mbufs.
> Why they are less important than ones using rx offload? This is something people
> should be able to configure on build time.
> 
> It's not that they are less important, it's that the packet processing cycle count
> budget is going to be greater. A packet which is 64 bytes, or 128 bytes in size
> can make use of a number of RX offloads to reduce it's processing time. However,
> a 64/128 packet is not going to be split across multiple buffers [unless we
> are dealing with a very unusual setup!].
> 
> To handle 64 byte packets at 40G line rate, one has 50 cycles per core per packet
> when running at 3GHz. [3000000000 cycles / 59.5 mpps].
> If we assume that we are dealing with fairly small buffers
> here, and that anything greater than 1k packets are chained, we still have 626
> cycles per 3GHz core per packet to work with for that 1k packet. Given that
> "normal" DPDK buffers are 2k in size, we have over a thousand cycles per packet
> for any packet that is split.
> 
> In summary, packets spread across multiple buffers are large packets, and so have
> larger packet cycle count budgets and so can much better absorb the cost of
> touching a second cache line in the mbuf than a 64-byte packet can. Therefore,
> we optimize for the 64B packet case.
> 
> Hope this clarifies things a bit.
> 
> Regards,
> /Bruce
    
    
More information about the dev
mailing list