[dpdk-dev] [PATCH 15/15] mbuf: move pool pointer in hotter first half

Morten Brørup mb at smartsharesystems.com
Sun Nov 1 10:12:43 CET 2020


> From: Thomas Monjalon [mailto:thomas at monjalon.net]
> Sent: Saturday, October 31, 2020 9:41 PM
> 
> 31/10/2020 19:20, Morten Brørup:
> > Thomas,
> >
> > Adding my thoughts to the already detailed feedback on this important
> patch...
> >
> > The first cache line is not inherently "hotter" than the second. The
> hotness depends on their usage.
> >
> > The mbuf cacheline1 marker has the following comment:
> > /* second cache line - fields only used in slow path or on TX */
> >
> > In other words, the second cache line is intended not to be touched in
> fast path RX.
> >
> > I do not think this is true anymore. Not even with simple non-scattered
> RX. And regression testing probably didn't catch this, because the tests
> perform TX after RX, so the cache miss moved from TX to RX and became a
> cache hit in TX instead. (I may be wrong about this claim, but it's not
> important for the discussion.)
> >
> > I think the right question for this patch is: Can we achieve this - not
> using the second cache line for fast path RX - again by putting the right
> fields in the first cache line?
> >
> > Probably not in all cases, but perhaps for some...
> >
> > Consider the application scenarios.
> >
> > When a packet is received, one of three things happens to it:
> > 1. It is immediately transmitted on one or more ports.
> > 2. It is immediately discarded, e.g. by a firewall rule.
> > 3. It is put in some sort of queue, e.g. a ring for the next pipeline
> stage, or in a QoS queue.
> >
> > 1. If the packet is immediately transmitted, the m->tx_offload field in
> the second cache line will be touched by the application and TX function
> anyway, so we don't need to optimize the mbuf layout for this scenario.
> >
> > 2. The second scenario touches m->pool no matter how it is implemented.
> The application can avoid touching m->next by using rte_mbuf_raw_free(),
> knowing that the mbuf came directly from RX and thus no other fields have
> been touched. In this scenario, we want m->pool in the first cache line.
> >
> > 3. Now, let's consider the third scenario, where RX is followed by
> enqueue into a ring. If the application does nothing but put the packet
> into a ring, we don't need to move anything into the first cache line. But
> applications usually do more... So it is application-specific what would
> be good to move to the first cache line:
> >
> > A. If the application does not use segmented mbufs, and performs analysis
> and preparation for transmission in the initial pipeline stages, and only
> the last pipeline stage performs TX, we could move m->tx_offload to the
> first cache line, which would keep the second cache line cold until the
> actual TX happens in the last pipeline stage - maybe even after the packet
> has waited in a QoS queue for a long time, and its cache lines have gone
> cold.
> >
> > B. If the application uses segmented mbufs on RX, it might make sense
> to move m->next to the first cache line. (We don't use segmented mbufs, so
> I'm not sure about this.)
> >
> >
> > However, reality perhaps beats theory:
> >
> > Looking at the E1000 PMD, it seems like even its non-scattered RX
> function, eth_igb_recv_pkts(), sets m->next. If only it kept its own free
> pool pre-initialized instead... I haven't investigated other PMDs, except
> briefly looking at the mlx5 PMD, and it seems like it doesn't touch m->next
> in RX.
> >
> > I haven't looked deeper into how m->pool is being used by RX in PMDs, but
> I suppose that it isn't touched in RX.
> >
> > <rant on>
> > If only we had a performance test where RX was not immediately followed
> by TX, but the packets were passed through a large queue in-between, so RX
> cache misses would not appear free of charge simply because they turn
> would-be TX cache misses into cache hits...
> > <rant off>
> >
> > Whatever you choose, I am sure that most applications will find it more
> useful than the timestamp. :-)
> 
> Thanks for the thoughts Morten.
> I believe we need benchmarks of different scenarios with different drivers.
>
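
To make scenario 2 from the quote above concrete, here is a minimal
sketch of an RX-and-discard loop (the function name and burst size are
mine; rte_mbuf_raw_free() is the existing DPDK API):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Scenario 2: packets received and immediately discarded, e.g. by a
 * firewall rule. rte_mbuf_raw_free() is safe here because the mbufs
 * come straight from RX: direct, refcnt == 1, not segmented. In a
 * release build it only reads m->pool - it never touches m->next. */
static void
rx_and_discard(uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t i, nb;

        nb = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
        for (i = 0; i < nb; i++)
                rte_mbuf_raw_free(pkts[i]);
}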

If we are only allowed to modify the mbuf structure this one more time, we should look forward, not backward!

If we move m->tx_offload to the first cache line, applications using simple, non-scattered packet mbufs would never even need to touch the second cache line, except for freeing the mbuf (which needs to read m->pool).
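
Purely as an illustration of that claim (this is NOT the actual
rte_mbuf definition, just a simplified sketch), such a first cache
line could look something like this:

#include <stdint.h>
#include <rte_mempool.h>

/* Illustrative only - NOT the real rte_mbuf. A simplified sketch
 * showing that the RX, forwarding and free paths could all fit in
 * the first 64 bytes if tx_offload moved up from cache line 2. */
struct mbuf_sketch {
        void *buf_addr;             /* offset  0: data buffer */
        uint64_t buf_iova;          /* offset  8: bus address */
        uint16_t data_off;          /* offset 16 */
        uint16_t refcnt;
        uint16_t nb_segs;
        uint16_t port;
        uint64_t ol_flags;          /* offset 24: offload flags */
        uint32_t packet_type;       /* offset 32 */
        uint32_t rss_hash;          /* offset 36 */
        uint32_t pkt_len;           /* offset 40 */
        uint16_t data_len;          /* offset 44 */
        uint16_t vlan_tci;          /* offset 46 */
        uint64_t tx_offload;        /* offset 48: moved up from CL2 */
        struct rte_mempool *pool;   /* offset 56: read only on free */
        /* --- 64-byte boundary: below stays cold on fast path --- */
        struct mbuf_sketch *next;   /* only for segmented packets */
};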

And this leads to my next suggestion...

One thing has always puzzled me: Why do we use a full 64-bit pointer to indicate which memory pool an mbuf belongs to? A port is identified by a 16-bit portid, which is an indirection index. Why don't we use the same kind of indirection index for mbuf pools?

I can easily imagine using one mbuf pool (or perhaps a few pools) per CPU socket (or per physical memory bus closest to an attached NIC), but not more than 256 mbuf memory pools in total. So, let's introduce an mbufpoolid like the portid, and cut this mbuf field down from 64 to 8 bits.

If we also cut down m->pkt_len from 32 to 24 bits, we can get the 8 bit mbuf pool index into the first cache line at no additional cost.

In other words: This would free up another 64 bit field in the mbuf structure!
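
A hedged sketch of what that indirection could look like (none of
these names exist in DPDK today; they are only for illustration):

#include <stdint.h>
#include <rte_mempool.h>

/* Hypothetical indirection table, analogous to the portid lookup.
 * Nothing below exists in DPDK today. */
#define MBUF_MAX_POOLS 256

static struct rte_mempool *mbuf_pool_table[MBUF_MAX_POOLS];

static inline struct rte_mempool *
mbuf_pool_from_id(uint8_t pool_id)
{
        return mbuf_pool_table[pool_id];
}

/* The freed-up bits: an 8-bit pool id shares a 32-bit word with a
 * 24-bit pkt_len, capping packets at 16 MB - 1, far above any MTU. */
struct len_and_pool {
        uint32_t pkt_len : 24;
        uint32_t pool_id : 8;
};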


And even though the m->next pointer for scattered packets resides in the second cache line, the libraries and applications know that m->next is NULL when m->nb_segs is 1. This means that my suggestion would make touching the second cache line unnecessary (in simple cases), even for re-initializing the mbuf.
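
For example, a free function along these lines would stay entirely
within the first cache line for simple mbufs (a sketch only, reusing
the hypothetical mbuf_pool_from_id() from above):

#include <rte_branch_prediction.h>
#include <rte_mbuf.h>

/* Sketch: free a known-simple mbuf without touching cache line 2.
 * nb_segs == 1 implies m->next == NULL, so m->next needs neither
 * reading nor clearing. The checks mirror what rte_mbuf_raw_free()
 * requires. */
static inline void
mbuf_fast_free(struct rte_mbuf *m)
{
        if (likely(RTE_MBUF_DIRECT(m) && m->nb_segs == 1 &&
                        rte_mbuf_refcnt_read(m) == 1))
                /* with the proposal: mbuf_pool_from_id(m->pool_id) */
                rte_mempool_put(m->pool, m);
        else
                rte_pktmbuf_free(m); /* slow path */
}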


And now I will go off on a tangent with two more independent thoughts, so feel free to ignore them.

Consider a multi-socket system with one mbuf pool per CPU socket, where the NICs attached to each socket use an RX mbuf pool backed by RAM on that same socket. I would imagine that (re-)initializing these mbufs could be faster if performed only by a CPU on the same socket. If this is the case, mbufs should be re-initialized as part of the RX preparation at ingress, not as part of the mbuf free at egress.
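
Setting up such per-socket pools is already possible with the standard
API; for reference (the helper, names and sizes are arbitrary
examples):

#include <stdio.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* One mbuf pool per CPU socket, backed by RAM on that socket. */
static struct rte_mempool *
create_pool_for_socket(int socket_id)
{
        char name[32];

        snprintf(name, sizeof(name), "mbuf_pool_socket%d", socket_id);
        return rte_pktmbuf_pool_create(name, 65535, 256, 0,
                        RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
}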

Perhaps some microarchitectures can compare nb_segs==0 faster than nb_segs==1. If so, nb_segs could be redefined to mean the number of additional segments, rather than the total number of segments.
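
An illustrative snippet of that redefinition (hypothetical, not the
current DPDK semantics):

#include <rte_mbuf.h>

/* If nb_segs meant "number of ADDITIONAL segments", the common
 * single-segment test would compare against zero. */
static inline int
mbuf_is_simple(const struct rte_mbuf *m)
{
        return m->nb_segs == 0; /* today this would be == 1 */
}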


PS: I have added two more mlx5 maintainers to the discussion; they might have qualified opinions about how PMDs could benefit from this.


Med venlig hilsen / kind regards
- Morten Brørup




