[dpdk-dev] [PATCH 15/15] mbuf: move pool pointer in hotterfirst half

Morten Brørup mb at smartsharesystems.com
Tue Nov 3 13:10:05 CET 2020


> From: Thomas Monjalon [mailto:thomas at monjalon.net]
> Sent: Monday, November 2, 2020 4:58 PM
> 
> +Cc techboard
> 
> We need benchmark numbers in order to take a decision.
> Please all, prepare some arguments and numbers so we can discuss
> the mbuf layout in the next techboard meeting.

I propose that the techboard consider this from two angles:

1. Long term goals and their relative priority, i.e. what can be
achieved with wide-ranging modifications, requiring yet another ABI
break and the due deprecation notices.

2. Short term goals, i.e. what can be achieved for this release.


My suggestions follow...

1. Regarding long term goals:

I have argued that simple forwarding of non-segmented packets using
only the first mbuf cache line can be achieved by making three
modifications:

a) Move m->tx_offload to the first cache line.
b) Use an 8 bit pktmbuf mempool index in the first cache line,
   instead of the 64 bit m->pool pointer in the second cache line
   (see the sketch after this list).
c) Do not access m->next when we know that it is NULL.
   We can use m->nb_segs == 1 or some other invariant as the gate.
   It can be implemented by adding an m->next accessor function:
   static inline struct rte_mbuf *
   rte_mbuf_next(const struct rte_mbuf *m)
   {
       return m->nb_segs == 1 ? NULL : m->next;
   }
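
To make (b) a bit more concrete, here is a minimal sketch of what a
portid-style indirection could look like. All names below
(rte_pktmbuf_pools, RTE_MAX_MBUF_POOLS, m->pool_idx, rte_mbuf_pool())
are placeholders for illustration only, not existing or proposed API:

    /* Hypothetical indirection table, analogous to the portid
     * indirection: with e.g. 8 pools at 8 bytes per pointer it fits
     * in a single cache line. */
    extern struct rte_mempool *rte_pktmbuf_pools[RTE_MAX_MBUF_POOLS];

    /* Resolve the 8 bit index (stored in the first cache line of the
     * mbuf) to the mempool, replacing the 64 bit m->pool pointer in
     * the second cache line. */
    static inline struct rte_mempool *
    rte_mbuf_pool(const struct rte_mbuf *m)
    {
        return rte_pktmbuf_pools[m->pool_idx];
    }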

Regarding the priority of this goal: simple forwarding of
non-segmented packets is probably the path taken by the majority of
packets handled by DPDK applications.


An alternative goal could be:
Do not touch the second cache line during RX.
A comment in the mbuf structure says so, but it is not true anymore.

(I guess that regression testing didn't catch this because the tests
perform TX immediately after RX, so the cache miss just moves from
the TX to the RX part of the test application.)


2. Regarding short term goals:

Looking at the current DPDK source code, m->next seems to be the most
frequently accessed field in the second cache line, so it makes more
sense to move this to the first cache line than m->pool.
Benchmarking may help here.

If we - without breaking the ABI - can introduce a gate to avoid
accessing m->next when we know that it is NULL, we should keep it in
the second cache line.

In this case, I would prefer to move m->tx_offload to the first cache
line, thereby providing a field available for application use, until
the application prepares the packet for transmission.
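
As an illustration of such a gate in application code, here is a rough
sketch using the rte_mbuf_next() accessor proposed under 1 c) above.
The handle_single_seg() and handle_multi_seg() functions are
hypothetical, only there to show the branch:

    if (rte_mbuf_next(m) == NULL) {
        /* Non-segmented packet: the gate resolved from m->nb_segs in
         * the first cache line, so m->next was never read. */
        handle_single_seg(m);
    } else {
        /* Segmented packet: m->next was read, but this slow path
         * needs the second cache line anyway. */
        handle_multi_seg(m);
    }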


> 
> 
> 01/11/2020 21:59, Morten Brørup:
> > > From: Thomas Monjalon [mailto:thomas at monjalon.net]
> > > Sent: Sunday, November 1, 2020 5:38 PM
> > >
> > > 01/11/2020 10:12, Morten Brørup:
> > > > One thing has always puzzled me:
> > > > Why do we use 64 bits to indicate which memory pool
> > > > an mbuf belongs to?
> > > > The portid only uses 16 bits and an indirection index.
> > > > Why don't we use the same kind of indirection index for mbuf
> pools?
> > >
> > > I wonder what would be the cost of indirection. Probably
> > > negligible.
> >
> > Probably. The portid does it, and that indirection is heavily used
> everywhere.
> >
> > The size of mbuf memory pool indirection array should be compile time
> configurable, like the size of the portid indirection array.
> >
> > And for reference, the indirection array will fit into one cache line
> if we default to 8 mbuf pools, thus supporting an 8 CPU socket system
> with one mbuf pool per CPU socket, or a 4 CPU socket system with two
> mbuf pools per CPU socket.
> >
> > (And as a side note: Our application is optimized for single-socket
> systems, and we only use one mbuf pool. I guess many applications were
> developed without carefully optimizing for multi-socket systems, and
> also just use one mbuf pool. In these cases, the mbuf structure doesn't
> really need a pool field. But it is still there, and the DPDK libraries
> use it, so we didn't bother removing it.)
> >
> > > I think it is a good proposal...
> > > ... for next year, after a deprecation notice.
> > >
> > > > I can easily imagine using one mbuf pool (or perhaps a few pools)
> > > > per CPU socket (or per physical memory bus closest to an attached
> NIC),
> > > > but not more than 256 mbuf memory pools in total.
> > > > So, let's introduce an mbufpoolid like the portid,
> > > > and cut this mbuf field down from 64 to 8 bits.
> 
> We will need to measure the perf of the solution.
> There is a chance for the cost to be too high.
> 
> 
> > > > If we also cut down m->pkt_len from 32 to 24 bits,
> > >
> > > Who is using packets larger than 64k? Are 16 bits enough?
> >
> > I personally consider 64k a reasonable packet size limit. Exotic
> applications with even larger packets would have to live with this
> constraint. But let's see if there are any objections. For reference,
> 64k corresponds to ca. 44 Ethernet (1500 byte) packets.
> >
> > (The limit could be 65535 bytes, to avoid translation of the value 0
> into 65536 bytes.)
> >
> > This modification would go nicely hand in hand with the mbuf pool
> indirection modification.
> >
> > ... after yet another round of ABI stability discussions,
> > deprecation notices, and so on. :-)
> 
> After more thoughts, I'm afraid 64k is too small in some cases.
> And 24-bit manipulation would probably break performance.
> I'm afraid we are stuck with 32-bit length.

Yes, 24 bit manipulation would probably break performance.

Perhaps a solution exists with 16 bits (least significant bits) for
the common cases, and 8 bits more (most significant bits) for the less
common cases. Just thinking out loud here...
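
To illustrate the bit arithmetic of such a split, a small sketch with
purely hypothetical field names (pkt_len_lo, pkt_len_hi), only to make
the thought concrete:

    /* Hypothetical split of the packet length: pkt_len_lo holds the
     * 16 least significant bits for the common case (up to 64 KiB),
     * pkt_len_hi the 8 most significant bits for larger packets. */
    static inline uint32_t
    pkt_len_read(uint16_t pkt_len_lo, uint8_t pkt_len_hi)
    {
        return ((uint32_t)pkt_len_hi << 16) | pkt_len_lo;
    }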

> 
> > > > we can get the 8 bit mbuf pool index into the first cache line
> > > > at no additional cost.
> > >
> > > I like the idea.
> > > It means we don't need to move the pool pointer now,
> > > i.e. it does not have to replace the timestamp field.
> >
> > Agreed! Don't move m->pool to the first cache line; it is not used
> for RX.
> >
> > >
> > > > In other words: This would free up another 64 bit field in the
> > > > mbuf structure!
> > >
> > > That would be great!
> > >
> > >
> > > > And even though the m->next pointer for scattered packets resides
> > > > in the second cache line, the libraries and application knows
> > > > that m->next is NULL when m->nb_segs is 1.
> > > > This proves that my suggestion would make touching
> > > > the second cache line unnecessary (in simple cases),
> > > > even for re-initializing the mbuf.
> > >
> > > So you think the "next" pointer should stay in the second half of
> mbuf?
> > >
> > > I feel you would like to move the Tx offloads in the first half
> > > to improve performance of very simple apps.
> >
> > "Very simple apps" sounds like a minority of apps. I would rather say
> "very simple packet handling scenarios", e.g. forwarding of normal size
> non-segmented packets. I would guess that the vast majority of packets
> handled by DPDK applications actually match this scenario. So I'm
> proposing to optimize for what I think is the most common scenario.
> >
> > If segmented packets are common, then m->next could be moved to the
> first cache line. But it will only improve the pure RX steps of the
> pipeline. When preparing the packet for TX, m->tx_offload will need to
> be set, and the second cache line comes into play. So I'm wondering how
> big the benefit of having m->next in the first cache line really is -
> assuming that m->nb_segs will be checked before accessing m->next.
> >
> > > I am thinking the opposite: we could have some dynamic fields space
> > > in the first half to improve performance of complex Rx.
> > > Note: we can add a flag hint for field registration in this first
> half.
> > >
> >
> > I have had the same thoughts. However, I would prefer being able to
> forward ordinary packets without using the second mbuf cache line at
> all (although only in specific scenarios like my example above).
> >
> > Furthermore, the application can abuse the 64 bit m->tx_offload field
> for private purposes until it is time to prepare the packet for TX and
> pass it on to the driver. This hack somewhat resembles a dynamic field
> in the first cache line, and will not be possible if the m->pool or
> m->next field is moved there.