[dpdk-dev] [PATCH 15/15] mbuf: move pool pointer in hotter first half

Ananyev, Konstantin konstantin.ananyev at intel.com
Thu Nov 5 01:25:33 CET 2020



> 
> Hi,
> 
> On Tue, Nov 03, 2020 at 04:03:46PM +0100, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Slava Ovsiienko
> > > Sent: Tuesday, November 3, 2020 3:03 PM
> > >
> > > Hi, Morten
> > >
> > > > From: Morten Brørup <mb at smartsharesystems.com>
> > > > Sent: Tuesday, November 3, 2020 14:10
> > > >
> > > > > From: Thomas Monjalon [mailto:thomas at monjalon.net]
> > > > > Sent: Monday, November 2, 2020 4:58 PM
> > > > >
> > > > > +Cc techboard
> > > > >
> > > > > We need benchmark numbers in order to take a decision.
> > > > > Please all, prepare some arguments and numbers so we can discuss the
> > > > > mbuf layout in the next techboard meeting.
> 
> I did some quick tests, and it appears to me that just moving the pool
> pointer to the first cache line does not have a significant impact.

Hmm, as I remember, Thomas mentioned about a 5%+ improvement
with that change. Though I suppose a lot depends on the actual test-case.
It would be good to know when it does help and when it doesn't.

> 
> However, I agree with Morten that there is some room for optimization
> around m->pool: I did a hack in the ixgbe driver to assume there is only
> one mbuf pool. This greatly simplifies the freeing of mbufs in Tx, because
> we don't have to group them in bulks that share the same pool (see
> ixgbe_tx_free_bufs()). The impact of this hack is quite good: +~5% on a
> real-life forwarding use case.

I think we already have the ability for such an optimization within DPDK:
#define DEV_TX_OFFLOAD_MBUF_FAST_FREE   0x00010000
/**< Device supports optimization for fast release of mbufs.
 *   When set, the application must guarantee that per-queue all mbufs come from
 *   the same mempool and have refcnt = 1.
 */

Seems over-optimistic to me, but many PMDs do support it.
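
Just to illustrate, a minimal sketch of the Tx free path this flag enables
(the helper name is made up, not taken from any PMD): when the application
guarantees a single mempool and refcnt == 1 per queue, the driver can return
the whole batch with one bulk put instead of grouping mbufs by pool.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical fast-free helper, valid only when the application set
 * DEV_TX_OFFLOAD_MBUF_FAST_FREE on the queue: every mbuf comes from the
 * same mempool and has refcnt == 1, so neither rte_pktmbuf_prefree_seg()
 * nor grouping by pool is needed. */
static inline void
txq_fast_free_bufs(struct rte_mbuf **txep, uint16_t n)
{
	rte_mempool_put_bulk(txep[0]->pool, (void **)txep, n);
}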

> 
> It is maybe possible to store the pool in the sw ring to avoid a later
> access to m->pool. Having a pool index as suggested by Morten would also
> help to reduce the room used in the sw ring in this case. But this is a bit
> off-topic :)
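
For what it's worth, a sketch of what storing the pool (or a pool index) in
the sw ring could look like (hypothetical struct, not the current ixgbe
layout):

#include <rte_mbuf.h>

/* Hypothetical Tx sw ring entry: remember the pool at enqueue time so the
 * free path never has to read m->pool again. With a pool index, a single
 * byte would be enough. */
struct txq_sw_entry {
	struct rte_mbuf *mbuf;
	struct rte_mempool *pool;	/* or: uint8_t pool_idx; */
};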
> 
> 
> 
> > > > I propose that the techboard considers this from two angles:
> > > >
> > > > 1. Long term goals and their relative priority. I.e. what can be achieved
> > > > with wide-ranging modifications, requiring yet another ABI break and due
> > > > notices.
> > > >
> > > > 2. Short term goals, i.e. what can be achieved for this release.
> > > >
> > > >
> > > > My suggestions follow...
> > > >
> > > > 1. Regarding long term goals:
> > > >
> > > > I have argued that simple forwarding of non-segmented packets using only
> > > > the first mbuf cache line can be achieved by making three modifications:
> > > >
> > > > a) Move m->tx_offload to the first cache line.
> > > Not all PMDs use this field on Tx. HW might support the checksum offloads
> > > directly, not requiring these fields at all.
> 
> To me, a driver should use m->tx_offload, because the application
> specifies the offset where the checksum has to be done, in case the hw
> is not able to recognize the protocol.
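
As a minimal, application-side example of what that means in practice (an
IPv4/TCP packet, using only standard mbuf fields):

#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>

/* Request IPv4 + TCP checksum offload for one packet. The driver reads
 * l2_len/l3_len from m->tx_offload to locate the headers. */
static inline void
request_ipv4_tcp_cksum(struct rte_mbuf *m)
{
	m->l2_len = sizeof(struct rte_ether_hdr);
	m->l3_len = sizeof(struct rte_ipv4_hdr);
	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
}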
> 
> > > > b) Use an 8 bit pktmbuf mempool index in the first cache line,
> > > >    instead of the 64 bit m->pool pointer in the second cache line.
> > > 256 mempools look like enough to me. Regarding the indirect access to the
> > > pool (via some table) - it might introduce some performance impact.
> >
> > It might, but I hope that it is negligible, so the benefits outweigh the disadvantages.
> >
> > It would have to be measured, though.
> >
> > And m->pool is only used for free()'ing (and detach()'ing) mbufs.
> >
> > > For example, the mlx5 PMD strongly relies on the pool field for allocating
> > > mbufs in the Rx datapath. We're going to update that (oh, we found a point
> > > to optimize), but for now it does.
> >
> > Without looking at the source code, I don't think the PMD is using m->pool in the RX datapath; I think it is using a pool dedicated to a
> > receive queue, used for RX descriptors in the PMD (i.e. driver->queue->pool).
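
Purely as an illustration of the 8-bit pool index idea (none of these names
exist in DPDK, it is just a sketch): free()/detach() would go through a small
process-wide table instead of dereferencing a 64-bit per-mbuf pointer.

#include <stdint.h>
#include <rte_mempool.h>

/* Hypothetical registry of pktmbuf pools, indexed by an 8-bit
 * m->pool_idx kept in the first cache line. */
static struct rte_mempool *mbuf_pool_table[UINT8_MAX + 1];

static inline struct rte_mempool *
mbuf_pool_from_idx(uint8_t pool_idx)
{
	/* One extra load from a small, read-mostly table vs. m->pool. */
	return mbuf_pool_table[pool_idx];
}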
> >
> > >
> > > > c) Do not access m->next when we know that it is NULL.
> > > >    We can use m->nb_segs == 1 or some other invariant as the gate.
> > > >    It can be implemented by adding an m->next accessor function:
> > > >    static inline struct rte_mbuf *rte_mbuf_next(struct rte_mbuf *m)
> > > >    {
> > > >        return m->nb_segs == 1 ? NULL : m->next;
> > > >    }
> > >
> > > Sorry, not sure about this. IIRC, nb_segs is valid in the first
> > > segment/mbuf only. If we have 4 segments in the pkt we see nb_segs=4 in the
> > > first one, and nb_segs=1 in the others. The next field is NULL in the last
> > > mbuf only. Am I wrong or missing something?
> >
> > You are correct.
> >
> > This would have to be updated too, either by increasing m->nb_segs in the following segments, or by splitting up the relevant functions into
> > functions for working on first segments (incl. non-segmented packets), and functions for working on following segments of segmented
> > packets.
> 
> Instead of maintaining a valid nb_segs, a HAS_NEXT flag would be easier
> to implement. However, it means that an accessor needs to be used instead
> of any direct m->next access.
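
Something along these lines, as a sketch (the flag is hypothetical, not an
existing ol_flags bit):

#include <rte_mbuf.h>

/* Hypothetical "packet has a chained segment" flag. */
#define MBUF_F_HAS_NEXT (1ULL << 63)

static inline struct rte_mbuf *
rte_mbuf_next(const struct rte_mbuf *m)
{
	/* Only touch the second cache line when a next segment exists. */
	return (m->ol_flags & MBUF_F_HAS_NEXT) ? m->next : NULL;
}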
> 
> > > > Regarding the priority of this goal, I guess that simple forwarding of
> > > > non-segmented packets is probably the path taken by the majority of
> > > > packets handled by DPDK.
> > > >
> > > > An alternative goal could be:
> > > > Do not touch the second cache line during RX.
> > > > A comment in the mbuf structure says so, but it is not true anymore.
> > > >
> > > > (I guess that regression testing didn't catch this because the tests
> > > > perform TX immediately after RX, so the cache miss just moves from the TX
> > > > to the RX part of the test application.)
> > > >
> > > >
> > > > 2. Regarding short term goals:
> > > >
> > > > The current DPDK source code looks to me like m->next is the most
> > > > frequently accessed field in the second cache line, so it makes sense
> > > > moving this to the first cache line, rather than m->pool.
> > > > Benchmarking may help here.
> > >
> > > Moreover, for segmented packets the packet size is supposed to be large,
> > > which implies a relatively low packet rate, so the optimization of moving
> > > next to the 1st cache line might be negligible. Just compare 148 Mpps of
> > > 64B pkts and 4 Mpps of 3000B pkts over a 100Gbps link. We are currently
> > > benchmarking and have not yet succeeded in finding a difference. The
> > > benefit can't be expressed as an mpps delta; we should measure CPU clocks,
> > > but the Rx queue is almost always empty - we have empty loops. So, if
> > > there is a boost, it is extremely hard to catch.
> >
> > Very good point regarding the value of such an optimization, Slava!
> >
> > And when free()'ing packets, both m->next and m->pool are touched.
> >
> > So perhaps the free()/detach() functions in the mbuf library can be modified to handle first segments (and non-segmented packets) and
> > following segments differently, so accessing m->next can be avoided for non-segmented packets. Then m->pool should be moved to the
> > first cache line.
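
A rough sketch of that split (hypothetical helper, ignoring refcnt and
indirect mbufs for brevity): the fast path frees a non-segmented packet
without ever reading m->next, and only needs m->pool - the field this patch
proposes to move to the first cache line.

#include <rte_branch_prediction.h>
#include <rte_mbuf.h>

static inline void
pktmbuf_free_split(struct rte_mbuf *m)
{
	if (likely(m->nb_segs == 1)) {
		/* Non-segmented fast path: m->next is known to be NULL. */
		rte_mempool_put(m->pool, m);
		return;
	}

	/* Chained packets keep the usual walk over m->next. */
	while (m != NULL) {
		struct rte_mbuf *next = m->next;

		rte_pktmbuf_free_seg(m);
		m = next;
	}
}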
> >
> 
> I also think that moving m->pool without doing something else about
> m->next is probably useless. And it's too late for 20.11 to do
> additional changes, so I suggest postponing the field move to 21.11,
> once we have a clearer view of possible optimizations.
> 
> Olivier

