[PATCH v8 3/3] mbuf: optimize reset of reinitialized mbufs
Morten Brørup
mb at smartsharesystems.com
Fri Mar 6 21:04:09 CET 2026
> From: Rahul Bhansali [mailto:rbhansali at marvell.com]
> Sent: Friday, 6 March 2026 17.04
>
> > From: Morten Brørup <mb at smartsharesystems.com>
> > Sent: Friday, March 6, 2026 8:23 PM
> >
> > > From: Rahul Bhansali [mailto:rbhansali at marvell.com]
> > > Sent: Friday, 6 March 2026 13.19
> > >
> > > Please see inline.
> > >
> > > > From: Bruce Richardson <mailto:bruce.richardson at intel.com>
> > > > Sent: Monday, October 20, 2025 2:17 PM
> > > >
> > > > On Sun, Oct 19, 2025 at 01:45:45PM -0700, Stephen Hemminger
> wrote:
> > > > > On Thu, 9 Oct 2025 18:15:12 +0100
> > > > > Bruce Richardson <mailto:bruce.richardson at intel.com> wrote:
> > > > >
> > > > > > On Sat, Aug 23, 2025 at 06:30:02AM +0000, Morten Brørup
> wrote:
> > > > > > > An optimized function for resetting a bulk of newly
> allocated
> > > > > > > reinitialized mbufs (a.k.a. raw mbufs) was added.
> > > > > > >
> > > > > > > Compared to the normal packet mbuf reset function, it takes
> > > advantage of
> > > > > > > the following two details:
> > > > > > > 1. The 'next' and 'nb_segs' fields are already reset, so
> > > resetting them
> > > > > > > has been omitted.
> > > > > > > 2. When resetting the mbuf, the 'ol_flags' field must
> indicate
> > > whether the
> > > > > > > mbuf uses an external buffer, and the 'data_off' field must
> not
> > > exceed the
> > > > > > > data room size when resetting the data offset to include
> the
> > > default
> > > > > > > headroom.
> > > > > > > Unlike the normal packet mbuf reset function, which reads
> the
> > > mbuf itself
> > > > > > > to get the information required for resetting these two
> fields,
> > > this
> > > > > > > function gets the information from the mempool.
> > > > > > >
> > > > > > > This makes the function write-only of the mbuf, unlike the
> > > normal packet
> > > > > > > mbuf reset function, which is read-modify-write of the
> mbuf.
> > > > > > >
> > > > > > > Signed-off-by: Morten Brørup
> <mailto:mb at smartsharesystems.com>
> > > > > > > ---
> > > > > > > lib/mbuf/rte_mbuf.h | 74 ++++++++++++++++++++++++++++-----
> ----
> > > --------
> > > > > > > 1 file changed, 46 insertions(+), 28 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> > > > > > > index 49c93ab356..6f37a2e91e 100644
> > > > > > > --- a/lib/mbuf/rte_mbuf.h
> > > > > > > +++ b/lib/mbuf/rte_mbuf.h
> > > > > > > @@ -954,6 +954,50 @@ static inline void
> > > rte_pktmbuf_reset_headroom(struct rte_mbuf *m)
> > > > > > > (uint16_t)m->buf_len);
> > > > > > > }
> > > > > > >
> > > > > > > +/**
> > > > > > > + * Reset the fields of a bulk of packet mbufs to their
> default
> > > values.
> > > > > > > + *
> > > > > > > + * The caller must ensure that the mbufs come from the
> > > specified mempool,
> > > > > > > + * are direct and properly reinitialized (refcnt=1,
> next=NULL,
> > > nb_segs=1),
> > >
> > > [Rahul] For Marvell's CNxx SoCs, mbuf pointers alloc and free are
> > > offloaded to HW for Rx/Tx, so these fields "next and nb_segs" will
> not
> > > be reset to default values by HW.
> > > When packets are coming from wire, we reset these fields in Rx
> > > fastpath, but in case of SW allocated mbuf, we cannot do it in
> > > Marvell's mempool driver as that is unaware of mbuf.
> >
> > It has always been an invariant that mbufs stored in a mempool have
> their "next" and "nb_segs" fields reset.
> > This means that these fields must be reset before free.
> >
> > In an ethdev driver's normal Tx path, the driver calls
> rte_pktmbuf_prefree_seg() before freeing an mbuf.
> > Does your ethdev driver not do that?
> [Rahul] We support this when mbuf fast free offload is not used
> (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is disabled).
> When mbuf fast free offload is enabled, the mbuf is freed by HW after
> transmission.
Great. This limits the challenge to FAST_FREE Tx processing only.
There are two possible solutions:
1. When choosing the Tx function for a queue, select your "fast" Tx function only if RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is set and RTE_ETH_TX_OFFLOAD_MULTI_SEGS is not set.
Note: If MULTI_SEGS is not set, packets are not segmented, so the "next" and "nb_segs" fields are never modified, and thus remain reset.
This limits the "fast" Tx function to non-segmented packets, so it is not the optimal solution.
2. Modify your "fast" Tx function to reset the mbuf "next" and "nb_segs" fields when writing the hardware Tx descriptor (i.e. at an earlier Tx processing stage in the driver), so the fields are already reset when freeing the mbuf.
This allows the "fast" Tx function to support both non-segmented and segmented packets.
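
For illustration, the two options could look roughly like the sketch below. It uses self-contained stand-ins, not real driver code: struct mbuf_sketch, the TX_OFFLOAD_* bits, select_fast_tx() and tx_prepare_for_hw_free() are hypothetical names, and the bit values are not claimed to match rte_ethdev.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in offload bits (assumption); a real driver would test
 * RTE_ETH_TX_OFFLOAD_MULTI_SEGS and RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE. */
#define TX_OFFLOAD_MULTI_SEGS      (UINT64_C(1) << 0)
#define TX_OFFLOAD_MBUF_FAST_FREE  (UINT64_C(1) << 1)

/* Stand-in for the two struct rte_mbuf fields covered by the invariant. */
struct mbuf_sketch {
	struct mbuf_sketch *next;
	uint16_t nb_segs;
};

/* Option 1: use the "fast" (hardware-free) Tx function only when packets
 * cannot be segmented, so "next" and "nb_segs" are never modified. */
static inline bool
select_fast_tx(uint64_t tx_offloads)
{
	return (tx_offloads & TX_OFFLOAD_MBUF_FAST_FREE) != 0 &&
	       (tx_offloads & TX_OFFLOAD_MULTI_SEGS) == 0;
}

/* Option 2: while writing the Tx descriptors of a packet, reset "next" and
 * "nb_segs" of every segment, so the mbufs already satisfy the invariant
 * when the hardware frees them. Returns the number of segments handled. */
static inline unsigned int
tx_prepare_for_hw_free(struct mbuf_sketch *pkt)
{
	unsigned int nb_desc = 0;

	assert(pkt != NULL);
	while (pkt != NULL) {
		/* Read the link before clearing it. */
		struct mbuf_sketch *next = pkt->next;

		/* A real driver would write this segment's descriptor here. */
		pkt->next = NULL;
		pkt->nb_segs = 1;
		nb_desc++;
		pkt = next;
	}
	return nb_desc;
}
```

With option 2 the extra cost is two stores per segment to an mbuf the Tx function is already reading, and the "fast" path no longer depends on MULTI_SEGS being disabled.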
>
> >
> > > Is it possible to reset these also in rte_mbuf_raw_reset_bulk()
> itself
> > > for mbuf alloc requests ?
> >
> > Due to the invariant (about mbufs stored in a mempool having their
> "next" and "nb_segs" fields reset), resetting them again in
> > rte_mbuf_raw_reset_bulk() after fetching the mbufs from the mempool
> (i.e. after calling rte_mempool_get_bulk()) is considered
> > unnecessary.
> >
> > PS:
> > I wish for a roadmap towards eliminating this invariant, and instead
> > requiring the ethdev drivers to reset the "nb_segs" and "next" fields in
> > the Rx fastpath - where the driver is initializing many other
> > mbuf fields anyway, and the additional cost is near-zero.
> > One of the steps in such a roadmap could be to reset the "nb_segs"
> > and "next" fields in the rte_mbuf_raw_reset_bulk() function, for
> > ethdev drivers which haven't implemented it yet.
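
A rough sketch of what such a roadmap step could look like, assuming a simplified stand-in for the mbuf. struct raw_mbuf, struct pool_reset_values, PORT_INVALID_SKETCH and raw_reset_bulk_sketch() are hypothetical names, and only the fields visible in the quoted patch are covered; this is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_INVALID_SKETCH UINT16_MAX /* stand-in for RTE_MBUF_PORT_INVALID */

/* Simplified stand-in for struct rte_mbuf (assumption, not the real layout). */
struct raw_mbuf {
	struct raw_mbuf *next;
	uint64_t ol_flags;
	uint64_t tx_offload;
	uint32_t pkt_len;
	uint16_t data_off;
	uint16_t nb_segs;
	uint16_t port;
	uint16_t vlan_tci;
	uint16_t vlan_tci_outer;
};

/* Values derived once from the mempool instead of once per mbuf. */
struct pool_reset_values {
	uint64_t ol_flags; /* external-buffer flag, or 0 */
	uint16_t data_off; /* min(default headroom, data room size) */
};

static inline void
raw_reset_bulk_sketch(const struct pool_reset_values *pv,
		struct raw_mbuf **mbufs, unsigned int count)
{
	for (unsigned int idx = 0; idx < count; idx++) {
		struct raw_mbuf *m = mbufs[idx];

		assert(m != NULL); /* the array must not contain NULL pointers */

		/* Only stores from here on; no mbuf field is read. */
		m->next = NULL;  /* extra reset: the proposed roadmap step */
		m->nb_segs = 1;  /* extra reset: the proposed roadmap step */
		m->ol_flags = pv->ol_flags;
		m->data_off = pv->data_off;
		m->pkt_len = 0;
		m->tx_offload = 0;
		m->vlan_tci = 0;
		m->vlan_tci_outer = 0;
		m->port = PORT_INVALID_SKETCH;
	}
}
```

The two extra stores keep the function write-only of the mbuf, which is why they could be added here without reintroducing a read-modify-write.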
> >
> > >
> > > > > > > + * as done by rte_pktmbuf_prefree_seg().
> > > > > > > + *
> > > > > > > + * This function should be used with care, when
> optimization
> > > is required.
> > > > > > > + * For standard needs, prefer rte_pktmbuf_reset().
> > > > > > > + *
> > > > > > > + * @param mp
> > > > > > > + * The mempool to which the mbuf belongs.
> > > > > > > + * @param mbufs
> > > > > > > + * Array of pointers to packet mbufs.
> > > > > > > + * The array must not contain NULL pointers.
> > > > > > > + * @param count
> > > > > > > + * Array size.
> > > > > > > + */
> > > > > > > +static inline void
> > > > > > > +rte_mbuf_raw_reset_bulk(struct rte_mempool *mp, struct
> > > rte_mbuf **mbufs, unsigned int count)
> > > > > > > +{
> > > > > > > + uint64_t ol_flags = (rte_pktmbuf_priv_flags(mp) &
> > > RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) ?
> > > > > > > + RTE_MBUF_F_EXTERNAL : 0;
> > > > > > > + uint16_t data_off = RTE_MIN_T(RTE_PKTMBUF_HEADROOM,
> > > rte_pktmbuf_data_room_size(mp),
> > > > > > > + uint16_t);
> > > > > > > +
> > > > > > > + for (unsigned int idx = 0; idx < count; idx++) {
> > > > > > > + struct rte_mbuf *m = mbufs[idx];
> > > > > > > +
> > > > > > > + m->pkt_len = 0;
> > > > > > > + m->tx_offload = 0;
> > > > > > > + m->vlan_tci = 0;
> > > > > > > + m->vlan_tci_outer = 0;
> > > > > > > + m->port = RTE_MBUF_PORT_INVALID;
> > > > > >
> > > > > > Have you considered doing all initialization using 64-bit
> stores?
> > > It's
> > > > > > generally cheaper to do a single 64-bit store than e.g. a set
> of
> > > 16-bit ones.
> > > > > > This also means that we could remove the restriction on
> having
> > > refcnt and
> > > > > > nb_segs already set. As in PMDs, a single store can init
> > > data_off, ref_cnt,
> > > > > > nb_segs and port.
> > > > > >
> > > > > > Similarly for packet_type and pkt_len, and data_len/vlan_tci
> and
> > > rss fields
> > > > > > etc. For max performance, the whole of the mbuf cleared here
> can
> > > be done in
> > > > > > 40 bytes, or 5 64-bit stores. If we do the stores in order,
> > > possibly the
> > > > > > compiler can even opportunistically coalesce more stores, so
> we
> > > could even
> > > > > > end up getting 128-bit or larger stores depending on the ISA
> > > compiled for.
> > > > > > [Maybe the compiler will do this even if they are not in
> order,
> > > but I'd
> > > > > > like to maximize my chances here! :-)]
> > > > > >
> > > > > > /Bruce
> > > > >
> > > > > Although it is possible to use fewer CPU instructions, the
> > > performance
> > > > > limiting factor is which fields are in cache.
> > > >
> > > > Yes, the cache presence of the target of the stores has a massive
> > > effect on
> > > > how well the code will perform. However, the number of stores can
> > > make a
> > > > difference too - especially if you are in store-heavy code.
> Consider
> > > the
> > > > number of store operations which would be generated by storing
> > > > field-by-field to a burst of 32 packets. With the previous work
> we
> > > have
> > > > done on our PMDs, and vectorizing them, we got a noticeable
> benefit
> > > from
> > > > doing larger vector stores compared to smaller ones!
> > > >
> > > > /Bruce
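
For reference, the single-store idea Bruce describes can be sketched as below. The struct is a simplified stand-in, not the real rte_mbuf layout, and struct rearm_fields, struct mbuf_rearm_sketch, make_rearm_value() and rearm_bulk() are hypothetical names. The sketch assumes four adjacent 16-bit fields that can be overlaid with one 64-bit word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four adjacent 16-bit fields that are initialized together. */
struct rearm_fields {
	uint16_t data_off;
	uint16_t refcnt;
	uint16_t nb_segs;
	uint16_t port;
};

static_assert(sizeof(struct rearm_fields) == sizeof(uint64_t),
	"the four fields must fill exactly one 64-bit word");

/* Simplified stand-in mbuf: the fields are overlaid with a 64-bit word. */
struct mbuf_rearm_sketch {
	void *buf_addr;
	union {
		struct rearm_fields f;
		uint64_t rearm_data;
	};
};

/* Build the 64-bit template once per burst; memcpy keeps this
 * independent of the byte order of the machine. */
static inline uint64_t
make_rearm_value(uint16_t data_off, uint16_t port)
{
	const struct rearm_fields f = {
		.data_off = data_off, .refcnt = 1, .nb_segs = 1, .port = port,
	};
	uint64_t v;

	memcpy(&v, &f, sizeof(v));
	return v;
}

/* One 64-bit store per mbuf instead of four 16-bit stores. */
static inline void
rearm_bulk(struct mbuf_rearm_sketch **mbufs, unsigned int count, uint64_t rearm)
{
	for (unsigned int idx = 0; idx < count; idx++)
		mbufs[idx]->rearm_data = rearm;
}
```

Because the store overwrites refcnt and nb_segs unconditionally, their previous values no longer matter, which is the relaxation of the precondition mentioned in the quoted text.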