[PATCH v8 3/3] mbuf: optimize reset of reinitialized mbufs
Rahul Bhansali
rbhansali at marvell.com
Wed Apr 1 08:12:46 CEST 2026
> -----Original Message-----
> From: Morten Brørup <mb at smartsharesystems.com>
> Sent: Saturday, March 7, 2026 1:34 AM
> To: Rahul Bhansali <rbhansali at marvell.com>; Bruce Richardson <bruce.richardson at intel.com>; Stephen Hemminger
> <stephen at networkplumber.org>
> Cc: dev at dpdk.org; Thomas Monjalon <thomas at monjalon.net>; Konstantin Ananyev <konstantin.ananyev at huawei.com>; Andrew
> Rybchenko <andrew.rybchenko at oktetlabs.ru>; Ivan Malov <ivan.malov at arknetworks.am>; Chengwen Feng
> <fengchengwen at huawei.com>; Jerin Jacob <jerinj at marvell.com>; Nithin Kumar Dabilpuram <ndabilpuram at marvell.com>; Ashwin
> Sekhar T K <asekhar at marvell.com>
> Subject: [EXTERNAL] RE: [PATCH v8 3/3] mbuf: optimize reset of reinitialized mbufs
>
> > From: Rahul Bhansali [mailto:rbhansali at marvell.com]
> > Sent: Friday, 6 March 2026 17.04
> >
> > > From: Morten Brørup <mailto:mb at smartsharesystems.com>
> > > Sent: Friday, March 6, 2026 8:23 PM
> > >
> > > > From: Rahul Bhansali [mailto:rbhansali at marvell.com]
> > > > Sent: Friday, 6 March 2026 13.19
> > > >
> > > > Please see inline.
> > > >
> > > > > From: Bruce Richardson <mailto:bruce.richardson at intel.com>
> > > > > Sent: Monday, October 20, 2025 2:17 PM
> > > > >
> > > > > On Sun, Oct 19, 2025 at 01:45:45PM -0700, Stephen Hemminger
> > wrote:
> > > > > > On Thu, 9 Oct 2025 18:15:12 +0100
> > > > > > Bruce Richardson <mailto:bruce.richardson at intel.com> wrote:
> > > > > >
> > > > > > > On Sat, Aug 23, 2025 at 06:30:02AM +0000, Morten Brørup
> > wrote:
> > > > > > > > An optimized function for resetting a bulk of newly
> > allocated
> > > > > > > > reinitialized mbufs (a.k.a. raw mbufs) was added.
> > > > > > > >
> > > > > > > > Compared to the normal packet mbuf reset function, it takes
> > > > advantage of
> > > > > > > > the following two details:
> > > > > > > > 1. The 'next' and 'nb_segs' fields are already reset, so
> > > > resetting them
> > > > > > > > has been omitted.
> > > > > > > > 2. When resetting the mbuf, the 'ol_flags' field must
> > indicate
> > > > whether the
> > > > > > > > mbuf uses an external buffer, and the 'data_off' field must
> > not
> > > > exceed the
> > > > > > > > data room size when resetting the data offset to include
> > the
> > > > default
> > > > > > > > headroom.
> > > > > > > > Unlike the normal packet mbuf reset function, which reads
> > the
> > > > mbuf itself
> > > > > > > > to get the information required for resetting these two
> > fields,
> > > > this
> > > > > > > > function gets the information from the mempool.
> > > > > > > >
> > > > > > > > This makes the function write-only of the mbuf, unlike the
> > > > normal packet
> > > > > > > > mbuf reset function, which is read-modify-write of the
> > mbuf.
> > > > > > > >
> > > > > > > > Signed-off-by: Morten Brørup
> > <mailto:mb at smartsharesystems.com>
> > > > > > > > ---
> > > > > > > > lib/mbuf/rte_mbuf.h | 74 ++++++++++++++++++++++++++++-----
> > ----
> > > > --------
> > > > > > > > 1 file changed, 46 insertions(+), 28 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> > > > > > > > index 49c93ab356..6f37a2e91e 100644
> > > > > > > > --- a/lib/mbuf/rte_mbuf.h
> > > > > > > > +++ b/lib/mbuf/rte_mbuf.h
> > > > > > > > @@ -954,6 +954,50 @@ static inline void
> > > > rte_pktmbuf_reset_headroom(struct rte_mbuf *m)
> > > > > > > > (uint16_t)m->buf_len);
> > > > > > > > }
> > > > > > > >
> > > > > > > > +/**
> > > > > > > > + * Reset the fields of a bulk of packet mbufs to their
> > default
> > > > values.
> > > > > > > > + *
> > > > > > > > + * The caller must ensure that the mbufs come from the
> > > > specified mempool,
> > > > > > > > + * are direct and properly reinitialized (refcnt=1,
> > next=NULL,
> > > > nb_segs=1),
> > > >
> > > > [Rahul] For Marvell's CNxx SoCs, mbuf pointers alloc and free are
> > > > offloaded to HW for Rx/Tx, so these fields "next and nb_segs" will
> > not
> > > > be reset to default values by HW.
> > > > When packets are coming from wire, we reset these fields in Rx
> > > > fastpath, but in case of SW allocated mbuf, we cannot do it in
> > > > Marvell's mempool driver as that is unaware of mbuf.
> > >
> > > It has always been an invariant that mbufs stored in a mempool have
> > their "next" and "nb_segs" fields reset.
> > > This means that these fields must be reset before free.
> > >
> > > In an ethdev driver's normal Tx path, the driver calls
> > rte_pktmbuf_prefree_seg() before freeing an mbuf.
> > > Does your ethdev driver not do that?
> > [Rahul] We support this when the mbuf fast free offload
> > (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) is disabled.
> > When the mbuf fast free offload is enabled, the mbuf is freed by HW after
> > transmission.
>
> Great. This limits the challenge to FAST_FREE Tx processing only.
> There are two different solutions to this:
>
> 1. When choosing the Tx function for a queue, only select your "fast" Tx function if both RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is set
> and RTE_ETH_TX_OFFLOAD_MULTI_SEGS is not set.
> Note: If MULTI_SEGS is not set, packets are not segmented, so the "next" and "nb_segs" fields are never modified, and thus remain
> reset.
> This limits the "fast" Tx function to support non-segmented packets only, so not the optimal solution.
>
> 2. Modify your "fast" Tx function to reset the mbuf "next" and "nb_segs" fields when writing the hardware Tx descriptor (i.e. at an
> earlier Tx processing stage in the driver), so the fields are already reset when freeing the mbuf.
> This allows the "fast" Tx function to support both non-segmented and segmented packets.
Sorry for the late reply. We already select fastpath functions based on offload flags, and we will change the multi-seg offload Tx path to reset the mbuf "next" and "nb_segs" fields.
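
For illustration only, option 2 above could look roughly like the sketch below. It is not the cnxk driver code: struct mock_mbuf, mock_tx_prepare_chain() and the hw_ring array are hypothetical stand-ins for struct rte_mbuf, the driver's Tx descriptor writer and the hardware ring. The idea is to read each segment's link before clearing it, so the chain can still be walked while "next" and "nb_segs" are returned to their pool defaults.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf, reduced to the fields discussed. */
struct mock_mbuf {
	struct mock_mbuf *next;
	uint16_t nb_segs;
	uint16_t data_len;
};

/*
 * Hypothetical Tx descriptor writer: queues every segment of a chain and
 * resets "next" and "nb_segs" on the way, so the segments already satisfy
 * the mempool invariant when the hardware frees them after transmission.
 * Returns the number of segments queued, or 0 if the ring has no room.
 */
static unsigned int
mock_tx_prepare_chain(struct mock_mbuf *head, struct mock_mbuf **hw_ring,
		unsigned int ring_free)
{
	struct mock_mbuf *seg = head;
	unsigned int n = 0;

	/* Do not touch the chain unless all of its segments fit. */
	if (head == NULL || head->nb_segs > ring_free)
		return 0;

	while (seg != NULL) {
		/* Read the link before clearing it. */
		struct mock_mbuf *next = seg->next;

		/* Stands in for writing one hardware Tx descriptor. */
		hw_ring[n++] = seg;

		/* Restore the defaults expected of mbufs stored in a mempool. */
		seg->next = NULL;
		seg->nb_segs = 1;

		seg = next;
	}

	return n;
}
```

The extra stores land on cache lines the Tx path has already touched to read the segment, so the additional cost should be small, but that would need to be measured on the actual hardware.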
>
> >
> > >
> > > > Is it possible to reset these also in rte_mbuf_raw_reset_bulk()
> > itself
> > > > for mbuf alloc requests ?
> > >
> > > Due to the invariant (about mbufs stored in a mempool having their
> > "next" and "nb_segs" fields reset), resetting them again in
> > > rte_mbuf_raw_reset_bulk() after fetching the mbufs from the mempool
> > (i.e. after calling rte_mempool_get_bulk()) is considered
> > > unnecessary.
> > >
> > > PS:
> > > I wish for a roadmap towards eliminating this invariant, and instead
> > require the ethdev drivers to reset the "nb_segs" and "next" fields in
> > > the Rx fastpath, where the driver is initializing many other
> > mbuf fields anyway, and the additional cost is near-zero.
> > > One of the steps in such a roadmap could be to reset the "nb_segs"
> > and "next" fields in the rte_mbuf_raw_reset_bulk() function, for
> > > ethdev drivers which haven't implemented it yet.
> > >
> > > >
> > > > > > > > + * as done by rte_pktmbuf_prefree_seg().
> > > > > > > > + *
> > > > > > > > + * This function should be used with care, when
> > optimization
> > > > is required.
> > > > > > > > + * For standard needs, prefer rte_pktmbuf_reset().
> > > > > > > > + *
> > > > > > > > + * @param mp
> > > > > > > > + * The mempool to which the mbuf belongs.
> > > > > > > > + * @param mbufs
> > > > > > > > + * Array of pointers to packet mbufs.
> > > > > > > > + * The array must not contain NULL pointers.
> > > > > > > > + * @param count
> > > > > > > > + * Array size.
> > > > > > > > + */
> > > > > > > > +static inline void
> > > > > > > > +rte_mbuf_raw_reset_bulk(struct rte_mempool *mp, struct
> > > > rte_mbuf **mbufs, unsigned int count)
> > > > > > > > +{
> > > > > > > > + uint64_t ol_flags = (rte_pktmbuf_priv_flags(mp) &
> > > > RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) ?
> > > > > > > > + RTE_MBUF_F_EXTERNAL : 0;
> > > > > > > > + uint16_t data_off = RTE_MIN_T(RTE_PKTMBUF_HEADROOM,
> > > > rte_pktmbuf_data_room_size(mp),
> > > > > > > > + uint16_t);
> > > > > > > > +
> > > > > > > > + for (unsigned int idx = 0; idx < count; idx++) {
> > > > > > > > + struct rte_mbuf *m = mbufs[idx];
> > > > > > > > +
> > > > > > > > + m->pkt_len = 0;
> > > > > > > > + m->tx_offload = 0;
> > > > > > > > + m->vlan_tci = 0;
> > > > > > > > + m->vlan_tci_outer = 0;
> > > > > > > > + m->port = RTE_MBUF_PORT_INVALID;
> > > > > > >
> > > > > > > Have you considered doing all initialization using 64-bit
> > stores?
> > > > It's
> > > > > > > generally cheaper to do a single 64-bit store than e.g. a set
> > of
> > > > 16-bit ones.
> > > > > > > This also means that we could remove the restriction on
> > having
> > > > refcnt and
> > > > > > > nb_segs already set. As in PMDs, a single store can init
> > > > data_off, ref_cnt,
> > > > > > > nb_segs and port.
> > > > > > >
> > > > > > > Similarly for packet_type and pkt_len, and data_len/vlan_tci
> > and
> > > > rss fields
> > > > > > > etc. For max performance, the whole of the mbuf cleared here
> > can
> > > > be done in
> > > > > > > 40 bytes, or 5 64-bit stores. If we do the stores in order,
> > > > possibly the
> > > > > > > compiler can even opportunistically coalesce more stores, so
> > we
> > > > could even
> > > > > > > end up getting 128-bit or larger stores depending on the ISA
> > > > compiled for.
> > > > > > > [Maybe the compiler will do this even if they are not in
> > order,
> > > > but I'd
> > > > > > > like to maximize my chances here! :-)]
> > > > > > >
> > > > > > > /Bruce
> > > > > >
> > > > > > Although it is possible to use fewer CPU instructions, the
> > > > performance
> > > > > > limiting factor is which fields are in cache.
> > > > >
> > > > > Yes, the cache presence of the target of the stores has a massive
> > > > effect on
> > > > > how well the code will perform. However, the number of stores can
> > > > make a
> > > > > difference too - especially if you are in store-heavy code.
> > Consider
> > > > the
> > > > > number of store operations which would be generated by storing
> > > > > field-by-field to a burst of 32 packets. With the previous work
> > we
> > > > have
> > > > > done on our PMDs, and vectorizing them, we got a noticeable
> > benefit
> > > > from
> > > > > doing larger vector stores compared to smaller ones!
> > > > >
> > > > > /Bruce
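
As a rough sketch of the single-store idea Bruce describes above, the four 16-bit fields can be built once into a 64-bit template and then written with one store per mbuf. This is not the patch code: struct mock_rearm, struct mock_mbuf, mock_rearm_template() and mock_reset_bulk() are hypothetical names, and the layout only mimics the adjacency of data_off, refcnt, nb_segs and port in struct rte_mbuf.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte group of adjacent 16-bit fields. */
struct mock_rearm {
	uint16_t data_off;
	uint16_t refcnt;
	uint16_t nb_segs;
	uint16_t port;
};

/* The single-store trick only works if the group is exactly 64 bits. */
static_assert(sizeof(struct mock_rearm) == sizeof(uint64_t),
		"rearm group must be 8 bytes");

/* Hypothetical stand-in for struct rte_mbuf. */
struct mock_mbuf {
	void *buf_addr;
	struct mock_rearm rearm;
	uint64_t ol_flags;
};

/* Build the 64-bit template once, outside the per-mbuf loop. */
static inline uint64_t
mock_rearm_template(uint16_t data_off, uint16_t port)
{
	struct mock_rearm r = {
		.data_off = data_off,
		.refcnt = 1,
		.nb_segs = 1,
		.port = port,
	};
	uint64_t v;

	memcpy(&v, &r, sizeof(v));
	return v;
}

/* Write-only reset: one 8-byte copy replaces four 16-bit stores. */
static inline void
mock_reset_bulk(struct mock_mbuf **mbufs, unsigned int count,
		uint64_t rearm, uint64_t ol_flags)
{
	for (unsigned int idx = 0; idx < count; idx++) {
		struct mock_mbuf *m = mbufs[idx];

		/* Fixed-size memcpy; compilers usually emit a single store. */
		memcpy(&m->rearm, &rearm, sizeof(rearm));
		m->ol_flags = ol_flags;
	}
}
```

Because the template also carries refcnt and nb_segs, a reset written this way would not depend on those fields already being set, which is the relaxation Bruce mentions. Whether the compiler merges neighbouring stores into wider ones depends on the ISA and optimization level, so the generated code is worth checking.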