[PATCH v8 3/3] mbuf: optimize reset of reinitialized mbufs
Bruce Richardson
bruce.richardson at intel.com
Mon Oct 20 10:46:46 CEST 2025
On Sun, Oct 19, 2025 at 01:45:45PM -0700, Stephen Hemminger wrote:
> On Thu, 9 Oct 2025 18:15:12 +0100
> Bruce Richardson <bruce.richardson at intel.com> wrote:
>
> > On Sat, Aug 23, 2025 at 06:30:02AM +0000, Morten Brørup wrote:
> > > An optimized function for resetting a bulk of newly allocated
> > > reinitialized mbufs (a.k.a. raw mbufs) was added.
> > >
> > > Compared to the normal packet mbuf reset function, it takes advantage of
> > > the following two details:
> > > 1. The 'next' and 'nb_segs' fields are already reset, so resetting them
> > > has been omitted.
> > > 2. When resetting the mbuf, the 'ol_flags' field must indicate whether the
> > > mbuf uses an external buffer, and the 'data_off' field must not exceed the
> > > data room size when resetting the data offset to include the default
> > > headroom.
> > > Unlike the normal packet mbuf reset function, which reads the mbuf itself
> > > to get the information required for resetting these two fields, this
> > > function gets the information from the mempool.
> > >
> > > This makes the function write-only of the mbuf, unlike the normal packet
> > > mbuf reset function, which is read-modify-write of the mbuf.
> > >
> > > Signed-off-by: Morten Brørup <mb at smartsharesystems.com>
> > > ---
> > > lib/mbuf/rte_mbuf.h | 74 ++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 46 insertions(+), 28 deletions(-)
> > >
> > > diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> > > index 49c93ab356..6f37a2e91e 100644
> > > --- a/lib/mbuf/rte_mbuf.h
> > > +++ b/lib/mbuf/rte_mbuf.h
> > > @@ -954,6 +954,50 @@ static inline void rte_pktmbuf_reset_headroom(struct rte_mbuf *m)
> > > (uint16_t)m->buf_len);
> > > }
> > >
> > > +/**
> > > + * Reset the fields of a bulk of packet mbufs to their default values.
> > > + *
> > > + * The caller must ensure that the mbufs come from the specified mempool,
> > > + * are direct and properly reinitialized (refcnt=1, next=NULL, nb_segs=1),
> > > + * as done by rte_pktmbuf_prefree_seg().
> > > + *
> > > + * This function should be used with care, when optimization is required.
> > > + * For standard needs, prefer rte_pktmbuf_reset().
> > > + *
> > > + * @param mp
> > > + * The mempool to which the mbuf belongs.
> > > + * @param mbufs
> > > + * Array of pointers to packet mbufs.
> > > + * The array must not contain NULL pointers.
> > > + * @param count
> > > + * Array size.
> > > + */
> > > +static inline void
> > > +rte_mbuf_raw_reset_bulk(struct rte_mempool *mp, struct rte_mbuf **mbufs, unsigned int count)
> > > +{
> > > + uint64_t ol_flags = (rte_pktmbuf_priv_flags(mp) & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) ?
> > > + RTE_MBUF_F_EXTERNAL : 0;
> > > + uint16_t data_off = RTE_MIN_T(RTE_PKTMBUF_HEADROOM, rte_pktmbuf_data_room_size(mp),
> > > + uint16_t);
> > > +
> > > + for (unsigned int idx = 0; idx < count; idx++) {
> > > + struct rte_mbuf *m = mbufs[idx];
> > > +
> > > + m->pkt_len = 0;
> > > + m->tx_offload = 0;
> > > + m->vlan_tci = 0;
> > > + m->vlan_tci_outer = 0;
> > > + m->port = RTE_MBUF_PORT_INVALID;
> >
> > Have you considered doing all initialization using 64-bit stores? It's
> > generally cheaper to do a single 64-bit store than e.g. set of 16-bit ones.
> > This also means that we could remove the restriction on having refcnt and
> > nb_segs already set. As in PMDs, a single store can init data_off, ref_cnt,
> > nb_segs and port.
> >
> > Similarly for packet_type and pkt_len, and data_len/vlan_tci and rss fields
> > etc. For max performance, the whole of the mbuf cleared here can be done in
> > 40 bytes, or 5 64-bit stores. If we do the stores in order, possibly the
> > compiler can even opportunistically coalesce more stores, so we could even
> > end up getting 128-bit or larger stores depending on the ISA compiled for.
> > [Maybe the compiler will do this even if they are not in order, but I'd
> > like to maximize my chances here! :-)]
> >
> > /Bruce
>
> Although it is possible to use less CPU instructions, the performance
> limiting factor is which fields are in cache.
Yes, the cache presence of the target of the stores has a massive effect on
how well the code will perform. However, the number of stores can make a
difference too - especially if you are in store-heavy code. Consider the
number of store operations which would be generated by storing
field-by-field to a burst of 32 packets. With the previous work we have
done on our PMDs, and vectorizing them, we got a noticible benefit from
doing larger vector stores compared to smaller ones!
/Bruce
More information about the dev
mailing list