[RFC] ethdev: fix UAF in rte_eth_tx_burst with filtering

Morten Brørup mb at smartsharesystems.com
Mon Nov 10 09:34:53 CET 2025


+TO: Konstantin, regarding Stage Ordered Ring.

> From: Andrew Rybchenko [mailto:andrew.rybchenko at oktetlabs.ru]
> Sent: Sunday, 9 November 2025 16.23
> 
> If I remember correctly we should avoid function names in summary.
> 
> (I'm really surprised that callbacks and rte_eth_tx_burst() are
> supposed to reorder mbufs in tx_pkts array, but it looks like it
> the only sensible option.)

Reordering the packets is incompatible with the Stage Ordered Ring.
When the transmit function returns e.g. 2 non-consumed packets, the metadata will be associated with the 2 last packets in the original burst passed to the transmit function, not the 2 packets moved to the last position in the array.

> 
> On 10/30/25 02:54, Stephen Hemminger wrote:
> > When packets are filtered by Tx callback, they return value
> > needs to include those packets; otherwise the application will
> > do a duplicate free of those mbufs.
> >
> > The example would be application that sent a burst of 8 packets,
> > and 2 were filtered but the PMD was busy and could not send any.
> > The result should be a return value of 2 and mbufs 0..1 should
> > be freed and 2..7 should be untouched and up to application to
> > resend or drop.
> >
> > Original buggy code would return 0 and 0..1 would be freed
> > 2..7 would be unused. Application would then try and drop
> > or resend an already freed mbuf.
> >
> > Bugzilla ID: 1816
> > Fixes: 4dc294158cac ("ethdev: support optional Rx and Tx callbacks")
> > Cc: bruce.richardson at intel.com
> > Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
> > ---
> >   lib/ethdev/rte_ethdev.h | 47 ++++++++++++++++++++++++++++++--------
> ---
> >   1 file changed, 35 insertions(+), 12 deletions(-)
> >
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index a66c2abbdb..d2cfe390f9 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -6714,6 +6714,10 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
> queue_id,
> >   	}
> >   #endif
> >
> > +	uint16_t requested_pkts = nb_pkts;
> > +	uint16_t filtered_pkts = 0;
> > +	rte_mbuf_history_mark_bulk(tx_pkts, nb_pkts,
> RTE_MBUF_HISTORY_OP_TX);
> > +
> >   #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> >   	{
> >   		void *cb;
> > @@ -6727,22 +6731,41 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
> queue_id,
> >   		cb = rte_atomic_load_explicit(&p->txq.clbk[queue_id],
> >   				rte_memory_order_relaxed);
> >   		if (unlikely(cb != NULL))
> > -			nb_pkts = rte_eth_call_tx_callbacks(port_id,
> queue_id,
> > -					tx_pkts, nb_pkts, cb);
> > +			filtered_pkts = nb_pkts -
> > +				rte_eth_call_tx_callbacks(port_id, queue_id,
> tx_pkts, nb_pkts, cb);
> >   	}
> >   #endif
> >
> > -	uint16_t requested_pkts = nb_pkts;
> > -	rte_mbuf_history_mark_bulk(tx_pkts, nb_pkts,
> RTE_MBUF_HISTORY_OP_TX);
> > -
> > -	nb_pkts = p->tx_pkt_burst(qd, tx_pkts, nb_pkts);
> > -
> > -	if (requested_pkts > nb_pkts)
> > -		rte_mbuf_history_mark_bulk(tx_pkts + nb_pkts,
> > -				requested_pkts - nb_pkts,
> RTE_MBUF_HISTORY_OP_TX_BUSY);
> > +	uint16_t sent_pkts = p->tx_pkt_burst(qd, tx_pkts, nb_pkts -
> filtered_pkts);
> > +	uint16_t result_pkts = sent_pkts + filtered_pkts;
> > +
> > +	if (unlikely(result_pkts < requested_pkts)) {
> 
> First of all unlikely() is wrong here. It is not an error path. It is
> OK
> when not everything is sent and it should not cost too much as unlikely
> does.
> Second, I'd remove the condition completely. It is not a big deal to
> calculate unsent_pkts and call trace (if enabled) with zero
> unsent_pkts.
> 
> > +		uint16_t unsent_pkts = requested_pkts - result_pkts;
> > +
> > +		if (unlikely(filtered_pkts > 0)) {
> 
> Same here for unlikely().
> Also since filtered_pkts appear in RTE_ETHDEV_RXTX_CALLBACKS case,
> I'd put it under corresponding conditional compilation to be
> consistent.
> Yes, code with conditional compilation is hard to read, but consistency
> is more important here.
> 
> > +			/* Need to reshuffle packet list so that unsent
> packets are
> > +			 * after the return value
> > +			 *
> > +			 * Original: MMMMMM
> > +			 * Filter:   MMMMMF filtered = 1
> > +			 * Tx_burst: SSSUUF sent = 3 unsent = 2
> > +			 *
> > +			 * Return:   SSSFUU
> > +			 *   return from tx_burst is 4
> > +			 */
> > +			memmove(tx_pkts + result_pkts, tx_pkts + sent_pkts,
> > +				unsent_pkts * sizeof(struct rte_mbuf *));
> > +
> > +			/* Tx filter process does not drop the packets until
> here */
> > +			rte_pktmbuf_free_bulk(tx_pkts + sent_pkts,
> filtered_pkts);

NAK.
If the callback consumes the packet, the callback must free it too. The callback could consume the packet for other purposes, e.g. redirection, so freeing it here would introduce use-after-free bugs.

> > +		}
> > +
> > +		rte_mbuf_history_mark_bulk(tx_pkts + result_pkts,
> unsent_pkts,
> > +					   RTE_MBUF_HISTORY_OP_TX_BUSY);
> > +	}
> >
> > -	rte_ethdev_trace_tx_burst(port_id, queue_id, (void **)tx_pkts,
> nb_pkts);
> > -	return nb_pkts;
> > +	rte_ethdev_trace_tx_burst(port_id, queue_id, (void **)tx_pkts,
> result_pkts);
> > +	return result_pkts;
> >   }
> >
> >   /**

IMO, letting a Tx callback change the number of packets to actually transmit has too many side effects.
E.g. port packet counters are not updated accordingly if a callback consumes a packet and silently drops it.

The Ethdev Tx API is not designed for the ethdev layer to handle any of retransmit-or-drop (when insufficient descriptors), redirect, mirror, drop (redirect/mirror/drop of all packets or selectively based on e.g. VLAN or BPF filter).

IMO, we should update the callback API to reflect this, by changing the callback return type to void.



More information about the dev mailing list