[RFC] ethdev: fix UAF in rte_eth_tx_burst with filtering

Andrew Rybchenko andrew.rybchenko at oktetlabs.ru
Mon Nov 10 10:02:15 CET 2025


On 11/10/25 11:34, Morten Brørup wrote:
> +TO: Konstantin, regarding Stage Ordered Ring.
> 
>> From: Andrew Rybchenko [mailto:andrew.rybchenko at oktetlabs.ru]
>> Sent: Sunday, 9 November 2025 16.23
>>
>> If I remember correctly we should avoid function names in summary.
>>
>> (I'm really surprised that callbacks and rte_eth_tx_burst() are
>> supposed to reorder mbufs in tx_pkts array, but it looks like it
>> the only sensible option.)
> 
> Reordering the packets is incompatible with the Stage Ordered Ring.
> When the transmit function returns e.g. 2 non-consumed packets, the metadata will be associated with the 2 last packets in the original burst passed to the transmit function, not the 2 packets moved to the last position in the array.

Good point.

> 
>>
>> On 10/30/25 02:54, Stephen Hemminger wrote:
>>> When packets are filtered by Tx callback, they return value
>>> needs to include those packets; otherwise the application will
>>> do a duplicate free of those mbufs.
>>>
>>> The example would be application that sent a burst of 8 packets,
>>> and 2 were filtered but the PMD was busy and could not send any.
>>> The result should be a return value of 2 and mbufs 0..1 should
>>> be freed and 2..7 should be untouched and up to application to
>>> resend or drop.
>>>
>>> Original buggy code would return 0 and 0..1 would be freed
>>> 2..7 would be unused. Application would then try and drop
>>> or resend an already freed mbuf.
>>>
>>> Bugzilla ID: 1816
>>> Fixes: 4dc294158cac ("ethdev: support optional Rx and Tx callbacks")
>>> Cc: bruce.richardson at intel.com
>>> Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
>>> ---
>>>    lib/ethdev/rte_ethdev.h | 47 ++++++++++++++++++++++++++++++--------
>> ---
>>>    1 file changed, 35 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>>> index a66c2abbdb..d2cfe390f9 100644
>>> --- a/lib/ethdev/rte_ethdev.h
>>> +++ b/lib/ethdev/rte_ethdev.h
>>> @@ -6714,6 +6714,10 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
>> queue_id,
>>>    	}
>>>    #endif
>>>
>>> +	uint16_t requested_pkts = nb_pkts;
>>> +	uint16_t filtered_pkts = 0;
>>> +	rte_mbuf_history_mark_bulk(tx_pkts, nb_pkts,
>> RTE_MBUF_HISTORY_OP_TX);
>>> +
>>>    #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>>>    	{
>>>    		void *cb;
>>> @@ -6727,22 +6731,41 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
>> queue_id,
>>>    		cb = rte_atomic_load_explicit(&p->txq.clbk[queue_id],
>>>    				rte_memory_order_relaxed);
>>>    		if (unlikely(cb != NULL))
>>> -			nb_pkts = rte_eth_call_tx_callbacks(port_id,
>> queue_id,
>>> -					tx_pkts, nb_pkts, cb);
>>> +			filtered_pkts = nb_pkts -
>>> +				rte_eth_call_tx_callbacks(port_id, queue_id,
>> tx_pkts, nb_pkts, cb);
>>>    	}
>>>    #endif
>>>
>>> -	uint16_t requested_pkts = nb_pkts;
>>> -	rte_mbuf_history_mark_bulk(tx_pkts, nb_pkts,
>> RTE_MBUF_HISTORY_OP_TX);
>>> -
>>> -	nb_pkts = p->tx_pkt_burst(qd, tx_pkts, nb_pkts);
>>> -
>>> -	if (requested_pkts > nb_pkts)
>>> -		rte_mbuf_history_mark_bulk(tx_pkts + nb_pkts,
>>> -				requested_pkts - nb_pkts,
>> RTE_MBUF_HISTORY_OP_TX_BUSY);
>>> +	uint16_t sent_pkts = p->tx_pkt_burst(qd, tx_pkts, nb_pkts -
>> filtered_pkts);
>>> +	uint16_t result_pkts = sent_pkts + filtered_pkts;
>>> +
>>> +	if (unlikely(result_pkts < requested_pkts)) {
>>
>> First of all unlikely() is wrong here. It is not an error path. It is
>> OK
>> when not everything is sent and it should not cost too much as unlikely
>> does.
>> Second, I'd remove the condition completely. It is not a big deal to
>> calculate unsent_pkts and call trace (if enabled) with zero
>> unsent_pkts.
>>
>>> +		uint16_t unsent_pkts = requested_pkts - result_pkts;
>>> +
>>> +		if (unlikely(filtered_pkts > 0)) {
>>
>> Same here for unlikely().
>> Also since filtered_pkts appear in RTE_ETHDEV_RXTX_CALLBACKS case,
>> I'd put it under corresponding conditional compilation to be
>> consistent.
>> Yes, code with conditional compilation is hard to read, but consistency
>> is more important here.
>>
>>> +			/* Need to reshuffle packet list so that unsent
>> packets are
>>> +			 * after the return value
>>> +			 *
>>> +			 * Original: MMMMMM
>>> +			 * Filter:   MMMMMF filtered = 1
>>> +			 * Tx_burst: SSSUUF sent = 3 unsent = 2
>>> +			 *
>>> +			 * Return:   SSSFUU
>>> +			 *   return from tx_burst is 4
>>> +			 */
>>> +			memmove(tx_pkts + result_pkts, tx_pkts + sent_pkts,
>>> +				unsent_pkts * sizeof(struct rte_mbuf *));
>>> +
>>> +			/* Tx filter process does not drop the packets until
>> here */
>>> +			rte_pktmbuf_free_bulk(tx_pkts + sent_pkts,
>> filtered_pkts);
> 
> NAK.
> If the callback consumes the packet, the callback must free it too. The callback could consume the packet for other purposes, e.g. redirection, so freeing it here would introduce use-after-free bugs.

Good point as well. However, it still requires reordering which has
problem mentioned above.

> 
>>> +		}
>>> +
>>> +		rte_mbuf_history_mark_bulk(tx_pkts + result_pkts,
>> unsent_pkts,
>>> +					   RTE_MBUF_HISTORY_OP_TX_BUSY);
>>> +	}
>>>
>>> -	rte_ethdev_trace_tx_burst(port_id, queue_id, (void **)tx_pkts,
>> nb_pkts);
>>> -	return nb_pkts;
>>> +	rte_ethdev_trace_tx_burst(port_id, queue_id, (void **)tx_pkts,
>> result_pkts);
>>> +	return result_pkts;
>>>    }
>>>
>>>    /**
> 
> IMO, letting a Tx callback change the number of packets to actually transmit has too many side effects.
> E.g. port packet counters are not updated accordingly if a callback consumes a packet and silently drops it.
> 
> The Ethdev Tx API is not designed for the ethdev layer to handle any of retransmit-or-drop (when insufficient descriptors), redirect, mirror, drop (redirect/mirror/drop of all packets or selectively based on e.g. VLAN or BPF filter).
> 
> IMO, we should update the callback API to reflect this, by changing the callback return type to void.
> 



More information about the dev mailing list