dpdk Tx falling short
Ivan Malov
ivan.malov at arknetworks.am
Fri Jul 4 13:44:50 CEST 2025
Hi Ed,
You say there is only one mempool. Why?
Have you tried using dedicated mempools, one per port pair (1, 2) and (3, 4)?
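For example, something along these lines (just a sketch; the pool names, the mbuf count and the cache size are placeholders, not taken from your setup):

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_errno.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Sketch: one mempool per port pair, both allocated on the NIC's NUMA
 * node (1). The mbuf count below is a placeholder. */
static struct rte_mempool *
create_pair_pool(const char *name, int socket_id)
{
        struct rte_mempool *mp;

        mp = rte_pktmbuf_pool_create(name,
                                     320 * 1024,   /* nb mbufs (placeholder) */
                                     512,          /* per-lcore cache size   */
                                     0,            /* private area size      */
                                     9216 + RTE_PKTMBUF_HEADROOM, /* data room */
                                     socket_id);
        if (mp == NULL)
                rte_exit(EXIT_FAILURE, "pool %s: %s\n",
                         name, rte_strerror(rte_errno));
        return mp;
}

/* e.g. at init time:
 *   struct rte_mempool *mp_p1_p2 = create_pair_pool("mp_p1_p2", 1);
 *   struct rte_mempool *mp_p3_p4 = create_pair_pool("mp_p3_p4", 1);
 */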
Thank you.
On Thu, 3 Jul 2025, Lombardo, Ed wrote:
>
> Hi,
>
> I have run out of ideas and thought I would reach out to the dpdk community.
>
>
>
> I have a Sapphire Rapids dual-CPU server and one E810 (I also tried an X710); both are 4x10G NICs. When our application pipeline's final stage enqueues mbufs into the Tx ring, I expect
> rte_ring_dequeue_burst() to pull the mbufs from the Tx ring and rte_eth_tx_burst() to transmit them at line rate. What I see is that when one interface is receiving 64-byte UDP-in-IPv4
> traffic, receive and transmit run at line rate (i.e. packets in one port and out another port of the NIC at 14.9 Mpps).
>
> When I turn on a second receive port, both transmit ports of the NIC show Tx performance dropping to 5 Mpps. The Tx ring fills faster than the Tx thread can dequeue and transmit
> mbufs.
>
>
>
> Packets arrive on ports 1 and 3 in my test setup. The NIC is on NUMA node 1. Hugepage memory (6 GB, 1 GB page size) is on NUMA node 1. The mbuf size is 9 KB.
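The NIC, the hugepages and the lcore all being on node 1 sounds right. If you want to double-check the layout at run time, a rough sketch like the following prints the socket of each port, of the Tx lcore and of the mempool (the function name and parameters are mine, purely illustrative):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mempool.h>

/* Print the NUMA socket of each port, of the Tx lcore and of the mempool,
 * so any cross-socket mismatch shows up at startup. */
static void
dump_numa_layout(const uint16_t *ports, unsigned int nb_ports,
                 unsigned int tx_lcore, const struct rte_mempool *mp)
{
        unsigned int i;

        for (i = 0; i < nb_ports; i++)
                printf("port %u -> socket %d\n",
                       ports[i], rte_eth_dev_socket_id(ports[i]));

        printf("lcore %u -> socket %u\n",
               tx_lcore, rte_lcore_to_socket_id(tx_lcore));
        printf("mempool %s -> socket %d\n", mp->name, mp->socket_id);
}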
>
>
>
> Rx Port 1 -> Tx Port 2
>
> Rx Port 3 -> Tx Port 4
>
>
>
> I monitor the mbufs available and they are:
>
> *** DPDK Mempool Configuration ***
>
> Number Sockets : 1
>
> Memory/Socket GB : 6
>
> Hugepage Size MB : 1024
>
> Overhead/socket MB : 512
>
> Usable mem/socket MB: 5629
>
> mbuf size Bytes : 9216
>
> nb mbufs per socket : 640455
>
> total nb mbufs : 640455
>
> hugepages/socket GB : 6
>
> mempool cache size : 512
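(The arithmetic above checks out: 5629 MB of usable memory divided by 9216-byte mbufs is 5629 * 1024 * 1024 / 9216 ≈ 640,455 mbufs. So all four ports and all 16 rings draw from this single ~5.6 GB pool.)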
>
>
>
> *** DPDK EAL args ***
>
> EAL lcore arg : -l 36 <<< NUMA Node 1
>
> EAL socket-mem arg : --socket-mem=0,6144
>
>
>
> There are 16 rings in this configuration, all the same size (16384 * 8), and there is one mempool.
>
>
>
> The Tx rings are created as single-producer / single-consumer (SP/SC).
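(I assume that boils down to something like the following; the ring name is a placeholder:)

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ring.h>

/* Sketch: one SP/SC Tx ring per port, on the NIC's NUMA node.
 * 16384 * 8 = 131072 entries, matching the size quoted above. */
static struct rte_ring *
create_tx_ring(const char *name, int socket_id)
{
        struct rte_ring *r = rte_ring_create(name, 16384 * 8, socket_id,
                                             RING_F_SP_ENQ | RING_F_SC_DEQ);
        if (r == NULL)
                rte_exit(EXIT_FAILURE, "cannot create Tx ring %s\n", name);
        return r;
}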
>
>
>
> There is one Tx thread per NIC port, whose only task is to dequeue mbufs from the Tx ring and call rte_eth_tx_burst() to transmit them. The dequeue burst size is 512 and the Tx
> burst size is equal to or less than 512. rte_eth_tx_burst() never returns less than the burst size given.
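Out of curiosity, does the Tx loop look roughly like the sketch below, and what does it do with packets that rte_eth_tx_burst() does not accept? The function name and the drop-on-full policy here are just my illustrative assumptions:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define TX_BURST 512

/* Sketch of one iteration of the Tx thread: pull up to 512 mbufs from
 * the Tx ring and push them to the port, retrying the unsent tail. */
static void
tx_drain_once(struct rte_ring *txr, uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *pkts[TX_BURST];
        unsigned int nb_deq, sent = 0;

        nb_deq = rte_ring_dequeue_burst(txr, (void **)pkts, TX_BURST, NULL);
        if (nb_deq == 0)
                return;

        while (sent < nb_deq) {
                uint16_t n = rte_eth_tx_burst(port_id, queue_id,
                                              pkts + sent, nb_deq - sent);
                if (n == 0) {
                        /* Tx queue full: drop the rest (or retry / back off). */
                        rte_pktmbuf_free_bulk(pkts + sent, nb_deq - sent);
                        break;
                }
                sent += n;
        }
}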
>
>
>
> Each Tx thread is on a dedicated CPU core and its sibling is unused.
>
> We use CPU shielding to keep non-critical threads off the CPUs reserved for the Tx threads. htop shows the Tx threads are the only threads using the carved-out CPUs.
>
>
>
> The Tx thread uses rte_ring_dequeue_burst() to get a burst of up to 512 mbufs.
>
> I added debug counters to track how many rte_ring_dequeue_burst() calls on the Tx ring return exactly 512 mbufs and how many return fewer. The dequeue from the
> Tx ring always returns 512, never fewer.
>
>
>
>
>
> Note: if I skip rte_eth_tx_burst() in the Tx threads and just dequeue the mbufs and bulk-free them, I do not see the Tx ring fill up, i.e., the thread is able to free
> the mbufs faster than they arrive on the Tx ring.
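(That baseline presumably reduces to something like the sketch below, which exercises only the ring and the mempool and bypasses the PMD entirely:)

#include <rte_mbuf.h>
#include <rte_ring.h>

#define TX_BURST 512

/* Sketch of the free-only baseline: ring and mempool are exercised,
 * but no descriptors are handed to the NIC. */
static void
tx_free_only_once(struct rte_ring *txr)
{
        struct rte_mbuf *pkts[TX_BURST];
        unsigned int nb_deq;

        nb_deq = rte_ring_dequeue_burst(txr, (void **)pkts, TX_BURST, NULL);
        if (nb_deq > 0)
                rte_pktmbuf_free_bulk(pkts, nb_deq);
}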
>
>
>
> So I suspect that rte_eth_tx_burst() is the bottleneck to investigate, which involves the inner workings of DPDK and the Intel NIC architecture.
>
>
>
>
>
>
>
> Any help to resolve my issue is greatly appreciated.
>
>
>
> Thanks,
>
> Ed
>