<div dir="ltr"><div>Hi Ed,</div><div><br></div><div>Did you ran dpdk-testpmd with multiple queue and did you hit line rate. Sapphire rapid is powerful processor, we were able to hit 200Gbps with 14 cores with mellanox CX6 NIC.</div><div><br></div><div>how many core are you using? what is the descriptor size & number of queue ? try playing with that with that.. </div><div><br></div><div>dpdk-testpmd -l 0-36 -a <pci of nic> -- -i -a --nb-cores=35 --txq=14 --rxq=14 --rxd=4096 </div><div><br></div><div>Also try reducing mbuf size to 2K (from the current 9k) and enable jumbo frame support</div><div><br></div><div>try to run "perf top" and see which is taking more time. Also try to cache-align your data-structure.</div><div><br></div><div>struct sample_struct {</div><div> uint32_t a;</div><div> uint64_t b;</div><div>...</div><div>} __rte_cache_aligned;</div><div><br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">Thanks,<br><b>-Rajesh</b><br></div></div></div></div></div><br></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Fri, Jul 4, 2025 at 3:27 AM Stephen Hemminger <<a href="mailto:stephen@networkplumber.org">stephen@networkplumber.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, 3 Jul 2025 20:14:59 +0000<br>
"Lombardo, Ed" <<a href="mailto:Ed.Lombardo@netscout.com" target="_blank">Ed.Lombardo@netscout.com</a>> wrote:<br>
<br>
> > Hi,
> > I have run out of ideas and thought I would reach out to the dpdk community.
> >
> > I have a Sapphire Rapids dual-CPU server and one E810 (also tried X710); both are 4x10G NICs. When our application pipeline's final stage enqueues mbufs into the tx ring, I expect rte_ring_dequeue_burst() to pull the mbufs from the tx ring and rte_eth_tx_burst() to transmit them at line rate. What I see is that with one interface receiving 64-byte UDP-in-IPv4 traffic, receive and transmit run at line rate (i.e. packets in one port and out another port of the NIC at 14.9 Mpps).
> > When I turn on another receive port, both transmit ports of the NIC show Tx performance dropping to 5 Mpps. The Tx ring fills faster than the Tx thread can dequeue and transmit mbufs.
> >
> > Packets arrive on ports 1 and 3 in my test setup. NIC is on NUMA Node 1. Hugepage memory (6GB, 1GB page size) is on NUMA Node 1. The mbuf size is 9KB.
> >
> > Rx Port 1 -> Tx Port 2
> > Rx Port 3 -> Tx port 4
> >
> > I monitor the mbufs available and they are:
> > *** DPDK Mempool Configuration ***
> > Number Sockets      : 1
> > Memory/Socket GB    : 6
> > Hugepage Size MB    : 1024
> > Overhead/socket MB  : 512
> > Usable mem/socket MB: 5629
> > mbuf size Bytes     : 9216
> > nb mbufs per socket : 640455
> > total nb mbufs      : 640455
> > hugepages/socket GB : 6
> > mempool cache size  : 512
> >
> > *** DPDK EAL args ***
> > EAL lcore arg       : -l 36 <<< NUMA Node 1
> > EAL socket-mem arg  : --socket-mem=0,6144
> >
> > The number of rings in this configuration is 16 and all are the same size (16384 * 8), and there is one mempool.
> >
> > The Tx rings are created as SP and SC.
> >
> > There is one Tx thread per NIC port, whose only task is to dequeue mbufs from the tx ring and call rte_eth_tx_burst() to transmit them. The dequeue burst size is 512 and the tx burst is equal to or less than 512. rte_eth_tx_burst() never returns less than the burst size given.
> >
> > Each Tx thread is on a dedicated CPU core and its sibling is unused.
> > We use cpushielding to keep noncritical threads from using these CPUs for Tx threads. HTOP shows the Tx threads are the only threads using the carved-out CPUs.
> >
> > The Tx thread uses rte_ring_dequeue_burst() to get a burst of up to 512 mbufs.
> > I added debug counters to track how many rte_ring_dequeue_burst() calls on the tx ring return exactly 512 mbufs and how many return fewer. The dequeue from the tx ring always returns 512, never fewer.
> >
> > Note: if I skip the rte_eth_tx_burst() in the Tx threads and just dequeue and bulk-free the mbufs from the tx ring, I do not see the tx ring fill up, i.e., the thread can free the mbufs faster than they arrive on the tx ring.
> >
> > So, I suspect that rte_eth_tx_burst() is the bottleneck to investigate, which involves the inner workings of DPDK and the Intel NIC architecture.
> >
> > Any help to resolve my issue is greatly appreciated.
> >
> > Thanks,
> > Ed
>
> Do profiling, and look at the number of cache misses.
> I suspect using an additional ring is causing lots of cache misses.
> Remember going to memory is really slow on modern processors.
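For what it's worth, the Tx-thread pattern Ed describes (dequeue up to 512 mbufs from the tx ring, then hand them to rte_eth_tx_burst()) would look roughly like the sketch below; every name and id here is a placeholder, not Ed's actual code. This inner loop is where I would point "perf top" and the cache-miss counters first.

#include <rte_ring.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define TX_BURST 512

/* Placeholder names throughout; drain one burst from the tx ring. */
static void
tx_drain_once(struct rte_ring *tx_ring, uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *pkts[TX_BURST];
        unsigned int n, sent;

        n = rte_ring_dequeue_burst(tx_ring, (void **)pkts, TX_BURST, NULL);
        if (n == 0)
                return;

        sent = rte_eth_tx_burst(port_id, queue_id, pkts, n);

        /* If the NIC descriptor ring is full, rte_eth_tx_burst() returns
         * fewer than n; the leftovers must be retried or freed, or they
         * leak. */
        while (sent < n)
                rte_pktmbuf_free(pkts[sent++]);
}

If rte_eth_tx_burst() really always accepts the full 512, then the time is going inside the PMD itself (descriptor writes plus freeing mbufs of completed descriptors), which is where cache misses from the extra ring hop and the 9K mbufs would show up, as Stephen suggests.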