[dpdk-dev] Rx-errors with testpmd (only 75% line rate)

François-Frédéric Ozog ff at ozog.com
Fri Jan 24 10:18:05 CET 2014

> -----Message d'origine-----
> De : dev [mailto:dev-bounces at dpdk.org] De la part de Michael Quicquaro
> Envoyé : vendredi 24 janvier 2014 00:23
> À : Robert Sanford
> Cc : dev at dpdk.org; mayhan at mayhan.org
> Objet : Re: [dpdk-dev] Rx-errors with testpmd (only 75% line rate)
> Thank you, everyone, for all of your suggestions, but unfortunately I'm
> still having the problem.
> I have reduced the test down to using 2 cores (one is the master core)
> of which are on the socket in which the NIC's PCI slot is connected.  I am
> running in rxonly mode, so I am basically just counting the packets.  I've
> tried all different burst sizes.  Nothing seems to make any difference.
> Since my original post, I have acquired an IXIA tester so I have better
> control over my testing.   I send 250,000,000 packets to the interface.  I
> am getting roughly 25,000,000 Rx-errors with every run.  I have verified
> that the number of Rx-errors is consistent in the value in the RXMPC of
> NIC.
> Just for sanity's sake, I tried switching the cores to the other socket
> run the same test.  As expected I got more packet loss.  Roughly
> I am running Red Hat 6.4 which uses kernel 2.6.32-358
> This is a numa supported system, but whether or not I use --numa doesn't
> seem to make a difference.

Is the BIOS configured NUMA? If not, the BIOS may program System Address
Decoding so that memory address space is interleaved between sockets on 64MB
boundaries (you may have a look at Xeon 7500 datasheet volume 2 - a public
document - §4.4 for an "explanation" of this). 

In general you don't want memory interleaving: QPI bandwidth tops at 16GBps
on the latest processors while single node aggregated memory bandwidth can
be over 60GB/s.

> Looking at the Intel documentation it appears that I should be able to
> easily do what I am trying to do.  Actually, the documentation infers that
> I should be able to do roughly 40 Gbps with a single 2.x GHz processor
> with other configuration (memory, os, etc.) similar to my system.  It
> appears to me that much of the details of these benchmarks are missing.
> Can someone on this list actually verify for me that what I am trying to
> is possible and that they have done it with success?

I have done a NAT64 proof of concept that handled 40Gbps throughput on a
single Xeon E5 2697v2.
Intel NIC chip was 82599ES (if I recall correctly, I don't have the card
handy anymore), 4 rx queues 4 tx queues per port, 32768 descriptors per
queue, Intel DCA on, Ethernet pause parameters OFF: 14.8Mpps per port, no
packet loss.
However this was with a kernel based proprietary packet framework. I expect
DPDK to achieve the same results.

> Much appreciation for all the help.
> - Michael
> On Wed, Jan 22, 2014 at 3:38 PM, Robert Sanford
> <rsanford at prolexic.com>wrote:
> > Hi Michael,
> >
> > > What can I do to trace down this problem?
> >
> > May I suggest that you try to be more selective in the core masks on
> > the command line. The test app may choose some cores from "other" CPU
> sockets.
> > Only enable cores of the one socket to which the NIC is attached.
> >
> >
> > > It seems very similar to a
> > > thread on this list back in May titled "Best example for showing
> > > throughput?" where no resolution was ever mentioned in the thread.
> >
> > After re-reading *that* thread, it appears that their problem may have
> > been trying to achieve ~40 Gbits/s of bandwidth (2 ports x 10 Gb Rx +
> > 2 ports x 10 Gb Tx), plus overhead, over a typical dual-port NIC whose
> > total bus bandwidth is a maximum of 32 Gbits/s (PCI express 2.1 x8).

PCIe is "32Gbps" full duplex, meaning on each direction.
On a single dual port card you have 20Gbps inbound traffic (below 32Gbps)
and 20Gbps outbound traffic (below 32Gbps).

A 10Gbos port running at  10,000,000,000bps (10^10bps, *not* a power of
two). A 64 byte frame (incl. CRC) has preamble, interframe gap... So on the
wire there are 
7+1+64+12=84bytes=672bits. The max packet rate is thus 10^10 / 672 =
14,880,952 pps.

On the PCIexpress side there will be 60 byte (frame excluding CRC)
transferred in a single DMA transaction with additional overhead, plus
8b/10b encoding per packet:
(60 + 8 + 16) = 84 bytes (fits into a 128 byte typical max payload) or 840
'bits' (8b/10b encoding). I 
An 8 lane 5GT/s (GigaTransaction = 5*10^10 "transaction" per second; i.e. a
"bit" every 200picosecond) can be viewed as a 40GT/s link, so we can have
4*10^10/840=47,619,047pps per direction (PCIe is full duplex).

So two fully loaded ports generate 29,761,904pps on each direction which can
be absorbed on the PCIexpress Gen x8 even taking account overhead of DMA

> >
> > --
> > Regards,
> > Robert
> >
> >

More information about the dev mailing list