[dpdk-users] eventdev performance
Van Haaren, Harry
harry.van.haaren at intel.com
Tue Aug 7 10:34:56 CEST 2018
Hi Tony,
> -----Original Message-----
> From: users [mailto:users-bounces at dpdk.org] On Behalf Of Anthony Hart
> Sent: Sunday, August 5, 2018 8:03 PM
> To: users at dpdk.org
> Subject: [dpdk-users] eventdev performance
>
> I’ve been doing some performance measurements with the eventdev_pipeline
> example application (to see how the eventdev library performs - dpdk 18.05)
> and I’m looking for some help in determining where the bottlenecks are in my
> testing.
If you have the "perf top" tool available, it is very useful for showing where
CPU cycles are spent at runtime. I use it regularly to identify bottlenecks in
the code running on specific lcores.
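For example, something like the following would show what the scheduler lcore
spends its cycles on (here assuming your -e8 coremask below pins the scheduler
to lcore 3, and the default 1:1 lcore-to-CPU mapping):

  perf top -C 3

If most cycles on that core land in the scheduling code, that is a strong hint
the scheduler is the bottleneck.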
> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device). In
> this configuration performance tops out with 3 workers (6 cores total) and
> adding more workers actually causes a reduction in throughput. In my setup
> this is about 12Mpps. The same setup running testpmd will reach >25Mpps
> using only 1 core.
Raw forwarding of a packet is less work than forwarding plus load-balancing
across multiple cores. More work means more CPU cycles spent per packet, hence lower mpps.
> This is the eventdev command line.
> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -
> w70 -s1 -n0 -c128 -W0 -D
The -W0 tells each worker core to perform zero cycles of work per event.
This makes each of the 3 worker cores very fast at returning events to the
scheduler core, and puts extra pressure on the scheduler. Note that in a
real-world use-case you presumably want to do work on each of the worker
cores, so the command above (while valid for understanding how the pipeline
behaves and where its limits are) is not representative of production use.
I'm not sure how familiar you are with CPU caches, but it is worth understanding
that reading data "locally" from the L1 or L2 cache is very fast compared to
communicating with another core.
Given that with -W0 the worker cores return events almost immediately, the
scheduler can rarely work from its local cache - it constantly has to
communicate with the other cores.
Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
per event mimic doing actual work on each event.
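For example, something like this (hypothetical core IDs: one extra lcore in -l
and the worker coremask widened to 0xF0 for a 4th worker - adjust for your
platform):

  eventdev_pipeline -l 0,1-7 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -wF0 -s1 -n0 -c128 -W1000 -D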
> This is the testpmd command line.
> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --
> port-topology=loop
>
>
> I’m guessing that its either the RX or Sched that’s the bottleneck in my
> eventdev_pipeline setup.
Given that you state testpmd is capable of forwarding at >25 mpps on your
platform, it is safe to rule out RX, since testpmd performs the RX itself in
that forwarding workload.
That leaves the scheduler - and indeed the scheduler is probably the
limiting factor in this case.
> So I first tried to use 2 cores for RX (-r6), performance went down. It
> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
> access to that one ring is alternated between the two cores? So that
> doesn’t help.
Correct - it is invalid to use two CPU cores on a single RX queue without
some form of serialization (otherwise it causes race conditions). The
eventdev_pipeline sample app helpfully provides that serialization - but it
comes at a performance cost. Using two RX threads on a single RX queue is
generally not recommended.
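To illustrate why (a minimal sketch, not the actual sample-app code - the
helper name shared_queue_rx is made up here): rte_eth_rx_burst() is not
thread-safe on a single queue, so two RX lcores have to take turns, e.g. via
a spinlock, and that serialization eats most of the gain from the second core.

  /* Sketch only: two lcores sharing one RX queue must serialize access. */
  #include <rte_ethdev.h>
  #include <rte_mbuf.h>
  #include <rte_spinlock.h>

  static rte_spinlock_t rx_lock = RTE_SPINLOCK_INITIALIZER;

  static uint16_t
  shared_queue_rx(uint16_t port, uint16_t queue,
                  struct rte_mbuf **bufs, uint16_t nb)
  {
          uint16_t nb_rx;

          /* Only one lcore may poll the queue at a time, so the second
           * RX core mostly waits here instead of adding throughput. */
          rte_spinlock_lock(&rx_lock);
          nb_rx = rte_eth_rx_burst(port, queue, bufs, nb);
          rte_spinlock_unlock(&rx_lock);
          return nb_rx;
  }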
> Next, I could use 2 scheduler cores, but how does that work, do they again
> alternate? In any case throughput is reduced by 50% in that test.
Yes, for the same reason. The event_sw PMD does not allow multiple threads
to run it concurrently, so serialization is in place to ensure the results
are valid.
> thanks for any insights,
> tony
Try the suggestion above of adding work to the worker cores - this should
"balance out" the current scheduling bottleneck and move some of the load onto
each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
Apart from that, it would help to understand your intended use better.
Is this an academic investigation into the performance, or do you have
specific goals in mind? Is the dynamic load-balancing that event_sw provides
required, or would a simpler (and hence possibly more performant) method suffice?
Regards, -Harry