[dpdk-users] eventdev performance

Anthony Hart ahart at domainhart.com
Thu Aug 9 17:56:07 CEST 2018


Hi Harry,
Thanks for the reply; please see responses inline.

> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry <harry.van.haaren at intel.com> wrote:
> 
> Hi Tony,
> 
>> -----Original Message-----
>> From: users [mailto:users-bounces at dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users at dpdk.org
>> Subject: [dpdk-users] eventdev performance
>> 
>> I’ve been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - dpdk 18.05)
>> and I’m looking for some help in determining where the bottlenecks are in my
>> testing.
> 
> If you have the "perf top" tool available, it is very useful in printing statistics
> of where CPU cycles are spent during runtime. I use it regularly to identify
> bottlenecks in the code for specific lcores.

Yes, I have perf; if there is something you'd like to see I can post it.
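
For reference, this is roughly what I would run (pinning the samples to the scheduler core, which should be core 3 given the -e8 mask in my command line below):

perf top -C 3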

> 
> 
>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).   In
>> this configuration performance tops out with 3 workers (6 cores total) and
>> adding more workers actually causes a reduction in throughput.   In my setup
>> this is about 12Mpps.   The same setup running testpmd will reach >25Mpps
>> using only 1 core.
> 
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet, hence less mpps.

ok.  

> 
> 
>> This is the eventdev command line.
>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -
>> w70 -s1 -n0 -c128 -W0 -D
> 
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use-case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and performance of certain things) is not expected to be used in production.
> 
> I'm not sure how familiar you are with CPU caches, but it is worth understanding
> that reading this "locally" from L1 or L2 cache is very fast compared to
> communicating with another core.
> 
> Given that with -W0 the worker cores are very fast, the scheduler can rarely
> read data locally - it always has to communicate with other cores.
> 
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
> per event mimic doing actual work on each event. 

Adding work with -W reduces performance.

I modified eventdev_pipeline to print the contents of rte_event_eth_rx_adapter_stats for the device, in particular the rx_enq_retry and rx_poll_count values for the receive thread. Once I get to a load level where packets are dropped, I see that the number of retries equals or exceeds the poll count (and as I increase the load, the retries increasingly exceed the poll count).
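
For what it's worth, the modification is roughly the sketch below (the helper name and where it gets called from are just my local hack, not part of the stock sample):

#include <stdio.h>
#include <inttypes.h>
#include <rte_event_eth_rx_adapter.h>

/* Called periodically from the rx thread: fetch the adapter stats and
 * print the two counters I am watching, so the enqueue-retry versus
 * poll-count ratio can be tracked as the load goes up. */
static void
print_rx_adapter_stats(uint8_t adapter_id)
{
	struct rte_event_eth_rx_adapter_stats stats;

	if (rte_event_eth_rx_adapter_stats_get(adapter_id, &stats) != 0)
		return;

	printf("rx_poll_count=%" PRIu64 " rx_enq_retry=%" PRIu64 "\n",
	       stats.rx_poll_count, stats.rx_enq_retry);
}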

I think this indicates that the scheduler is not keeping up. That could be (I assume) because the workers are not consuming fast enough. However, if I increase the number of workers, the ratio of retries to poll count (in the rx thread) goes up; for example, with 4 more workers the retries:poll ratio becomes 5:1.

This seems to indicate that the scheduler is the bottleneck?
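
To confirm it, I can also dump the sw PMD's internal counters (scheduler calls, credits, per-port and per-queue stats); if I read the sample right, the -D flag in my command line already produces this at exit via something like:

#include <stdio.h>
#include <rte_eventdev.h>

/* Dump event_sw0's internal statistics to stdout; device id 0 is
 * assumed here since there is a single sw_event0 device. */
static void
dump_sched_stats(void)
{
	rte_event_dev_dump(0, stdout);
}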


> 
> 
>> This is the testpmd command line.
>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --
>> port-topology=loop
>> 
>> 
>> I’m guessing that its either the RX or Sched that’s the bottleneck in my
>> eventdev_pipeline setup.
> 
> Given that you state testpmd is capable of forwarding at >25 mpps on your
> platform it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
> 
> Which leaves the scheduler - and indeed the scheduler is probably what is
> the limiting factor in this case.

yes seems so.

> 
> 
>> So I first tried to use 2 cores for RX (-r6), performance went down.   It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
>> access to that one ring is alternated between the two cores?    So that
>> doesn’t help.
> 
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race-conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a performance
> impact on doing so. Using two RX threads on a single RX queue is generally
> not recommended.
> 
> 
>> Next, I could use 2 scheduler cores,  but how does that work, do they again
>> alternate?   In any case throughput is reduced by 50% in that test.
> 
> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
> to run it at the same time, and hence the serialization is in place to ensure
> that the results are valid.
> 
> 
>> thanks for any insights,
>> tony
> 
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck, and place some more on
> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
> 
> Apart from that, I should try to understand your intended use better.
> Is this an academic investigation into the performance, or do you have
> specific goals in mind? Is dynamic load-balancing as the event_sw provides
> required, or would a simpler (and hence possibly more performant) method suffice?
> 

Our current app uses the standard testpmd style where each core does rx->work->tx, with packets spread across the cores using RSS in the ethernet device. This works fine provided the traffic is diverse. Elephant flows are a problem though, so we'd like the option of distributing packets the way eventdev_pipeline -p does (yes, I understand the reordering implications). That makes eventdev interesting, and I was trying to get an idea of the performance implications of using it.
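
For context, the per-core loop in our current app is essentially the usual run-to-completion pattern sketched below (the queue numbering, burst size and do_work() are placeholders, not our real code):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Placeholder for the real packet processing. */
static inline void do_work(struct rte_mbuf *m) { (void)m; }

/* Each lcore owns one RSS rx queue and one tx queue on the same port,
 * so there is no cross-core hand-off, and also nothing to stop one
 * elephant flow from keeping a single core saturated. */
static void
lcore_main(uint16_t port, uint16_t queue)
{
	struct rte_mbuf *bufs[BURST_SIZE];

	for (;;) {
		uint16_t nb_rx = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);
		if (nb_rx == 0)
			continue;
		for (uint16_t i = 0; i < nb_rx; i++)
			do_work(bufs[i]);
		uint16_t nb_tx = rte_eth_tx_burst(port, queue, bufs, nb_rx);
		for (uint16_t i = nb_tx; i < nb_rx; i++)
			rte_pktmbuf_free(bufs[i]); /* drop what tx did not take */
	}
}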



> Regards, -Harry


