[dpdk-dev] Application used for DSW event_dev performance testing

Venky Venkatesh vvenkatesh at paloaltonetworks.com
Tue Nov 27 23:33:41 CET 2018


On 11/14/18, 9:46 PM, "Mattias Rönnblom" <mattias.ronnblom at ericsson.com> wrote:
>
>
>     On 2018-11-14 22:56, Venky Venkatesh wrote:
>     > Mattias,
>     > Thanks for the prompt response. Appreciate your situation of not being able to share the proprietary code. More answers inline as [VV]:
>     > --Venky
>     >
>     > On 11/14/18, 11:41 AM, "Mattias Rönnblom" <hofors at lysator.liu.se> wrote:
>     >
>     >      On 2018-11-14 20:16, Venky Venkatesh wrote:
>     >      > Hi,
>     >      >
>     >      > https://mails.dpdk.org/archives/dev/2018-September/111344.html mentions that there is a sample application where “worker cores can sustain 300-400 million event/s. With a pipeline
>     >      > with 1000 clock cycles of work per stage, the average event device
>     >      > overhead is somewhere 50-150 clock cycles/event”. Is this sample application code available?
>     >      >
>     >      It's proprietary code, although it's also been tested by some of our
>     >      partners.
>     >
>     >      The primary reason for it not being contributed to DPDK is because it's
>     >      a fair amount of work to do so. I would refer to it as an eventdev
>     >      pipeline simulator, rather than a sample app.
>     >
>     >      > We have written a similar simple sample application where 1 core keeps enqueuing (as NEW/ATOMIC) and n-cores dequeue (and RELEASE) and do no other work. But we are not seeing anything close in terms of performance. Also we are seeing some counter intuitive behaviors such as a burst of 32 is worse than burst of 1. We surely have something wrong and would thus compare against a good application that you have written. Could you pls share it?
>     >      >
>     >
>     >      Is this enqueue or dequeue burst? How large is n? Is this explicit release?
>     >   [VV]: Yes both are burst of 32. I tried n=4-7. It is explicit RELEASE.
>     >
>
>     If you want good scheduler throughput, don't do explicit release. With
>     other event devices, and heavy-weight pipelines, there might be a point
>     to do so because the released event's flow could potentially be
>     scheduled on other cores. However, on DSW migration won't happen until
>     the application has finished processing its burst, at the earliest.
>
I am getting ~25M events/sec with a single dequeuing core (i.e. n=1) and no additional work after dequeue. When I introduce some work after the dequeue, throughput falls steeply to 1M events/sec. Profiling seems to indicate that all the time is spent in our work, even though the work is only the following loop (all of these are stack variables):
                for (k = 0; k < n1; k++)
                    for (j = 0; j < n2; j++)
                        temp += access[j];
Profiling also indicates that it is not memory bound. 

The bigger surprise came when I moved to multiple cores (i.e. n > 1). With n=2-5 I expected to do better than the 1M events/sec of a single core. Instead I measured 0.53M, 0.32M, 0.20M and 0.13M events/sec for n=2, 3, 4 and 5 respectively, so throughput drops steeply as cores are added.
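
As a rough sanity check on these figures, the rates can be converted into core clock cycles spent per event. The sketch below assumes the 2.1 GHz clock mentioned later in this message, ignores turbo and hyper-threading effects, and uses a hypothetical helper name (cycles_per_event) that is not part of DPDK:

```c
#include <assert.h>

/*
 * Hypothetical helper (not a DPDK API): core clock cycles consumed per
 * event, summed over all worker cores, for a measured aggregate rate.
 * Assumes every worker core runs at the same fixed clock.
 */
static double cycles_per_event(double core_hz, unsigned int n_cores,
                               double events_per_sec)
{
    assert(n_cores > 0 && events_per_sec > 0.0);
    return core_hz * (double)n_cores / events_per_sec;
}
```

With core_hz = 2.1e9 and one core, 25M events/sec works out to 84 cycles/event and 1M events/sec to 2100 cycles/event. For the multi-core runs, n_cores would be the number of dequeuing cores and events_per_sec the aggregate rate.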
BTW, I have removed the explicit RELEASE enqueue. All I do in the worker_loop function is the following:

        int dev_id = 0, i,j, k;
        struct rte_event ev[BURST];
        int access[1024], temp;
        int n_evs = rte_event_dequeue_burst(dev_id, port, ev, BURST, 0);
        for (i = 0; i < n_evs; i++) {
            /* do work for the event, set ev to next eventdev queue */
            switch (ev[i].queue_id) {
            case DEMO_STAGE_TX:
                for (k = 0; k < n1; k++)
                    for (j = 0; j < n2; j++)
                        temp += access[j];
                ev[i].op = RTE_EVENT_OP_RELEASE; /* though I don't call event_enqueue after this */
                break;
            default:
                printf("invalid q_id:%d\n", ev[i].queue_id);
                break;
            }
        }

The amount of work is parameterized by n1 and n2. The inner loop reads up to 4K bytes of access[] contiguously, and the outer loop repeats that read n1 times. I am using n1=10 and n2=1024.
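
One thing that may be worth ruling out: in the loop as posted, temp and access[] are never initialized and temp is never read afterwards, so a compiler is free to change or drop the work. A minimal sketch of a self-contained variant follows. The names do_work and work_sink are hypothetical and do not appear in the program above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACCESS_LEN 1024 /* same array size as access[] in worker_loop */

/* Hypothetical sink; the volatile store keeps the computed sum observable. */
static volatile uint64_t work_sink;

/*
 * Sum the first n2 entries of access[], n1 times over. The caller must
 * initialize access[] first. An optimizer may still simplify the loops,
 * so this only guarantees that the result is computed, not how.
 */
static uint64_t do_work(const int *access, size_t n1, size_t n2)
{
    uint64_t sum = 0;

    assert(n2 <= ACCESS_LEN);
    for (size_t k = 0; k < n1; k++)
        for (size_t j = 0; j < n2; j++)
            sum += (uint64_t)access[j];

    work_sink = sum;
    return sum;
}
```

Measuring the per-call cost of such a function outside the eventdev loop would give a baseline to compare the in-pipeline numbers against.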

However, profiling with multiple cores shows that a lot of time is being spent in DSW-related work. Specifically:
dsw_port_transmit_buffered: 25%
dsw_port_flush_out_buffers: 9.8%
dsw_port_ctl_process: 6.3%
dsw_port_consider_migration: 6.3%
dsw_event_dequeue_burst: 6.3%
Real worker: 15%
There are a whole bunch of other dsw things taking about 3% each.

As you can see the DSW overhead dominates the scene and very little real work is getting done. Is there some configuration or tuning to be done to get the sort of performance you are seeing with multiple cores?

One consistent observation however is that the dev_credits_on_loan is pretty close to new_event_threshold of 4K.
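
This comparison can be made explicit. The sketch below classifies pipeline occupancy from the dev_credits_on_loan value and the port's new_event_threshold, following the rule of thumb quoted further down (close to the threshold means busy, a couple of hundred means not enough events are being fed in). The function name and the 90% and 256 cut-offs are assumptions for illustration, not DSW constants:

```c
#include <assert.h>
#include <stdint.h>

enum pipeline_load { PIPELINE_STARVED, PIPELINE_PARTIAL, PIPELINE_BUSY };

/*
 * Hypothetical classifier (not a DPDK API). credits_on_loan is the value
 * read from the event device statistics; the thresholds are illustrative.
 */
static enum pipeline_load classify_load(uint64_t credits_on_loan,
                                        uint64_t new_event_threshold)
{
    assert(new_event_threshold > 0);

    /* At or above 90% of the threshold: treat the pipeline as busy. */
    if (credits_on_loan * 10 >= new_event_threshold * 9)
        return PIPELINE_BUSY;
    /* A couple of hundred in-flight events or fewer: likely starved. */
    if (credits_on_loan <= 256)
        return PIPELINE_STARVED;
    return PIPELINE_PARTIAL;
}
```

A reading of 4000 against a threshold of 4096 classifies as PIPELINE_BUSY, which is consistent with the producer not being the bottleneck here.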

>     DSW does buffer on enqueue, so large enqueue bursts don't improve
>     performance much. They should not decrease performance, unless you go
>     above the configured max burst.
>
>     >      What do you set nb_events_limit to? Good DSW performance much depends on
>     >      the average burst size on the event rings, which in turn is dependent on
>     >      the number of in-flight events. On really high core-count systems you
>     >      might also want to increase DSW_MAX_PORT_OPS_PER_BG_TASK, since it
>     >      effectively puts a limit on the maximum number of events buffered on the
>     >      output buffers.
>     > [VV]:         struct rte_event_dev_config config = {
>     >                          .nb_event_queues = 2,
>     >                          .nb_event_ports = 5,
>     >                          .nb_events_limit  = 4096,
>     >                          .nb_event_queue_flows = 1024,
>     >                          .nb_event_port_dequeue_depth = 128,
>     >                          .nb_event_port_enqueue_depth = 128,
>     >          };
>     >          struct rte_event_port_conf p_conf = {
>     >                          .dequeue_depth = 64,
>     >                          .enqueue_depth = 64,
>     >                          .new_event_threshold = 1024,
>
>     "new_event_threshold" effectively puts a limit on the number of inflight
>     events. You should increase this to something close to "nb_events_limit".
>
>     >                          .disable_implicit_release = 0,
>     >          };
>     >          struct rte_event_queue_conf q_conf = {
>     >                          .schedule_type = RTE_SCHED_TYPE_ATOMIC,
>     >                          .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
>     >                          .nb_atomic_flows = 1024,
>     >                          .nb_atomic_order_sequences = 1024,
>     >          };
>
>     >
>     >      In the pipeline simulator all cores produce events initially, and then
>     >      recycle events when the number of in-flight events reaches a certain
>     >      threshold (50% of nb_events_limit). A single lcore won't be able to fill
>     >      the pipeline, if you have zero-work stages.
>     > [VV]: I have a single NEW event enqueue thread(0) and a bunch of “dequeue and RELEASE” threads (1-4) – simple case. I have a stats print thread(5) as well. If the 1 enqueue thread is unable to fill the pipeline, what counter would indicate that? I see the contrary effect -- I am tracking the number of times enqueue fails and that number is large.
>     >
>     >
>     There's no counter for failed enqueues, although maybe there should be.
>     "dev_credits_on_loan" can be seen as an estimate of how many events are
>     currently inflight in the scheduler. If this number is close to your
>     "new_event_threshold", the pipeline is busy. If it's low, in the
>     couple-of-hundreds range, your pipeline is likely not-so-busy (even
>     idle) because not enough events are being fed into it.
>
>     You can obviously detect failed NEW enqueues in the application as well.
>
>     I'm not sure exactly how much one core can produce, and it obviously
>     depends on what kind of core, but it's certainly a lot lower than
>     "300-400 millions events/s". Maybe something like 40-50 Mevents/s.
>
>     What is your flow id distribution? As in, how many flow ids are you
>     actively using in the events you are feeding the different
>     queues/pipeline stages?
>

I am using a 2.1 GHz Xeon Silver with 16 cores and hyper-threading (so 32 threads). As indicated above, the performance isn't increasing with cores. I am using 10000 flow ids, generated with rte_rand()%10000.

>     >      Even though I can't send you the simulator code at this point, I'm happy
>     >      to assist you in any DSW-related endeavors.
>     > [VV]: My program is a simple enough program (nothing proprietary) that I can share. Can I unicast it to you for a quick recommendation?
>     >
>
>     Sure, although I prefer to have any discussions on the mailing list, so
>     other users can learn from your experiences.
>
>     Btw, you really need to get a proper mail user agent, or configure the
>     one you have to quote messages as per normal convention.
> 


