[dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added sample app

Jerin Jacob jerin.jacob at caviumnetworks.com
Tue Jun 27 11:35:10 CEST 2017
Previous message: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added sample app
Next message: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added sample app
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
-----Original Message-----
> Date: Mon, 26 Jun 2017 15:46:47 +0100
> From: "Hunt, David" <david.hunt at intel.com>
> To: Jerin Jacob <jerin.jacob at caviumnetworks.com>, Harry van Haaren
>  <harry.van.haaren at intel.com>
> CC: dev at dpdk.org, Gage Eads <gage.eads at intel.com>, Bruce Richardson
>  <bruce.richardson at intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added
>  sample app
> User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
>  Thunderbird/45.8.0
> 
> Hi Jerin,

Hi David,

Looks like you have sent the old version. The below mentioned comments
are not addressed in v2.
> 
> I'm assisting Harry on the sample app, and have just pushed up a V2 patch
> based on your feedback. I've addressed most of your suggestions, comments
> below. There may still a couple of outstanding questions that need further
> discussion.

A few general comments:
1) Nikhil/Gage's proposal on ethdev rx to eventdev adapter will change the major
portion(eventdev setup and producer()) of this application
2) Producing one lcore worth of packets, really cant show as example
eventdev application as it will be pretty bad in low-end machine.
At least application infrastructure should not limit.

Considering above points, Should we wait for rx adapter to complete
first? I would like to show this as real world application to use eventdev.

Thoughts?

On the same note:
Can we finalize on rx adapter proposal? I can work on v1 of patch and
common code if Nikhil or Gage don't have bandwidth. Let me know?

last followup:
http://dpdk.org/ml/archives/dev/2017-June/068776.html

> 
> Regards,
> Dave
> 
> On 10/5/2017 3:12 PM, Jerin Jacob wrote:
> > -----Original Message-----
> >> Date: Fri, 21 Apr 2017 10:51:37 +0100
> >> From: Harry van Haaren <harry.van.haaren at intel.com>
> >> To: dev at dpdk.org
> >> CC: jerin.jacob at caviumnetworks.com, Harry van Haaren
> >> <harry.van.haaren at intel.com>, Gage Eads <gage.eads at intel.com>, Bruce
> >>  Richardson <bruce.richardson at intel.com>
> >> Subject: [PATCH 1/3] examples/eventdev_pipeline: added sample app
> >> X-Mailer: git-send-email 2.7.4
> >>
> >> This commit adds a sample app for the eventdev library.
> >> The app has been tested with DPDK 17.05-rc2, hence this
> >> release (or later) is recommended.
> >>
> >> The sample app showcases a pipeline processing use-case,
> >> with event scheduling and processing defined per stage.
> >> The application recieves traffic as normal, with each
> >> packet traversing the pipeline. Once the packet has
> >> been processed by each of the pipeline stages, it is
> >> transmitted again.
> >>
> >> The app provides a framework to utilize cores for a single
> >> role or multiple roles. Examples of roles are the RX core,
> >> TX core, Scheduling core (in the case of the event/sw PMD),
> >> and worker cores.
> >>
> >> Various flags are available to configure numbers of stages,
> >> cycles of work at each stage, type of scheduling, number of
> >> worker cores, queue depths etc. For a full explaination,
> >> please refer to the documentation.
> >>
> >> Signed-off-by: Gage Eads <gage.eads at intel.com>
> >> Signed-off-by: Bruce Richardson <bruce.richardson at intel.com>
> >> Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> >
> > Thanks for the example application to share the SW view.
> > I could make it run on HW after some tweaking(not optimized though)
> >
> > [...]
> >> +#define MAX_NUM_STAGES 8
> >> +#define BATCH_SIZE 16
> >> +#define MAX_NUM_CORE 64
> >
> > How about RTE_MAX_LCORE?
> 
> Core usage in the sample app is held in a uint64_t. Adding arrays would be
> possible, but I feel that the extra effort would not give that much benefit.
> I've left as is for the moment, unless you see any strong requirement to go
> beyond 64 cores?

I think, it is OK. Again with service core infrastructure this will change.

> 
> >
> >> +
> >> +static unsigned int active_cores;
> >> +static unsigned int num_workers;
> >> +static unsigned long num_packets = (1L << 25); /* do ~32M packets */
> >> +static unsigned int num_fids = 512;
> >> +static unsigned int num_priorities = 1;
> >
> > looks like its not used.
> 
> Yes, Removed.
> 
> >
> >> +static unsigned int num_stages = 1;
> >> +static unsigned int worker_cq_depth = 16;
> >> +static int queue_type = RTE_EVENT_QUEUE_CFG_ATOMIC_ONLY;
> >> +static int16_t next_qid[MAX_NUM_STAGES+1] = {-1};
> >> +static int16_t qid[MAX_NUM_STAGES] = {-1};
> >
> > Moving all fastpath related variables under a structure with cache
> > aligned will help.
> 
> I tried a few different combinations of this, and saw no gains, some losses.
> So will leave as is for the moment, if that's OK.

I think, the one are using in fastpath better to allocate from huge page
using rte_malloc()

> 
> >
> >> +static int worker_cycles;
> >> +static int enable_queue_priorities;
> >> +
> >> +struct prod_data {
> >> +    uint8_t dev_id;
> >> +    uint8_t port_id;
> >> +    int32_t qid;
> >> +    unsigned num_nic_ports;
> >> +};
> 
> Yes, saw a percent or two gain when this plus following two data structs
> cache aligned.

looks like it not fixed in v2. Looks like you have sent the old
version.

> 
> >
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int
> >> +producer(void)
> >> +{
> >> +    static uint8_t eth_port;
> >> +    struct rte_mbuf *mbufs[BATCH_SIZE];
> >> +    struct rte_event ev[BATCH_SIZE];
> >> +    uint32_t i, num_ports = prod_data.num_nic_ports;
> >> +    int32_t qid = prod_data.qid;
> >> +    uint8_t dev_id = prod_data.dev_id;
> >> +    uint8_t port_id = prod_data.port_id;
> >> +    uint32_t prio_idx = 0;
> >> +
> >> +    const uint16_t nb_rx = rte_eth_rx_burst(eth_port, 0, mbufs,
> BATCH_SIZE);
> >> +    if (++eth_port == num_ports)
> >> +        eth_port = 0;
> >> +    if (nb_rx == 0) {
> >> +        rte_pause();
> >> +        return 0;
> >> +    }
> >> +
> >> +    for (i = 0; i < nb_rx; i++) {
> >> +        ev[i].flow_id = mbufs[i]->hash.rss;
> >
> > prefetching the buff[i+1] may help here?
> 
> I tried, didn't make much difference.

OK.

> 
> >
> >> +        ev[i].op = RTE_EVENT_OP_NEW;
> >> +        ev[i].sched_type = queue_type;
> >
> > The value of RTE_EVENT_QUEUE_CFG_ORDERED_ONLY != RTE_SCHED_TYPE_ORDERED.
> So, we
> > cannot assign .sched_type as queue_type.
> >
> > I think, one option could be to avoid translation in application is to
> > - Remove RTE_EVENT_QUEUE_CFG_ALL_TYPES, RTE_EVENT_QUEUE_CFG_*_ONLY
> > - Introduce a new RTE_EVENT_DEV_CAP_ to denote
> RTE_EVENT_QUEUE_CFG_ALL_TYPES cap
> > ability
> > - add sched_type in struct rte_event_queue_conf. If capability flag is
> >   not set then implementation takes sched_type value for the queue.
> >
> > Any thoughts?
> 
> 
> Not sure here, would it be ok for the moment, and we can work on a patch in
> the future?

OK

> >> +
> >> +    if (tx_core[lcore_id] && (tx_single ||
> >> +        rte_atomic32_cmpset(&tx_lock, 0, 1))) {
> >> +        consumer();
> >
> > Should consumer() need to come in this pattern? I am thinking like
> > if events is from last stage then call consumer() in worker()
> >
> > I think, above scheme works better when the _same_ worker code need to run
> the
> > case where
> > 1) ethdev HW is capable to enqueuing the packets to same txq from
> >   multiple thread
> > 2) ethdev is not capable to do so.
> >
> > So, The above cases can be addressed in configuration time where we link
> > the queues to port
> > case 1) Link all workers to last queue
> > case 2) Link only worker to last queue
> >
> > and keeping the common worker code.
> >
> > HW implementation has functional and performance issue if "two" ports are
> > assigned to one lcore for dequeue. The above scheme fixes that problem
> too.
> 
> 
> Can we have a bit more discussion on this item? Is this needed for this
> sample app, or can we perhaps work a patch for this later? Harry?

As explained above, Is there any issue in keeping consumer() for last
stage ?

> 
> 
> >
> >> +        rte_atomic32_clear((rte_atomic32_t *)&tx_lock);
> >> +    }
> >> +}
> >> +
> >> +static int
> >> +worker(void *arg)
> >> +{
> >> +    struct rte_event events[BATCH_SIZE];
> >> +
> >> +    struct worker_data *data = (struct worker_data *)arg;
> >> +    uint8_t dev_id = data->dev_id;
> >> +    uint8_t port_id = data->port_id;
> >> +    size_t sent = 0, received = 0;
> >> +    unsigned lcore_id = rte_lcore_id();
> >> +
> >> +    while (!done) {
> >> +        uint16_t i;
> >> +
> >> +        schedule_devices(dev_id, lcore_id);
> >> +
> >> +        if (!worker_core[lcore_id]) {
> >> +            rte_pause();
> >> +            continue;
> >> +        }
> >> +
> >> +        uint16_t nb_rx = rte_event_dequeue_burst(dev_id, port_id,
> >> +                events, RTE_DIM(events), 0);
> >> +
> >> +        if (nb_rx == 0) {
> >> +            rte_pause();
> >> +            continue;
> >> +        }
> >> +        received += nb_rx;
> >> +
> >> +        for (i = 0; i < nb_rx; i++) {
> >> +            struct ether_hdr *eth;
> >> +            struct ether_addr addr;
> >> +            struct rte_mbuf *m = events[i].mbuf;
> >> +
> >> +            /* The first worker stage does classification */
> >> +            if (events[i].queue_id == qid[0])
> >> +                events[i].flow_id = m->hash.rss % num_fids;
> >
> > Not sure why we need do(shrinking the flows) this in worker() in queue
> based pipeline.
> > If an PMD has any specific requirement on num_fids,I think, we
> > can move this configuration stage or PMD can choose optimum fid internally
> to
> > avoid modulus operation tax in fastpath in all PMD.
> >
> > Does struct rte_event_queue_conf.nb_atomic_flows help here?
> 
> In my tests the modulus makes very little difference in the throughput. And
> I think it's good to have a way of varying the number of flows for testing
> different scenarios, even if it's not the most performant.

Not sure.

> 
> >
> >> +
> >> +            events[i].queue_id = next_qid[events[i].queue_id];
> >> +            events[i].op = RTE_EVENT_OP_FORWARD;
> >
> > missing events[i].sched_type.HW PMD does not work with this.
> > I think, we can use similar scheme like next_qid for next_sched_type.
> 
> Done. added events[i].sched_type = queue_type.
> 
> >
> >> +
> >> +            /* change mac addresses on packet (to use mbuf data) */
> >> +            eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
> >> +            ether_addr_copy(&eth->d_addr, &addr);
> >> +            ether_addr_copy(&eth->s_addr, &eth->d_addr);
> >> +            ether_addr_copy(&addr, &eth->s_addr);
> >
> > IMO, We can make packet processing code code as "static inline function"
> so
> > different worker types can reuse.
> 
> Done. moved out to a work() function.

I think, mac swap should do in last stage, not on each forward.
ie. With existing code, 2 stage forward makes in original order.

> 
> >
> >> +
> >> +            /* do a number of cycles of work per packet */
> >> +            volatile uint64_t start_tsc = rte_rdtsc();
> >> +            while (rte_rdtsc() < start_tsc + worker_cycles)
> >> +                rte_pause();
> >
> > Ditto.
> 
> Done. moved out to a work() function.
> 
> >
> > I think, All worker specific variables like "worker_cycles" can moved into
> > one structure and use.
> >
> >> +        }
> >> +        uint16_t nb_tx = rte_event_enqueue_burst(dev_id, port_id,
> >> +                events, nb_rx);
> >> +        while (nb_tx < nb_rx && !done)
> >> +            nb_tx += rte_event_enqueue_burst(dev_id, port_id,
> >> +                            events + nb_tx,
> >> +                            nb_rx - nb_tx);
> >> +        sent += nb_tx;
> >> +    }
> >> +
> >> +    if (!quiet)
> >> +        printf("  worker %u thread done. RX=%zu TX=%zu\n",
> >> +                rte_lcore_id(), received, sent);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +/*
> >> + * Parse the coremask given as argument (hexadecimal string) and fill
> >> + * the global configuration (core role and core count) with the parsed
> >> + * value.
> >> + */
> >> +static int xdigit2val(unsigned char c)
> >
> > multiple instance of "xdigit2val" in DPDK repo. May be we can push this
> > as common code.
> 
> Sure, that's something we can look at in a separate patch, now that it's
> being used more and more.

make sense.

> 
> >
> >> +{
> >> +    int val;
> >> +
> >> +    if (isdigit(c))
> >> +        val = c - '0';
> >> +    else if (isupper(c))
> >> +        val = c - 'A' + 10;
> >> +    else
> >> +        val = c - 'a' + 10;
> >> +    return val;
> >> +}
> >> +
> >> +
> >> +static void
> >> +usage(void)
> >> +{
> >> +    const char *usage_str =
> >> +        "  Usage: eventdev_demo [options]\n"
> >> +        "  Options:\n"
> >> +        "  -n, --packets=N              Send N packets (default ~32M), 0
> implies no limit\n"
> >> +        "  -f, --atomic-flows=N         Use N random flows from 1 to N
> (default 16)\n"
> >
> > I think, this parameter now, effects the application fast path code.I
> think,
> > it should eventdev configuration para-mater.
> 
> See above comment on num_fids
> 
> >
> >> +        "  -s, --num_stages=N           Use N atomic stages (default
> 1)\n"
> >> +        "  -r, --rx-mask=core mask      Run NIC rx on CPUs in core
> mask\n"
> >> +        "  -w, --worker-mask=core mask  Run worker on CPUs in core
> mask\n"
> >> +        "  -t, --tx-mask=core mask      Run NIC tx on CPUs in core
> mask\n"
> >> +        "  -e  --sched-mask=core mask   Run scheduler on CPUs in core
> mask\n"
> >> +        "  -c  --cq-depth=N             Worker CQ depth (default 16)\n"
> >> +        "  -W  --work-cycles=N          Worker cycles (default 0)\n"
> >> +        "  -P  --queue-priority         Enable scheduler queue
> prioritization\n"
> >> +        "  -o, --ordered                Use ordered scheduling\n"
> >> +        "  -p, --parallel               Use parallel scheduling\n"
> >
> > IMO, all stage being "parallel" or "ordered" or "atomic" is one mode of
> > operation. It is valid have to any combination. We need to express that in
> > command like
> > example:
> > 3 stage with
> > O->A->P
> 
> How about we add an option that specifies the mode of operation for each
> stage in a string? Maybe have a '-m' option (modes) e.g. '-m appo' for 4
> stages with atomic, parallel, paralled, ordered. Or maybe reuse your
> test-eventdev parameter style?

Any scheme is fine.

> 
> >
> >> +        "  -q, --quiet                  Minimize printed output\n"
> >> +        "  -D, --dump                   Print detailed statistics before
> exit"
> >> +        "\n";
> >> +    fprintf(stderr, "%s", usage_str);
> >> +    exit(1);
> >> +}
> >> +
> >
> > [...]
> >
> >> +            rx_single = (popcnt == 1);
> >> +            break;
> >> +        case 't':
> >> +            tx_lcore_mask = parse_coremask(optarg);
> >> +            popcnt = __builtin_popcountll(tx_lcore_mask);
> >> +            tx_single = (popcnt == 1);
> >> +            break;
> >> +        case 'e':
> >> +            sched_lcore_mask = parse_coremask(optarg);
> >> +            popcnt = __builtin_popcountll(sched_lcore_mask);
> >> +            sched_single = (popcnt == 1);
> >> +            break;
> >> +        default:
> >> +            usage();
> >> +        }
> >> +    }
> >> +
> >> +    if (worker_lcore_mask == 0 || rx_lcore_mask == 0 ||
> >> +        sched_lcore_mask == 0 || tx_lcore_mask == 0) {
> >
> >> +
> >> +    /* Q creation - one load balanced per pipeline stage*/
> >> +
> >> +    /* set up one port per worker, linking to all stage queues */
> >> +    for (i = 0; i < num_workers; i++) {
> >> +        struct worker_data *w = &worker_data[i];
> >> +        w->dev_id = dev_id;
> >> +        if (rte_event_port_setup(dev_id, i, &wkr_p_conf) < 0) {
> >> +            printf("Error setting up port %d\n", i);
> >> +            return -1;
> >> +        }
> >> +
> >> +        uint32_t s;
> >> +        for (s = 0; s < num_stages; s++) {
> >> +            if (rte_event_port_link(dev_id, i,
> >> +                        &worker_queues[s].queue_id,
> >> +                        &worker_queues[s].priority,
> >> +                        1) != 1) {
> >> +                printf("%d: error creating link for port %d\n",
> >> +                        __LINE__, i);
> >> +                return -1;
> >> +            }
> >> +        }
> >> +        w->port_id = i;
> >> +    }
> >> +    /* port for consumer, linked to TX queue */
> >> +    if (rte_event_port_setup(dev_id, i, &tx_p_conf) < 0) {
> >
> > If ethdev supports MT txq queue support then this port can be linked to
> > worker too. something to consider for future.
> >
> 
> Sure. No change for now.

OK
Previous message: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added sample app
Next message: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added sample app
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list