[dpdk-dev] [PATCH 2/3] app/compress-perf: add performance measurement

Daly, Lee lee.daly at intel.com
Mon Oct 15 17:10:22 CEST 2018


Thanks for your input, Shally. See comments below.


I will be reviewing these changes while Tomasz is out this week.

> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Verma, Shally
> Sent: Friday, October 12, 2018 11:16 AM
> To: Jozwiak, TomaszX <tomaszx.jozwiak at intel.com>; dev at dpdk.org; Trahe,
> Fiona <fiona.trahe at intel.com>; akhil.goyal at nxp.com; De Lara Guarch, Pablo
> <pablo.de.lara.guarch at intel.com>
> Subject: Re: [dpdk-dev] [PATCH 2/3] app/compress-perf: add performance
> measurement
> 
> Hi TomaszX
> 
> Sorry for the delay in response. Comments inline.
> 

<...>
> >+static int
> >+comp_perf_check_capabilities(struct comp_test_data *test_data)
> >+{
> >+       const struct rte_compressdev_capabilities *cap;
> >+
> >+       cap = rte_compressdev_capability_get(test_data->cdev_id,
> >+                                            RTE_COMP_ALGO_DEFLATE);
> >+
> >+       if (cap == NULL) {
> >+               RTE_LOG(ERR, USER1,
> >+                       "Compress device does not support DEFLATE\n");
> >+               return -1;
> >+       }
> >+
> >+       uint64_t comp_flags = cap->comp_feature_flags;
> >+
> >+       /* Huffman encoding */
> >+       if (test_data->huffman_enc == RTE_COMP_HUFFMAN_FIXED &&
> >+                       (comp_flags & RTE_COMP_FF_HUFFMAN_FIXED) == 0) {
> >+               RTE_LOG(ERR, USER1,
> >+                       "Compress device does not support Fixed Huffman\n");
> >+               return -1;
> >+       }
> >+
> >+       if (test_data->huffman_enc == RTE_COMP_HUFFMAN_DYNAMIC &&
> >+                       (comp_flags & RTE_COMP_FF_HUFFMAN_DYNAMIC) == 0) {
> >+               RTE_LOG(ERR, USER1,
> >+                       "Compress device does not support Dynamic Huffman\n");
> >+               return -1;
> >+       }
> >+
> >+       /* Window size */
> >+       if (test_data->window_sz != -1) {
> >+               if (param_range_check(test_data->window_sz,
> >+ &cap->window_size)
> What if cap->window_size is 0, i.e. the implementation default?
What do you mean when you say cap->window_size = 0?
cap->window_size is the range structure here (min, max and increment), which is filled out by the driver.
Our implementation default in the perf tool is to set the window size to the maximum the driver can support.
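
For reference, a minimal sketch of the kind of check param_range_check() is
expected to do against that range (the exact helper is in the patch; this
sketch assumes increment == 0 collapses the range to a single value, so an
all-zero range would only ever accept a window size of 0):

static int
param_range_check(uint16_t size, const struct rte_param_log2_range *range)
{
        unsigned int next_size;

        /* Check lower/upper bounds */
        if (size < range->min || size > range->max)
                return -1;

        /* Single-value range: the bounds check above was sufficient */
        if (range->increment == 0)
                return 0;

        /* Otherwise size must be min plus a multiple of increment */
        for (next_size = range->min; next_size <= range->max;
                        next_size += range->increment)
                if (size == next_size)
                        return 0;

        return -1;
}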

> 
> >+                               < 0) {
> >+                       RTE_LOG(ERR, USER1,
> >+                               "Compress device does not support "
> >+                               "this window size\n");
> >+                       return -1;
> >+               }
> >+       } else
> >+               /* Set window size to PMD maximum if none was specified */
> >+               test_data->window_sz = cap->window_size.max;
> >+

<...>
> >+
> >+static int
> >+comp_perf_dump_input_data(struct comp_test_data *test_data)
> >+{
> >+       FILE *f = fopen(test_data->input_file, "r");
> >+
> >+       if (f == NULL) {
> >+               RTE_LOG(ERR, USER1, "Input file could not be opened\n");
> >+               return -1;
> >+       }
> >+
> >+       if (fseek(f, 0, SEEK_END) != 0) {
> >+               RTE_LOG(ERR, USER1, "Size of input could not be calculated\n");
> >+               goto err;
> >+       }
> >+       size_t actual_file_sz = ftell(f);
> >+       /* If extended input data size has not been set,
> >+        * input data size = file size
> >+        */
> >+
> >+       if (test_data->input_data_sz == 0)
> >+               test_data->input_data_sz = actual_file_sz;
> >+
> >+       if (fseek(f, 0, SEEK_SET) != 0) {
> >+               RTE_LOG(ERR, USER1, "Size of input could not be calculated\n");
> >+               goto err;
> >+       }
> >+
> >+       test_data->input_data = rte_zmalloc_socket(NULL,
> >+                               test_data->input_data_sz, 0,
> >+ rte_socket_id());
> >+
> >+       if (test_data->input_data == NULL) {
> >+               RTE_LOG(ERR, USER1, "Memory to hold the data from the input "
> >+                               "file could not be allocated\n");
> >+               goto err;
> >+       }
> >+
> >+       size_t remaining_data = test_data->input_data_sz;
> >+       uint8_t *data = test_data->input_data;
> >+
> >+       while (remaining_data > 0) {
> >+               size_t data_to_read = RTE_MIN(remaining_data,
> >+ actual_file_sz);
> >+
> >+               if (fread(data, data_to_read, 1, f) != 1) {
> >+                       RTE_LOG(ERR, USER1, "Input file could not be read\n");
> >+                       goto err;
> >+               }
> >+               if (fseek(f, 0, SEEK_SET) != 0) {
> >+                       RTE_LOG(ERR, USER1,
> >+                               "Size of input could not be calculated\n");
> >+                       goto err;
> >+               }
> >+               remaining_data -= data_to_read;
> >+               data += data_to_read;
> It looks like it will only run a 2nd time if the input file size < input data
> size, in which case it will just keep filling the input buffer with repeated
> data. Is that the intention here?
From what I can see, yes, this while loop is only entered a second time if the
file is smaller than the requested data size, repeating the data from the input
file as often as needed (e.g. a 4 KB file extended to 10 KB becomes its
contents repeated 2.5 times).
If we were to pad with zeros or random data instead, it would skew the
compression ratio a lot, though I do understand the ratio may still be somewhat
flattering here because of the repetition.

> 
> >+       }
> >+
> >+       if (test_data->input_data_sz > actual_file_sz)
> >+               RTE_LOG(INFO, USER1,
> >+                 "%zu bytes read from file %s, extending the file %.2f times\n",
> >+                       test_data->input_data_sz, test_data->input_file,
> >+                       (double)test_data->input_data_sz/actual_file_sz);
> >+       else
> >+               RTE_LOG(INFO, USER1,
> >+                       "%zu bytes read from file %s\n",
> >+                       test_data->input_data_sz,
> >+ test_data->input_file);
> >+
> >+       fclose(f);
> >+
> >+       return 0;
> >+
> >+err:
> >+       fclose(f);
> >+       rte_free(test_data->input_data);
> >+       test_data->input_data = NULL;
> >+
> >+       return -1;
> >+}
> >+
> >+static int
> >+comp_perf_initialize_compressdev(struct comp_test_data *test_data)
> >+{
> >+       uint8_t enabled_cdev_count;
> >+       uint8_t enabled_cdevs[RTE_COMPRESS_MAX_DEVS];
> >+
> >+       enabled_cdev_count = rte_compressdev_devices_get(test_data-
> >driver_name,
> >+                       enabled_cdevs, RTE_COMPRESS_MAX_DEVS);
> >+       if (enabled_cdev_count == 0) {
> >+               RTE_LOG(ERR, USER1, "No compress devices type %s available\n",
> >+                               test_data->driver_name);
> >+               return -EINVAL;
> >+       }
> >+
> >+       if (enabled_cdev_count > 1)
> >+               RTE_LOG(INFO, USER1,
> >+                       "Only the first compress device will be
> >+ used\n");
> >+
> >+       test_data->cdev_id = enabled_cdevs[0];
> >+
> >+       if (comp_perf_check_capabilities(test_data) < 0)
> >+               return -1;
> >+
> >+       /* Configure compressdev (one device, one queue pair) */
> >+       struct rte_compressdev_config config = {
> >+               .socket_id = rte_socket_id(),
> >+               .nb_queue_pairs = 1,
> >+               .max_nb_priv_xforms = NUM_MAX_XFORMS,
> >+               .max_nb_streams = 0
> >+       };
> >+
> >+       if (rte_compressdev_configure(test_data->cdev_id, &config) < 0) {
> >+               RTE_LOG(ERR, USER1, "Device configuration failed\n");
> >+               return -1;
> >+       }
> >+
> >+       if (rte_compressdev_queue_pair_setup(test_data->cdev_id, 0,
> >+                       NUM_MAX_INFLIGHT_OPS, rte_socket_id()) < 0) {
> >+               RTE_LOG(ERR, USER1, "Queue pair setup failed\n");
> >+               return -1;
> >+       }
> >+
> >+       if (rte_compressdev_start(test_data->cdev_id) < 0) {
> >+               RTE_LOG(ERR, USER1, "Device could not be started\n");
> >+               return -1;
> >+       }
> >+
> >+       return 0;
> >+}
> >+
> >+static int
> >+prepare_bufs(struct comp_test_data *test_data)
> >+{
> >+       uint32_t remaining_data = test_data->input_data_sz;
> >+       uint8_t *input_data_ptr = test_data->input_data;
> >+       size_t data_sz;
> >+       uint8_t *data_addr;
> >+       uint32_t i, j;
> >+
> >+       for (i = 0; i < test_data->total_bufs; i++) {
> >+               /* Allocate data in input mbuf and copy data from input file */
> >+               test_data->decomp_bufs[i] =
> >+                       rte_pktmbuf_alloc(test_data->decomp_buf_pool);
> >+               if (test_data->decomp_bufs[i] == NULL) {
> >+                       RTE_LOG(ERR, USER1, "Could not allocate mbuf\n");
> >+                       return -1;
> >+               }
> >+
> >+               data_sz = RTE_MIN(remaining_data, test_data->seg_sz);
> >+               data_addr = (uint8_t *) rte_pktmbuf_append(
> >+                                       test_data->decomp_bufs[i], data_sz);
> >+               if (data_addr == NULL) {
> >+                       RTE_LOG(ERR, USER1, "Could not append data\n");
> >+                       return -1;
> >+               }
> >+               rte_memcpy(data_addr, input_data_ptr, data_sz);
> >+
> >+               input_data_ptr += data_sz;
> >+               remaining_data -= data_sz;
> >+
> >+               /* Already one segment in the mbuf */
> >+               uint16_t segs_per_mbuf = 1;
> >+
> >+               /* Chain mbufs if needed for input mbufs */
> >+               while (segs_per_mbuf < test_data->max_sgl_segs
> >+                               && remaining_data > 0) {
> >+                       struct rte_mbuf *next_seg =
> >+
> >+ rte_pktmbuf_alloc(test_data->decomp_buf_pool);
> >+
> >+                       if (next_seg == NULL) {
> >+                               RTE_LOG(ERR, USER1,
> >+                                       "Could not allocate mbuf\n");
> >+                               return -1;
> >+                       }
> >+
> >+                       data_sz = RTE_MIN(remaining_data, test_data->seg_sz);
> >+                       data_addr = (uint8_t *)rte_pktmbuf_append(next_seg,
> >+                               data_sz);
> >+
> >+                       if (data_addr == NULL) {
> >+                               RTE_LOG(ERR, USER1, "Could not append
> >+ data\n");
> Since a new buffer is allocated per segment, is it even possible for append
> to fail? I think this check is redundant here.

True.
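
Since each next_seg comes fresh from the pool with seg_sz bytes of room and
data_sz is capped at seg_sz, the append cannot return NULL here, so the body
could shrink to something like this (sketch only, reusing the patch's
variables):

        data_sz = RTE_MIN(remaining_data, test_data->seg_sz);
        /* Cannot fail: next_seg is freshly allocated and
         * data_sz <= seg_sz <= tailroom by construction.
         */
        data_addr = (uint8_t *)rte_pktmbuf_append(next_seg, data_sz);
        rte_memcpy(data_addr, input_data_ptr, data_sz);
        input_data_ptr += data_sz;
        remaining_data -= data_sz;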

> >+                               return -1;
> >+                       }
> >+
> >+                       rte_memcpy(data_addr, input_data_ptr, data_sz);
> >+                       input_data_ptr += data_sz;
> >+                       remaining_data -= data_sz;
> >+
> >+                       if (rte_pktmbuf_chain(test_data->decomp_bufs[i],
> >+                                       next_seg) < 0) {
> >+                               RTE_LOG(ERR, USER1, "Could not chain mbufs\n");
> >+                               return -1;
> >+                       }
> >+                       segs_per_mbuf++;
> >+               }
> >+

<...>
> >+
> >+       /* Create private xform */
> >+       if (rte_compressdev_private_xform_create(dev_id, &xform,
> >+                       &priv_xform) < 0) {
> >+               RTE_LOG(ERR, USER1, "Private xform could not be created\n");
> >+               res = -1;
> >+               goto end;
> >+       }
> >+
> >+       uint64_t tsc_start, tsc_end, tsc_duration;
> >+
> >+       tsc_start = tsc_end = tsc_duration = 0;
> >+       if (benchmarking) {
> >+               tsc_start = rte_rdtsc();
> >+               num_iter = test_data->num_iter;
> >+       } else
> >+               num_iter = 1;
> Looks like the same code is doing both benchmarking and functional
> validation. It could be reorganised to keep the validation test separate,
> as is done in crypto-perf.
Good point, will keep that in mind when doing these changes.
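
Something along these lines, mirroring app/test-crypto-perf (all names in this
sketch are invented for illustration):

/* Sketch: dispatch to separate benchmark/verify loops instead of
 * threading a 'benchmarking' flag through one big function.
 */
enum comp_test_type {
        CPERF_TEST_TYPE_BENCHMARK,
        CPERF_TEST_TYPE_VERIFY,
};

typedef int (*comp_test_fn)(struct comp_test_data *test_data);

static int cperf_benchmark(struct comp_test_data *test_data); /* timing only */
static int cperf_verify(struct comp_test_data *test_data);    /* checks data */

static const comp_test_fn comp_tests[] = {
        [CPERF_TEST_TYPE_BENCHMARK] = cperf_benchmark,
        [CPERF_TEST_TYPE_VERIFY] = cperf_verify,
};

/* main() would then simply do:
 *         ret = comp_tests[test_data->test](test_data);
 */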

> 
> >+
> >+       for (iter = 0; iter < num_iter; iter++) {
> >+               uint32_t total_ops = test_data->total_bufs;
> >+               uint32_t remaining_ops = test_data->total_bufs;
> >+               uint32_t total_deq_ops = 0;
> >+               uint32_t total_enq_ops = 0;
> >+               uint16_t ops_unused = 0;
> >+               uint16_t num_enq = 0;
> >+               uint16_t num_deq = 0;
> >+
> >+               output_size = 0;
> >+
> >+               while (remaining_ops > 0) {
> >+                       uint16_t num_ops = RTE_MIN(remaining_ops,
> >+                                                  test_data->burst_sz);
> >+                       uint16_t ops_needed = num_ops - ops_unused;
> >+
> >+                       /*
> >+                        * Move the unused operations from the previous
> >+                        * enqueue_burst call to the front, to maintain order
> >+                        */
> >+                       if ((ops_unused > 0) && (num_enq > 0)) {
> >+                               size_t nb_b_to_mov =
> >+                                     ops_unused * sizeof(struct
> >+ rte_comp_op *);
> >+
> >+                               memmove(ops, &ops[num_enq], nb_b_to_mov);
> >+                       }
> >+
> >+                       /* Allocate compression operations */
> >+                       if (ops_needed && !rte_comp_op_bulk_alloc(
> >+                                               test_data->op_pool,
> >+                                               &ops[ops_unused],
> >+                                               ops_needed)) {
> >+                               RTE_LOG(ERR, USER1,
> >+                                     "Could not allocate enough operations\n");
> >+                               res = -1;
> >+                               goto end;
> >+                       }
> >+                       allocated += ops_needed;
> >+
> >+                       for (i = 0; i < ops_needed; i++) {
> >+                               /*
> >+                                * Calculate next buffer to attach to operation
> >+                                */
> >+                               uint32_t buf_id = total_enq_ops + i +
> >+                                               ops_unused;
> >+                               uint16_t op_id = ops_unused + i;
> >+                               /* Reset all data in output buffers */
> >+                               struct rte_mbuf *m =
> >+ output_bufs[buf_id];
> >+
> >+                               m->pkt_len = test_data->seg_sz *
> >+ m->nb_segs;
> Isn't pkt_len set already when we call rte_pktmbuf_append() and chain()?
> 
> >+                               while (m) {
> >+                                       m->data_len = m->buf_len -
> >+ m->data_off;
> Same question, shouldn't rte_pktmbuf_append() adjust data_len as well for
> each mbuf?
Yes, you are correct. rte_pktmbuf_append() already sets data_len on each
segment and pkt_len on the head, so from what I can see this reset loop, and
with it the *m mbuf pointer, is redundant.

> 
> >+                                       m = m->next;
> >+                               }
> >+                               ops[op_id]->m_src = input_bufs[buf_id];
> >+                               ops[op_id]->m_dst = output_bufs[buf_id];
> >+                               ops[op_id]->src.offset = 0;
> >+                               ops[op_id]->src.length =
> >+                                       rte_pktmbuf_pkt_len(input_bufs[buf_id]);
> >+                               ops[op_id]->dst.offset = 0;
> >+                               ops[op_id]->flush_flag = RTE_COMP_FLUSH_FINAL;
> >+                               ops[op_id]->input_chksum = buf_id;
> >+                               ops[op_id]->private_xform = priv_xform;
> >+                       }
> >+
> >+                       num_enq = rte_compressdev_enqueue_burst(dev_id, 0, ops,
> >+                                                               num_ops);
> >+                       ops_unused = num_ops - num_enq;
> >+                       remaining_ops -= num_enq;
> >+                       total_enq_ops += num_enq;
> >+
> >+                       num_deq = rte_compressdev_dequeue_burst(dev_id, 0,
> >+                                                          deq_ops,
> >+                                                          test_data->burst_sz);
> >+                       total_deq_ops += num_deq;
> >+                       if (benchmarking == 0) {
> >+                               for (i = 0; i < num_deq; i++) {
> >+                                       struct rte_comp_op *op = deq_ops[i];
> >+                                       const void *read_data_addr =
> >+                                               rte_pktmbuf_read(op->m_dst, 0,
> >+                                               op->produced, output_data_ptr);
> >+                                       if (read_data_addr == NULL) {
> >+                                               RTE_LOG(ERR, USER1,
> >+                                     "Could not copy buffer in destination\n");
> >+                                               res = -1;
> >+                                               goto end;
> >+                                       }
> >+
> >+                                       if (read_data_addr != output_data_ptr)
> >+                                               rte_memcpy(output_data_ptr,
> >+                                                       rte_pktmbuf_mtod(
> >+                                                         op->m_dst, uint8_t *),
> >+                                                       op->produced);
> >+                                       output_data_ptr += op->produced;
> >+                                       output_size += op->produced;
> >+
> >+                               }
> >+                       }
> >+
> >+                       if (iter == num_iter - 1) {
> >+                               for (i = 0; i < num_deq; i++) {
> Why is it only for the last iteration that we are adjusting the dst mbuf data_len?
> Shouldn't it be done for each dequeued op?
> And, for benchmarking, do we even need to set data and pkt len on dst
> mbuf?
I assume data_len is only adjusted on the last iteration for the reason you
gave: during benchmarking it saves cycles.
Does it need to be done at all? Possibly not.
> 
> >+                                       struct rte_comp_op *op = deq_ops[i];
> >+                                       struct rte_mbuf *m = op->m_dst;
> >+
> >+                                       m->pkt_len = op->produced;
> >+                                       uint32_t remaining_data = op->produced;
> >+                                       uint16_t data_to_append;
> >+
> >+                                       while (remaining_data > 0) {
> >+                                               data_to_append =
> >+                                                       RTE_MIN(remaining_data,
> >+                                                            test_data->seg_sz);
> >+                                               m->data_len = data_to_append;
> >+                                               remaining_data -=
> >+                                                               data_to_append;
> >+                                               m = m->next;
> Should break if m->next == NULL
Yup, you are correct, there should be a check here.
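
Something like this should do it (sketch of the corrected trim loop):

        struct rte_mbuf *m = op->m_dst;
        uint32_t remaining_data = op->produced;
        uint16_t data_to_append;

        m->pkt_len = op->produced;
        /* Trim each segment, stopping early if the chain runs out */
        while (remaining_data > 0 && m != NULL) {
                data_to_append = RTE_MIN(remaining_data, test_data->seg_sz);
                m->data_len = data_to_append;
                remaining_data -= data_to_append;
                m = m->next;
        }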
> >+                                       }
> >+                               }
> >+                       }
> >+                       rte_mempool_put_bulk(test_data->op_pool,
> >+                                            (void **)deq_ops, num_deq);
> >+                       allocated -= num_deq;
> >+               }
> >+
> >+               /* Dequeue the last operations */
> >+               while (total_deq_ops < total_ops) {
> >+                       num_deq = rte_compressdev_dequeue_burst(dev_id, 0,
> >+                                               deq_ops, test_data->burst_sz);
> >+                       total_deq_ops += num_deq;
> >+                       if (benchmarking == 0) {
> >+                               for (i = 0; i < num_deq; i++) {
> >+                                       struct rte_comp_op *op = deq_ops[i];
> >+                                       const void *read_data_addr =
> >+                                               rte_pktmbuf_read(op->m_dst, 0,
> >+                                               op->produced, output_data_ptr);
> >+                                       if (read_data_addr == NULL) {
> >+                                               RTE_LOG(ERR, USER1,
> >+                                     "Could not copy buffer in destination\n");
> >+                                               res = -1;
> >+                                               goto end;
> >+                                       }
> >+
> >+                                       if (read_data_addr != output_data_ptr)
> >+                                               rte_memcpy(output_data_ptr,
> >+                                                       rte_pktmbuf_mtod(
> >+                                                       op->m_dst, uint8_t *),
> >+                                                       op->produced);
> >+                                       output_data_ptr += op->produced;
> >+                                       output_size += op->produced;
> >+
> >+                               }
> >+                       }
> >+
> >+                       if (iter == num_iter - 1) {
> >+                               for (i = 0; i < num_deq; i++) {
> >+                                       struct rte_comp_op *op = deq_ops[i];
> >+                                       struct rte_mbuf *m = op->m_dst;
> >+
> >+                                       m->pkt_len = op->produced;
> >+                                       uint32_t remaining_data = op->produced;
> >+                                       uint16_t data_to_append;
> >+
> >+                                       while (remaining_data > 0) {
> >+                                               data_to_append =
> >+                                               RTE_MIN(remaining_data,
> >+                                                       test_data->seg_sz);
> >+                                               m->data_len = data_to_append;
> >+                                               remaining_data -=
> >+                                                               data_to_append;
> >+                                               m = m->next;
> >+                                       }
> >+                               }
> >+                       }
> >+                       rte_mempool_put_bulk(test_data->op_pool,
> >+                                            (void **)deq_ops, num_deq);
> >+                       allocated -= num_deq;
> >+               }
> >+       }
> >+
> >+       if (benchmarking) {
> >+               tsc_end = rte_rdtsc();
> >+               tsc_duration = tsc_end - tsc_start;
> >+
> >+               if (type == RTE_COMP_COMPRESS)
> The test covers stateless operations only, so can we add a perf test type,
> e.g. test type: perf, op type: STATELESS/STATEFUL?
Are you asking for the tool to support stateful ops? Since no drivers support
stateful yet, we just wanted to ensure current driver functionality was covered
in this first version.

> Also, why do we need --max-num-sgl-segs as an input option from the user?
> Shouldn't input_sz and seg_sz internally decide on num-segs?
> Or is it added to serve some other purpose?
I will have to get back to you on this one. It seems illogical to get this
input from the user, but I will investigate further to find out whether it
serves a different purpose.
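
If the option did go away, the segment count could presumably be derived
internally instead, e.g. (sketch; buf_data_sz is a stand-in for however many
bytes one mbuf chain is meant to carry):

        /* Sketch: ceil(bytes per chain / segment size) gives the number
         * of segments, with no user-supplied cap needed.
         */
        uint32_t buf_data_sz = test_data->input_data_sz /
                        test_data->total_bufs;          /* assumption */
        uint16_t segs_per_mbuf = (buf_data_sz + test_data->seg_sz - 1) /
                        test_data->seg_sz;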
> 
> Thanks
> Shally
> 
Thanks for the feedback,
we hope to get a v2 sent ASAP.



