[PATCH v3] app/dma-perf: introduce dma-perf application
    Jiang, Cheng1 
    cheng1.jiang at intel.com
       
    Mon Feb  6 15:20:13 CET 2023
    
    
  
Hi Bruce,
Replies are inline,
Thank,
Cheng
> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson at intel.com>
> Sent: Monday, January 30, 2023 5:20 PM
> To: Jiang, Cheng1 <cheng1.jiang at intel.com>
> Cc: thomas at monjalon.net; mb at smartsharesystems.com; dev at dpdk.org;
> Hu, Jiayu <jiayu.hu at intel.com>; Ding, Xuan <xuan.ding at intel.com>; Ma,
> WenwuX <wenwux.ma at intel.com>; Wang, YuanX
> <yuanx.wang at intel.com>; He, Xingguang <xingguang.he at intel.com>
> Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
> 
> On Sat, Jan 28, 2023 at 01:32:05PM +0000, Jiang, Cheng1 wrote:
> > Hi Bruce,
> >
> > Sorry for the late reply. We are in the Spring Festival holiday last week.
> > Thanks for your comments.
> > Replies are inline.
> >
> > Thanks,
> > Cheng
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce <bruce.richardson at intel.com>
> > > Sent: Wednesday, January 18, 2023 12:52 AM
> > > To: Jiang, Cheng1 <cheng1.jiang at intel.com>
> > > Cc: thomas at monjalon.net; mb at smartsharesystems.com;
> dev at dpdk.org; Hu,
> > > Jiayu <jiayu.hu at intel.com>; Ding, Xuan <xuan.ding at intel.com>; Ma,
> > > WenwuX <wenwux.ma at intel.com>; Wang, YuanX
> <yuanx.wang at intel.com>;
> > > He, Xingguang <xingguang.he at intel.com>
> > > Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
> > >
> > > On Tue, Jan 17, 2023 at 12:05:26PM +0000, Cheng Jiang wrote:
> > > > There are many high-performance DMA devices supported in DPDK
> now,
> > > > and these DMA devices can also be integrated into other modules of
> > > > DPDK as accelerators, such as Vhost. Before integrating DMA into
> > > > applications, developers need to know the performance of these DMA
> > > > devices in various scenarios and the performance of CPUs in the
> > > > same scenario, such as different buffer lengths. Only in this way
> > > > can we know the target performance of the application accelerated
> > > > by using them. This patch introduces a high-performance testing
> > > > tool, which supports comparing the performance of CPU and DMA in
> > > > different scenarios automatically with a pre-set config file.
> > > > Memory Copy performance test are
> > > supported for now.
> > > >
> > > > Signed-off-by: Cheng Jiang <cheng1.jiang at intel.com>
> > > > Signed-off-by: Jiayu Hu <jiayu.hu at intel.com>
> > > > Signed-off-by: Yuan Wang <yuanx.wang at intel.com>
> > > > Acked-by: Morten Brørup <mb at smartsharesystems.com>
> > > > ---
> > >
> > > More input based off trying running the application, including some
> > > thoughts on the testing methodology below.
> > >
> > >
> > > > +static void
> > > > +output_result(uint8_t scenario_id, uint32_t lcore_id, uint16_t
> > > > +dev_id,
> > > uint64_t ave_cycle,
> > > > +			uint32_t buf_size, uint32_t nr_buf, uint32_t memory,
> > > > +			float bandwidth, uint64_t ops, bool is_dma) {
> > > > +	if (is_dma)
> > > > +		printf("lcore %u, DMA %u:\n"
> > > > +				"average cycles: %" PRIu64 ","
> > > > +				" buffer size: %u, nr_buf: %u,"
> > > > +				" memory: %uMB, frequency: %" PRIu64
> ".\n",
> > > > +				lcore_id,
> > > > +				dev_id,
> > > > +				ave_cycle,
> > > > +				buf_size,
> > > > +				nr_buf,
> > > > +				memory,
> > > > +				rte_get_timer_hz());
> > > > +	else
> > > > +		printf("lcore %u\n"
> > > > +			"average cycles: %" PRIu64 ","
> > > > +			" buffer size: %u, nr_buf: %u,"
> > > > +			" memory: %uMB, frequency: %" PRIu64 ".\n",
> > > > +			lcore_id,
> > > > +			ave_cycle,
> > > > +			buf_size,
> > > > +			nr_buf,
> > > > +			memory,
> > > > +			rte_get_timer_hz());
> > > > +
> > >
> > > The term "average cycles" is unclear here - is it average cycles per
> > > test iteration, or average cycles per buffer copy?
> >
> > The average cycles stands for average cycles per buffer copy, I'll clarify it in
> the next version.
> >
> > >
> > >
> > > > +	printf("Average bandwidth: %.3lfGbps, OPS: %" PRIu64 "\n",
> > > > +bandwidth, ops);
> > > > +
> > >
> > > <snip>
> > >
> > > > +
> > > > +static inline void
> > > > +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t
> > > > +kick_batch,
> > > uint32_t buf_size,
> > > > +			uint16_t mpool_iter_step, struct rte_mbuf **srcs,
> > > struct rte_mbuf
> > > > +**dsts) {
> > > > +	int64_t async_cnt = 0;
> > > > +	int nr_cpl = 0;
> > > > +	uint32_t index;
> > > > +	uint16_t offset;
> > > > +	uint32_t i;
> > > > +
> > > > +	for (offset = 0; offset < mpool_iter_step; offset++) {
> > > > +		for (i = 0; index = i * mpool_iter_step + offset, index <
> > > > +nr_buf;
> > > i++) {
> > > > +			if (unlikely(rte_dma_copy(dev_id,
> > > > +						0,
> > > > +						srcs[index]->buf_iova +
> > > srcs[index]->data_off,
> > > > +						dsts[index]->buf_iova +
> > > dsts[index]->data_off,
> > > > +						buf_size,
> > > > +						0) < 0)) {
> > > > +				rte_dma_submit(dev_id, 0);
> > > > +				while (rte_dma_burst_capacity(dev_id, 0) ==
> 0)
> > > {
> > > > +					nr_cpl = rte_dma_completed(dev_id,
> 0,
> > > MAX_DMA_CPL_NB,
> > > > +								NULL, NULL);
> > > > +					async_cnt -= nr_cpl;
> > > > +				}
> > > > +				if (rte_dma_copy(dev_id,
> > > > +						0,
> > > > +						srcs[index]->buf_iova +
> > > srcs[index]->data_off,
> > > > +						dsts[index]->buf_iova +
> > > dsts[index]->data_off,
> > > > +						buf_size,
> > > > +						0) < 0) {
> > > > +					printf("enqueue fail again at %u\n",
> > > index);
> > > > +					printf("space:%d\n",
> > > rte_dma_burst_capacity(dev_id, 0));
> > > > +					rte_exit(EXIT_FAILURE, "DMA
> enqueue
> > > failed\n");
> > > > +				}
> > > > +			}
> > > > +			async_cnt++;
> > > > +
> > > > +			/**
> > > > +			 * When '&' is used to wrap an index, mask must be a
> > > power of 2.
> > > > +			 * That is, kick_batch must be 2^n.
> > > > +			 */
> > > > +			if (unlikely((async_cnt % kick_batch) == 0)) {
> > > > +				rte_dma_submit(dev_id, 0);
> > > > +				/* add a poll to avoid ring full */
> > > > +				nr_cpl = rte_dma_completed(dev_id, 0,
> > > MAX_DMA_CPL_NB, NULL, NULL);
> > > > +				async_cnt -= nr_cpl;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		rte_dma_submit(dev_id, 0);
> > > > +		while (async_cnt > 0) {
> > > > +			nr_cpl = rte_dma_completed(dev_id, 0,
> > > MAX_DMA_CPL_NB, NULL, NULL);
> > > > +			async_cnt -= nr_cpl;
> > > > +		}
> > >
> > > I have a couple of concerns about the methodology for testing the HW
> > > DMA performance. For example, the inclusion of that final block
> > > means that we are including the latency of the copy operation in the
> result.
> > >
> > > If the objective of the test application is to determine if it is
> > > cheaper for software to offload a copy operation to HW or do it in
> > > SW, then the primary concern is the HW offload cost. That offload
> > > cost should remain constant irrespective of the size of the copy -
> > > since all you are doing is writing a descriptor and reading a
> > > completion result. However, seeing the results of running the app, I
> > > notice that the reported average cycles increases as the packet size
> > > increases, which would tend to indicate that we are not giving a realistic
> measurement of offload cost.
> >
> > We are trying to compare the time required to complete a certain
> > amount of work using DMA with the time required to complete it using
> > CPU. I think in addition to the offload cost, the capability of the DMA itself
> is also an important factor to be considered.
> > The offload cost should be constant , but when DMA copies memory of
> > different lengths, the time costs are different. So the reported average
> cycles increases as the packet size increases.
> > Therefore, this test result includes both offload cost and DMA
> > operation cost. To some extent, it should be a relative realistic
> measurement result.
> >
> > Do you think it makes sense to you?
> >
> 
> Hi,
> 
> Yes, I get your point about the job latency being different when the
> packet/copy sizes increase, but on the other hand, as I state above the actual
> cycle cost to the application should not increase. If any application is doing
> what this test app is doing, just sitting around waiting for job completion (in
> the fast path), then it is likely that the programmer should look at improving
> the offload into the app.
> 
> The main issue here is that by outputting a single number, you are mixing
> two separate values - both offload cost and job latency. If you want to show
> the effects of larger/smaller packets on both, then you should output both
> values separately. For most applications where you will offload copies and do
> other work while the copy is being done, the offload cost is of primary
> concern. For some applications the latency figure may also be important, but
> in those cases the user will want to see the latency called out explicitly, not
> just mixed up in a single figure with offload cost.
Sure, makes sense to me, thanks.
> 
> > >
> > > The trouble then becomes how to do so in a more realistic manner.
> > > The most accurate way I can think of in a unit test like this is to
> > > offload <queue_size> entries to the device and measure the cycles
> > > taken there. Then wait until such time as all copies are completed
> > > (to eliminate the latency time, which in a real- world case would be
> > > spent by a core doing something else), and then do a second
> > > measurement of the time taken to process all the completions. In the
> > > same way as for a SW copy, any time not spent in memcpy is not copy
> > > time, for HW copies any time spent not writing descriptors or reading
> completions is not part of the offload cost.
> >
> > Agreed, we are thinking about adding offload cost as one of test results in
> the future.
> >
> > >
> > > That said, doing the above is still not fully realistic, as a
> > > real-world app will likely still have some amount of other overhead,
> > > for example, polling occasionally for completions in between doing
> > > other work (though one would expect this to be relatively cheap).
> > > Similarly, if the submission queue fills, the app may have to delay
> > > waiting for space to submit jobs, and therefore see some of the HW copy
> latency.
> > >
> > > Therefore, I think the most realistic way to measure this is to look
> > > at the rate of operations while processing is being done in the
> > > middle of the test. For example, if we have a simple packet
> > > processing application, running the application just doing RX and TX
> > > and measuring the rate allows us to determine the basic packet I/O
> > > cost. Adding in an offload to HW for each packet and again measuring
> > > the rate, will then allow us to compute the true offload copy cost
> > > of the operation, and should give us a number that remains flat even
> > > as packet size increases. For previous work done on vhost with DMA
> > > acceleration, I believe we saw exactly that - while SW PPS reduced as
> packet size increased, with HW copies the PPS remained constant even as
> packet size increased.
> > >
> > > The challenge to my mind, is therefore how to implement this in a
> > > suitable unit- test style way, to fit into the framework you have
> > > given here. I would suggest that the actual performance measurement
> > > needs to be done - not on a total time - but on a fixed time basis
> > > within each test. For example, when doing HW copies, 1ms into each
> > > test run, we need to snapshot the completed entries, and then say
> > > 1ms later measure the number that have been completed since. In this
> > > way, we avoid the initial startup latency while we wait for jobs to start
> completing, and we avoid the final latency as we await the last job to
> complete.
> > > We would also include time for some potentially empty polls, and if
> > > a queue size is too small see that reflected in the performance too.
> >
> > I understand your concerns, but I think maybe we are not discussing the
> same performance number here.
> > We are trying to test the maximum bandwidth of DMA, and what you said
> is how to measure the offload cost more accurately if I understand it
> correctly.
> > I think these two performance data are both important. Maybe we can
> > add your test methodology as one of performance aspect for DMA in the
> future, I need to reconsider it and get back to you later.
> >
> 
> Max bandwidth of HW is a third and separate number from that of offload-
> cost and latency. Again, it should be measured and reported separately if you
> want the app to provide it.
OK, got it. We will try to implement such test method in the next version.
Thanks.
Cheng
> 
> Regards,
> 
> /Bruce
    
    
More information about the dev
mailing list