[dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy

Thomas Monjalon thomas at monjalon.net
Thu Oct 19 10:33:36 CEST 2017


19/10/2017 09:51, Li, Xiaoyun:
> From: Thomas Monjalon [mailto:thomas at monjalon.net]
> > 19/10/2017 04:45, Li, Xiaoyun:
> > > Hi
> > > > > >
> > > > > > The significant change of this patch is to call a function
> > > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > > The perf drop is due to a function call replacing the inlined code.
> > > > >
> > > > > > Please could you provide some benchmark numbers?
> > > > > I ran memcpy_perf_test, which shows the time cost of memcpy. I ran
> > > > > it on Broadwell with SSE and AVX2.
> > > > > But I only drew pictures and looked at the trend rather than
> > > > > computing the exact percentages. Sorry about that.
> > > > > The picture shows results for copy sizes of 2, 4, 6, 8, 9, 12, 16,
> > > > > 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522,
> > > > > 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144,
> > > > > 6656, 7168, 7680, 8192.
> > > > > In my test, as the size grows, the drop shrinks. (Copy time is used
> > > > > as the perf metric.) From the trend picture, when the size is
> > > > > smaller than 128 bytes, the perf drops a lot, almost 50%. Above 128
> > > > > bytes, it approaches the original DPDK.
> > > > > I have now computed it: between 128 bytes and 1024 bytes the perf
> > > > > drops about 15%, and above 1024 bytes it drops about 4%.
> > > > >
> > > > > > From a test done at Mellanox, there might be a performance
> > > > > > degradation of about 15% in testpmd txonly with AVX2.
> > > >
> > >
> > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > > performance degradation.
> > >
> > > I used the command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -i"
> > > and set fwd txonly.
> > > I tested it on v17.11-rc1, then reverted my patch and tested again.
> > > I ran "show port stats all" and checked the throughput in pps, but the
> > > results are similar and show no drop.
> > >
> > > Did I miss something?
> > 
> > I do not understand. Yesterday you confirmed a 15% drop with buffers
> > between 128 and 1024 bytes.
> > But you do not see this drop in your txonly tests, right?
> > 
> Yes. The drop is seen with the test application.
> Run "make test -j" and then "./build/app/test -c f -n 4",
> then run "memcpy_perf_autotest".
> The results are the number of cycles that the memory copy costs.
> But I only use it to show the trend, because I have heard that
> micro-benchmarks like test_memcpy_perf are not recommended for memcpy
> performance reports, as they are unlikely to reflect the performance of
> real-world applications.

Yes, real applications can hide the memcpy cost.
Sometimes the cost appears for real :)
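
For reference, the kind of measurement behind those cycle numbers looks
roughly like this (a minimal sketch, not the real test_memcpy_perf code; the
function name, iteration count and output format below are mine):

/* Minimal sketch of a cycles-per-copy measurement, not the real
 * test_memcpy_perf.c; iteration count and reporting are arbitrary. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <rte_cycles.h>   /* rte_rdtsc() */
#include <rte_memcpy.h>   /* rte_memcpy() */

static void
time_one_size(uint8_t *dst, const uint8_t *src, size_t size)
{
        const unsigned int iters = 100000;
        uint64_t start, cycles;
        unsigned int i;

        start = rte_rdtsc();
        for (i = 0; i < iters; i++)
                rte_memcpy(dst, src, size);
        cycles = rte_rdtsc() - start;

        printf("size %zu: %.1f cycles/copy\n", size, (double)cycles / iters);
}

A loop like this measures nothing but the copy itself, which is exactly why
it exaggerates the cost of the extra function call compared to a real
application.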

> Details can be seen at https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk
> 
> And I didn't see a drop in the testpmd txonly test. Maybe that is because
> it does not make many memcpy calls.

It has been seen in a mlx4 use case which uses more memcpy.
I think 15% in a micro-benchmark is too much.
What can we do? Raise the threshold?
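
To make the trade-off concrete, the pattern under discussion is roughly the
following (a minimal sketch, not the code of the patch; RTE_X86_MEMCPY_THRESH
and the run-time-selected function pointer come from this thread, while the
names and the use of plain memcpy as a stand-in for the SSE/AVX variants are
illustrative):

/* Illustrative sketch of threshold-based run-time dispatch;
 * not the code from the patch itself. */
#include <stddef.h>
#include <string.h>

#define RTE_X86_MEMCPY_THRESH 128

/* Stand-in for the pointer selected once at startup according to the CPU
 * features (e.g. SSE vs AVX2); plain memcpy is only a placeholder here. */
static void *(*memcpy_ptr_sketch)(void *dst, const void *src, size_t n) = memcpy;

static inline void *
memcpy_dispatch_sketch(void *dst, const void *src, size_t n)
{
        if (n <= RTE_X86_MEMCPY_THRESH)
                return memcpy(dst, src, n);    /* small sizes: inline path (placeholder) */
        return memcpy_ptr_sketch(dst, src, n); /* large sizes: one indirect call */
}

Raising the threshold only moves the boundary between the two branches: more
mid-size copies would stay on the inline path and avoid the indirect call, at
the cost of not using the CPU-specific implementation for them.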

> > > > Another thing: I will test testpmd txonly with Intel NICs and
> > > > Mellanox in the coming days, and try adjusting RTE_X86_MEMCPY_THRESH
> > > > to see if there is any improvement.
> > > >
> > > > > > Is there someone else seeing a performance degradation?


