[PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy

Mattias Rönnblom hofors at lysator.liu.se
Wed Jun 26 20:47:56 CEST 2024
Previous message (by thread): [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Next message (by thread): [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
> 
> 
> On 6/26/24 16:58, Stephen Hemminger wrote:
> > On Wed, 26 Jun 2024 10:37:31 +0200
> > Maxime Coquelin <maxime.coquelin at redhat.com> wrote:
> > 
> > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > Hi Mattias,
> > > > > 
> > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > This patch set make DPDK library, driver, and application code use the
> > > > > > compiler/libc memcpy() by default when functions in <rte_memcpy.h> are
> > > > > > invoked.
> > > > > > 
> > > > > > The various custom DPDK rte_memcpy() implementations may be retained
> > > > > > by means of a build-time option.
> > > > > > 
> > > > > > This patch set only make a difference on x86, PPC and ARM. Loongarch
> > > > > > and RISCV already used compiler/libc memcpy().
> > > > > 
> > > > > It indeed makes a difference on x86!
> > > > > 
> > > > > Just tested latest main with and without your series on
> > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > 
> > > > > The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
> > > > > # dpdk-testpmd -l 4-6   --file-prefix=virtio1 --no-pci --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
> > > > > --single-file-segments -- -i
> > > > > testpmd> start
> > > > > 
> > > > > # dpdk-testpmd -l 8-10   --file-prefix=vhost1 --no-pci --vdev
> > > > > 'net_vhost0,iface=vhost-net,client=1'   --single-file-segments -- -i
> > > > > testpmd> start tx_first 32
> > > > > 
> > > > > Latest main: 14.5Mpps
> > > > > Latest main + this series: 10Mpps
> > > > 
> > > > I ran the above benchmark on my Raptor Lake desktop (locked to 3,2
> > > > GHz). GCC 12.3.0.
> > > > 
> > > > Core use_cc_memcpy Mpps
> > > > E    false         9.5
> > > > E    true          9.7
> > > > P    false         16.4
> > > > P    true          13.5
> > > > 
> > > > On the P-cores, there's a significant performance regression, although
> > > > not as bad as the one you see on your Sapphire Rapids Xeon. On the
> > > > E-cores, there's actually a slight performance gain.
> > > > 
> > > > The virtio PMD does not directly invoke rte_memcpy() or anything else
> > > > from <rte_memcpy.h>, but rather use memcpy(), so I'm not sure I
> > > > understand what's going on here. Does the virtio driver delegate some
> > > > performance-critical task to some module that in turns uses
> > > > rte_memcpy()?
> > > 
> > > This is because Vhost is the bottleneck here, not Virtio driver.
> > > Indeed, the virtqueues memory belongs to the Virtio driver and the
> > > descriptors buffers are Virtio's mbufs, so not much memcpy's are done
> > > there.
> > > 
> > > Vhost however, is a heavy memcpy user, as all the descriptors buffers
> > > are copied to/from its mbufs.
> > 
> > Would be good to now the size (if small it is inlining that matters, or
> > maybe alignment matters), and have test results for multiple compiler versions.
> > Ideally, feed results back and update Gcc and Clang.
> 
> I was testing with GCC 11 on RHEL-9:
> gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> 
> I was using the default one, 64B packets.
> 
> I don't have time to perform these tests, but if you are willing to do
> it I'll be happy to review the results.
> 
> > DPDK doesn't need to be in the optimize C library space.
> 
> Certainly, but we already have an optimized version currently, so not
> much to do now on our side. When C libraries implementations will be on
> par, we should definitely use them by default.
>

I think it's not so much about optimized versus non-optimized at this
point. It's just that cc/libc memcpy sometimes performs better than
RTE memcpy, and sometimes doesn't.

For virtio, a single memory copy in
lib/vhost/virtio_net.c:do_data_copy_enqueue()
is responsible for >95% of the performance regression introduced by
the cc memcpy patch for small packets on Intel P-cores.

I'm not so sure this performance regression will go away in newer
compilers. PGO would certainly help, but PGO is a hassle.

One way to fix this issue would be to introduce a custom,
memcpy()-based packet copying routine. I tried the below patch, with
the following results:

Raptor Lake @ 3,2 GHz
GCC 12

64 bytes packets
Core  Mode              Mpps
----------------------------
E     RTE memcpy        9.5
E     cc memcpy         9.7
E     cc memcpy+pktcpy  9.0

P     RTE memcpy        16.4
P     cc memcpy         13.5
P     cc memcpy+pktcpy  16.2

1500 bytes
Core  Mode              Mpps
----------------------------
P    RTE memcpy         5.8
P    cc memcpy          5.9
P    cc memcpy+pktcpy   5.9

As you can see, most of the regression is eliminated, at the cost of
worse E-core performance. I didn't look at the generated code, but one
could suspect heavy use of wide SIMD is to blame, which E-cores don't
necessarily benefit from.

The below prototype assumes the source and destination buffers are
16-byte aligned. Does that always hold?

I'm sure one could further improve performance using context-specific
information, such as packets always being >= 64 bytes. One could also
consider having special cases, maybe for 64 bytes and MTU-sized
packets. Such are always a hassle when you try to characterize
performance though.

diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 370402d849..7b595a6622 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
 	return nr_copies;
 }
 
+static inline void
+pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
+{
+	void *dst = __builtin_assume_aligned(in_dst, 16);
+	const void *src = __builtin_assume_aligned(in_src, 16);
+
+	if (len <= 256) {
+		size_t left;
+
+		for (left = len; left >= 32; left -= 32) {
+			memcpy(dst, src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+			src = RTE_PTR_ADD(src, 32);
+		}
+
+		memcpy(dst, src, left);
+	} else
+		memcpy(dst, src, len);
+}
+
 static inline void
 do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	__rte_shared_locks_required(&vq->iotlb_lock)
@@ -240,7 +260,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	int i;
 
 	for (i = 0; i < count; i++) {
-		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
+		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
 		vhost_log_cache_write_iova(dev, vq, elem[i].log_addr,
 					   elem[i].len);
 		PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
@@ -257,7 +277,7 @@ do_data_copy_dequeue(struct vhost_virtqueue *vq)
 	int i;
 
 	for (i = 0; i < count; i++)
-		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
+		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
 
 	vq->batch_copy_nb_elems = 0;
 }



> Maxime
>
Previous message (by thread): [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Next message (by thread): [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list