[PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy

Maxime Coquelin maxime.coquelin at redhat.com
Wed Jun 26 17:24:04 CEST 2024



On 6/26/24 16:58, Stephen Hemminger wrote:
> On Wed, 26 Jun 2024 10:37:31 +0200
> Maxime Coquelin <maxime.coquelin at redhat.com> wrote:
> 
>> On 6/25/24 21:27, Mattias Rönnblom wrote:
>>> On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
>>>> Hi Mattias,
>>>>
>>>> On 6/20/24 19:57, Mattias Rönnblom wrote:
>>>>> This patch set makes DPDK library, driver, and application code use the
>>>>> compiler/libc memcpy() by default when functions in <rte_memcpy.h> are
>>>>> invoked.
>>>>>
>>>>> The various custom DPDK rte_memcpy() implementations may be retained
>>>>> by means of a build-time option.
>>>>>
>>>>> This patch set only makes a difference on x86, PPC and ARM. LoongArch
>>>>> and RISC-V already use the compiler/libc memcpy().
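For readers following along, the delegation itself is conceptually simple.
A minimal sketch, assuming the use_cc_memcpy build option maps to a define
such as RTE_USE_CC_MEMCPY (the names below are illustrative, not taken from
the actual patch):

#include <string.h>

#ifdef RTE_USE_CC_MEMCPY
/* Forward to the compiler/libc memcpy() and let it do the optimization. */
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}
#else
/* Otherwise keep the hand-written, per-architecture implementation. */
#include "rte_memcpy_internal.h" /* placeholder name */
#endif
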
>>>>
>>>> It indeed makes a difference on x86!
>>>>
>>>> Just tested latest main with and without your series on
>>>> Intel(R) Xeon(R) Gold 6438N.
>>>>
>>>> The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
>>>> # dpdk-testpmd -l 4-6   --file-prefix=virtio1 --no-pci --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
>>>> --single-file-segments -- -i
>>>> testpmd> start
>>>>
>>>> # dpdk-testpmd -l 8-10   --file-prefix=vhost1 --no-pci --vdev
>>>> 'net_vhost0,iface=vhost-net,client=1'   --single-file-segments -- -i
>>>> testpmd> start tx_first 32
>>>>
>>>> Latest main: 14.5Mpps
>>>> Latest main + this series: 10Mpps
>>>>   
>>>
>>> I ran the above benchmark on my Raptor Lake desktop (locked to 3.2
>>> GHz). GCC 12.3.0.
>>>
>>> Core use_cc_memcpy Mpps
>>> E    false         9.5
>>> E    true          9.7
>>> P    false         16.4
>>> P    true          13.5
>>>
>>> On the P-cores, there's a significant performance regression, although
>>> not as bad as the one you see on your Sapphire Rapids Xeon. On the
>>> E-cores, there's actually a slight performance gain.
>>>
>>> The virtio PMD does not directly invoke rte_memcpy() or anything else
>>> from <rte_memcpy.h>, but rather uses memcpy(), so I'm not sure I
>>> understand what's going on here. Does the virtio driver delegate some
>>> performance-critical task to some module that in turn uses
>>> rte_memcpy()?
>>
>> This is because Vhost is the bottleneck here, not the Virtio driver.
>> Indeed, the virtqueue memory belongs to the Virtio driver and the
>> descriptor buffers are Virtio's mbufs, so not many memcpy's are done
>> there.
>>
>> Vhost, however, is a heavy memcpy user, as all the descriptor buffers
>> are copied to/from its mbufs.
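
To illustrate the point above (purely a sketch, not the actual lib/vhost
code), each enqueue/dequeue boils down to one payload copy per descriptor,
so the datapath cost is dominated by packet-sized rte_memcpy()/memcpy()
calls:

#include <stdint.h>
#include <string.h>

/* Toy model of the copy pattern; all names are made up for illustration. */
struct toy_desc {
	void *addr;    /* guest buffer address (already translated) */
	uint32_t len;  /* payload length, only known at run time */
};

static void
toy_dequeue_burst(const struct toy_desc *descs, uint8_t **mbuf_data, int n)
{
	for (int i = 0; i < n; i++)
		/* One packet-sized copy per descriptor; with this series
		 * it resolves to a plain libc memcpy() call. */
		memcpy(mbuf_data[i], descs[i].addr, descs[i].len);
}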
> 
> Would be good to know the size (if small, it is inlining that matters, or
> maybe alignment), and to have test results for multiple compiler versions.
> Ideally, feed the results back and update GCC and Clang.

I was testing with GCC 11 on RHEL-9:
gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)

I was using the default packet size, 64B.
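
For what it's worth, in the vhost path the copy length is only known at
run time, which is exactly the case where the choice of memcpy
implementation shows up. A rough illustration of the inlining point:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static inline void
copy_fixed(uint8_t *dst, const uint8_t *src)
{
	/* Compile-time constant size: compilers typically expand this
	 * inline into a handful of loads/stores. */
	memcpy(dst, src, 64);
}

static inline void
copy_runtime(uint8_t *dst, const uint8_t *src, size_t len)
{
	/* Runtime size (the vhost case): this usually ends up as an
	 * out-of-line call into the libc memcpy(). */
	memcpy(dst, src, len);
}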

I don't have time to perform these tests, but if you are willing to do
them, I'll be happy to review the results.

> DPDK doesn't need to be in the business of optimizing the C library.

Certainly, but we already have an optimized version, so there is not
much to do on our side for now. When the C library implementations are
on par, we should definitely use them by default.

Maxime


