[dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Arvind Narayanan webguru2688 at gmail.com
Tue Sep 11 10:15:27 CEST 2018


I am writing a DPDK application and finding it difficult to achieve
line rate on a 10G NIC. I suspect this has something to do with CPU caches
and related optimizations, and would be grateful if someone could point me
in the right direction.

I wrap every rte_mbuf in my own structure, say my_packet. Here is
my_packet's structure declaration:

struct my_packet {
 struct rte_mbuf * m;
 uint16_t tag1;
 uint16_t tag2;
};

During initialization, I reserve a mempool of struct my_packet objects with
8192 elements. I get my_packet objects from the pool in bursts, and I
likewise put them back into the pool in bursts when freeing.

The datapath has a loop that reads the tag of each of these my_packet
objects to make a decision.

for (i = 0; i < pkt_count; i++) {
    if (rte_hash_lookup_data(rx_table, &(my_packet[i]->tag1),
                             (void **)&val[i]) < 0) {
        /* miss handling omitted */
    }
}

Based on my tests, the access to my_packet[i]->tag1 is what keeps me from
reaching line rate in the fast path: if I hardcode tag1's value, I achieve
line rate. As a workaround, I tried rte_prefetch0() and
rte_prefetch_non_temporal() to prefetch 2 to 8 my_packet objects from the
my_packet[] array, but nothing seems to boost the throughput.
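
One detail worth checking in the prefetch attempt: the loop dereferences
my_packet[i], so my_packet[] holds pointers, and a prefetch only helps if it
targets the object pointed to rather than the array slot. The sketch below
shows the usual prefetch-ahead pattern in plain C so that it compiles
without DPDK. It uses the GCC/Clang builtin __builtin_prefetch() as a
stand-in for rte_prefetch0(); PREFETCH_OFFSET, process_burst() and
consume_tag() (a placeholder for the hash lookup) are hypothetical names,
and the opaque m pointer replaces struct rte_mbuf *. It is an illustration
under those assumptions, not a guaranteed fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the structure in the post; the mbuf pointer is kept opaque. */
struct my_packet {
    void *m;            /* struct rte_mbuf * in the real application */
    uint16_t tag1;
    uint16_t tag2;
};

/* Hypothetical prefetch distance; the post mentions trying 2 to 8. */
#define PREFETCH_OFFSET 4

/* Hypothetical placeholder for the hash lookup: it only reads the tag. */
static inline uint32_t consume_tag(const uint16_t *tag)
{
    return *tag;
}

/* Reads tag1 of every packet while prefetching PREFETCH_OFFSET objects ahead. */
static uint32_t process_burst(struct my_packet **pkts, unsigned pkt_count)
{
    uint32_t acc = 0;
    unsigned i;

    /* Warm-up: request the first few objects before the main loop starts. */
    for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
        __builtin_prefetch(pkts[i], 0, 3);

    for (i = 0; i < pkt_count; i++) {
        /* Prefetch the pointed-to object, not the pointer array slot. */
        if (i + PREFETCH_OFFSET < pkt_count)
            __builtin_prefetch(pkts[i + PREFETCH_OFFSET], 0, 3);

        acc += consume_tag(&pkts[i]->tag1);
    }
    return acc;
}
```

In a DPDK application the two __builtin_prefetch() calls would become
rte_prefetch0(pkts[...]). The best distance depends on how much work each
iteration does, so it is worth measuring several values.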

I also experimented with the flags passed to rte_mempool_create():
-- MEMPOOL_F_NO_SPREAD gives 8.4G of throughput out of 10G
-- MEMPOOL_F_NO_CACHE_ALIGN initially gives ~9.4G but then gradually
settles to ~8.5G after 20 or 30 seconds
-- no flags gives 7.7G
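
To reason about why the flags could matter, it may help to estimate how
many cache lines the tag reads touch under different object strides. The
sketch below is plain C arithmetic, not DPDK code. It assumes 64-byte cache
lines, a line-aligned base address, and sizeof(struct my_packet) == 16
(typical for a 64-bit build); lines_touched() is a hypothetical helper.
Real mempool objects also carry a per-object header and optional padding,
so the actual strides are larger than the raw struct size.

```c
#include <stddef.h>

/*
 * Number of distinct cache lines touched when one small field is read
 * from each of `count` objects laid out `stride` bytes apart.
 * Assumes the first object starts on a cache-line boundary and that the
 * field itself never straddles two lines.
 */
static size_t lines_touched(size_t count, size_t stride, size_t line_size)
{
    if (stride >= line_size)
        return count;   /* every object lands on its own line */

    /* Objects share lines: round the total byte span up to whole lines. */
    return (count * stride + line_size - 1) / line_size;
}
```

With 8192 objects, a 16-byte stride gives 2048 lines (128 KiB), while a
stride of 64 bytes or more gives 8192 lines (512 KiB). That difference is
one possible reason the cache-alignment flag changes the numbers.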

I am running DPDK 18.05 on Ubuntu 16.04.3 LTS.

Any help or pointers are highly appreciated.

