[dpdk-dev] [Bug 260] bugDPDK lock-free hash deletion

bugzilla at dpdk.org bugzilla at dpdk.org
Tue Apr 30 08:51:44 CEST 2019
https://bugs.dpdk.org/show_bug.cgi?id=260

            Bug ID: 260
           Summary: bugDPDK lock-free hash deletion
           Product: DPDK
           Version: 18.11
          Hardware: All
                OS: All
            Status: CONFIRMED
          Severity: normal
          Priority: Normal
         Component: other
          Assignee: dev at dpdk.org
          Reporter: zhongdahulinfan at 163.com
  Target Milestone: ---

The lock-free version of the hash table is created with the following flags:
RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF | RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD

so that the hash table supports multiple writers. The
RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD flag lets each lcore allocate key slots
from a local cache:
struct rte_hash *
rte_hash_create(const struct rte_hash_parameters *params)
{
...
    if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) {
        use_local_cache = 1;
        writer_takes_lock = 1;
    }
...
    /* Store all keys and leave the first entry as a dummy entry for lookup_bulk */
    if (use_local_cache)
        /*
         * Increase number of slots by total number of indices
         * that can be stored in the lcore caches
         * except for the first cache
         */
        num_key_slots = params->entries + (RTE_MAX_LCORE - 1) *
                    (LCORE_CACHE_SIZE - 1) + 1;
    else
        num_key_slots = params->entries + 1;
...
}

The benefit is that each writer can allocate key slots from its local cache,
which reduces cache misses and improves hash insertion performance.

First, a note on how rte_hash stores keys and data:

| dummy | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | pdata + key | ...
|   0   |      1      |      2      |      3      |      4      |      5      |      6      |      7      |      8      |
        |<-------------------------------------------------- bucket --------------------------------------------------->|

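The key_store member of struct rte_hash is the array shown above. It holds the
key contents and the data pointers. Index 0 is unused and valid data starts at
index 1. The array is divided into buckets of 8 entries each. There is a reason
for this: rte_hash is implemented with cuckoo hashing and resolves hash
collisions with buckets rather than chaining. Taking a write as an example, the
key is first hashed to its primary bucket, which is then scanned for a free
entry. If one is found the key is inserted there, otherwise the secondary
bucket is tried.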
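
The primary/secondary lookup described above can be sketched as follows. This is a minimal model, not the DPDK implementation: struct toy_bucket, toy_bucket_find_free and toy_insert are hypothetical names, only the key indexes are modelled, and the displacement step that a real cuckoo table performs when both buckets are full is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BUCKET_ENTRIES 8  /* bucket size of 8, as described in the report */
#define TOY_EMPTY_IDX 0       /* key index 0 is reserved, so 0 marks a free entry */

/* Hypothetical, simplified bucket: only the key indexes are modelled. */
struct toy_bucket {
    uint32_t key_idx[TOY_BUCKET_ENTRIES];
};

/* Scan one bucket for a free entry; return its offset, or -1 if it is full. */
static int toy_bucket_find_free(const struct toy_bucket *bkt)
{
    for (int i = 0; i < TOY_BUCKET_ENTRIES; i++)
        if (bkt->key_idx[i] == TOY_EMPTY_IDX)
            return i;
    return -1;
}

/*
 * Try the primary bucket first, then the secondary one.
 * Returns 0 if stored in the primary bucket, 1 if stored in the
 * secondary bucket, -1 if both are full (no displacement in this sketch).
 */
static int toy_insert(struct toy_bucket *prim, struct toy_bucket *sec,
                      uint32_t key_idx)
{
    assert(key_idx != TOY_EMPTY_IDX);

    int i = toy_bucket_find_free(prim);
    if (i >= 0) {
        prim->key_idx[i] = key_idx;
        return 0;
    }
    i = toy_bucket_find_free(sec);
    if (i >= 0) {
        sec->key_idx[i] = key_idx;
        return 1;
    }
    return -1;
}
```

With two empty buckets, the first eight inserts land in the primary bucket, the next eight in the secondary bucket, and the seventeenth fails.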

The structures are defined as follows:

/* Structure that stores key-value pair */
struct rte_hash_key {
    union {
        uintptr_t idata;
        void *pdata;
    };
    /* Variable key size */
    char key[0];
};

/** Bucket structure */
struct rte_hash_bucket {
    uint16_t sig_current[RTE_HASH_BUCKET_ENTRIES];

    uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];

    uint8_t flag[RTE_HASH_BUCKET_ENTRIES];

    void *next;
} __rte_cache_aligned;


The key indexes are then stored in an rte_ring:

struct rte_hash *
rte_hash_create(const struct rte_hash_parameters *params)
{
...
    /* Populate free slots ring. Entry zero is reserved for key misses. */
    for (i = 1; i < num_key_slots; i++)
        rte_ring_sp_enqueue(r, (void *)((uintptr_t) i));
...
}

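With the lock-free hash table, deleting an entry defers reclaiming the memory of
its key and value, so that multiple readers can access the table without
locking, which greatly reduces the performance cost for multi-core
applications. When we call the DPDK API rte_hash_del_key_xxx to delete an
entry, the entry is only unlinked from the hash table and the call returns
position, the storage location of the key, which corresponds to the key index
described above. Once all readers/writers that reference the entry have dropped
their references, the storage for the key and value is reclaimed using
position. The key is reclaimed through the following interface: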

int __rte_experimental
rte_hash_free_key_with_position(const struct rte_hash *h,
                const int32_t position)
{
    RETURN_IF_TRUE(((h == NULL) || (position == EMPTY_SLOT)), -EINVAL);

    unsigned int lcore_id, n_slots;
    struct lcore_cache *cached_free_slots;
    const int32_t total_entries = h->num_buckets * RTE_HASH_BUCKET_ENTRIES;

    /* Out of bounds */
    if (position >= total_entries)
        return -EINVAL;

    if (h->use_local_cache) {
        lcore_id = rte_lcore_id();
        cached_free_slots = &h->local_free_slots[lcore_id];
        /* Cache full, need to free it. */
        if (cached_free_slots->len == LCORE_CACHE_SIZE) {
            /* Need to enqueue the free slots in global ring. */
            n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
                        cached_free_slots->objs,
                        LCORE_CACHE_SIZE, NULL);
            cached_free_slots->len -= n_slots;
        }
        /* Put index of new free slot in cache. */
        cached_free_slots->objs[cached_free_slots->len] =
                    (void *)((uintptr_t)position);
        cached_free_slots->len++;
    } else {
        rte_ring_sp_enqueue(h->free_slots,
                (void *)((uintptr_t)position));
    }

    return 0;
}

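A close look at the rte_hash implementation shows two problems in the function
above:
1. The "Out of bounds" check
This check is valid when the hash table's use_local_cache flag is not set,
because the number of key indexes then equals the number of hash table entries.
When use_local_cache is true, however, it is no longer correct. Look again at
how rte_hash_create computes the number of key slots: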

/* Store all keys and leave the first entry as a dummy entry for lookup_bulk */
    if (use_local_cache)
        /*
         * Increase number of slots by total number of indices
         * that can be stored in the lcore caches
         * except for the first cache
         */
        num_key_slots = params->entries + (RTE_MAX_LCORE - 1) *
                    (LCORE_CACHE_SIZE - 1) + 1;
    else
        num_key_slots = params->entries + 1;

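In this case, in addition to the key storage for entries keys, extra key slots
are allocated for the lcore caches, (RTE_MAX_LCORE - 1) * (LCORE_CACHE_SIZE - 1)
in total. The number of keys is therefore larger than the hash table's total
entries, so the "Out of bounds" check in rte_hash_free_key_with_position is
wrong.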
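
To make the mismatch concrete, the arithmetic can be sketched as below. The helper names are hypothetical. The values 128 for RTE_MAX_LCORE and 64 for LCORE_CACHE_SIZE used in the note after the block are illustrative assumptions and may differ in a given build. The sketch also assumes entries is a power of two, so that num_buckets * 8 equals entries.

```c
#include <assert.h>
#include <stdint.h>

/* Number of key slots, following the formula quoted from rte_hash_create. */
static uint32_t toy_num_key_slots(uint32_t entries, uint32_t max_lcore,
                                  uint32_t cache_size, int use_local_cache)
{
    assert(max_lcore >= 1 && cache_size >= 1);
    if (use_local_cache)
        return entries + (max_lcore - 1) * (cache_size - 1) + 1;
    return entries + 1;
}

/* Bound used by the "Out of bounds" check: num_buckets * bucket entries. */
static uint32_t toy_total_entries(uint32_t num_buckets, uint32_t bucket_entries)
{
    return num_buckets * bucket_entries;
}

/*
 * Valid positions are key_idx - 1 for key_idx in 1 .. num_key_slots - 1.
 * Returns how many of them satisfy position >= total_entries and would
 * therefore be rejected by the check.
 */
static uint32_t toy_rejected_positions(uint32_t num_key_slots,
                                       uint32_t total_entries)
{
    uint32_t valid_positions = num_key_slots - 1;

    if (valid_positions > total_entries)
        return valid_positions - total_entries;
    return 0;
}
```

For entries = 1024 with the assumed constants, the local-cache case yields 1024 + 127 * 63 + 1 = 9026 key slots against a bound of 1024, so 8001 otherwise valid positions would be rejected. Without the local cache no valid position is rejected.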
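2. The position parameter
position is the return value of rte_hash_del_key_xxx(), which is (key_idx - 1).
As noted earlier, key index 0 is unused and valid indexes start at 1. Enqueuing
position directly onto the free_slots ring is therefore incorrect: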
rte_ring_sp_enqueue(h->free_slots,
                (void *)((uintptr_t)position));
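This can leave several entries with the value 0 in the ring, which corrupts it.
Take a ring of size 4 as an example:
Initial ring state:
| 1 | 2 | 3 | 4 |
After all key_idx values are dequeued, the returned positions are 0, 1, 2, 3,
which are then enqueued back onto the ring.
After one dequeue/enqueue round:
| 0 | 1 | 2 | 3 |
After all key_idx values are dequeued, the returned positions are 0, 1, 2
(key_idx 0 is invalid and not used), which are then enqueued back onto the ring.
After two dequeue/enqueue rounds:
| 0 | 0 | 1 | 2 |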
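
The first round of the example can be reproduced with a small model. This is only a sketch: struct toy_ring is a hypothetical plain array standing in for the free_slots ring, and toy_round compresses "allocate every key, delete it, free it by position" into one loop. The add_one argument switches between enqueuing position as reported and enqueuing position + 1.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 4

/* Hypothetical stand-in for the free_slots ring: a plain array used as a FIFO. */
struct toy_ring {
    uint32_t slot[TOY_RING_SIZE];
};

/*
 * One allocate-then-free round. Every key_idx is dequeued, the delete path
 * reports position = key_idx - 1, and the free path puts a value back.
 * add_one == 0 enqueues position (the reported behaviour),
 * add_one != 0 enqueues position + 1.
 */
static void toy_round(struct toy_ring *r, int add_one)
{
    for (int i = 0; i < TOY_RING_SIZE; i++) {
        uint32_t key_idx = r->slot[i];

        assert(key_idx != 0); /* index 0 is the reserved dummy entry */
        uint32_t position = key_idx - 1;
        r->slot[i] = add_one ? position + 1 : position;
    }
}

/* Count ring entries that hold the reserved index 0. */
static int toy_count_reserved(const struct toy_ring *r)
{
    int n = 0;

    for (int i = 0; i < TOY_RING_SIZE; i++)
        if (r->slot[i] == 0)
            n++;
    return n;
}
```

Starting from | 1 | 2 | 3 | 4 |, one round with add_one == 0 leaves | 0 | 1 | 2 | 3 |, so the reserved index 0 is now handed out and index 4 is lost. With add_one != 0 the ring is unchanged after any number of rounds.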

By comparison, the non-lock-free key reclamation path, remove_entry(), enqueues
key_idx directly, so it does not have this problem:
static inline void
remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
{
    unsigned lcore_id, n_slots;
    struct lcore_cache *cached_free_slots;

    if (h->use_local_cache) {
        lcore_id = rte_lcore_id();
        cached_free_slots = &h->local_free_slots[lcore_id];
        /* Cache full, need to free it. */
        if (cached_free_slots->len == LCORE_CACHE_SIZE) {
            /* Need to enqueue the free slots in global ring. */
            n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
                        cached_free_slots->objs,
                        LCORE_CACHE_SIZE, NULL);
            cached_free_slots->len -= n_slots;
        }
        /* Put index of new free slot in cache. */
        cached_free_slots->objs[cached_free_slots->len] =
                (void *)((uintptr_t)bkt->key_idx[i]);
        cached_free_slots->len++;
    } else {
        rte_ring_sp_enqueue(h->free_slots,
                (void *)((uintptr_t)bkt->key_idx[i]));
    }
}

Proposed fix
1. Correct the "Out of bounds" check.
2. Add 1 to position before enqueuing it back onto the free_slots ring.
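
One way the two changes could fit together is sketched below. This is not a DPDK patch: toy_position_to_key_idx is a hypothetical helper, and it assumes the caller has the key-slot count (num_key_slots, including the dummy entry 0) available, which the quoted code does not show.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of DPDK: validate a position returned by a
 * delete call and convert it to the key index that should go back onto the
 * free-slots ring. num_key_slots is the size of the key store, including
 * the dummy entry at index 0.
 */
static int toy_position_to_key_idx(int32_t position, uint32_t num_key_slots,
                                   uint32_t *key_idx)
{
    assert(key_idx != NULL);

    /* Valid positions are 0 .. num_key_slots - 2 (key_idx 1 .. num_key_slots - 1). */
    if (position < 0 || (uint32_t)position + 1 >= num_key_slots)
        return -EINVAL;

    /* The delete path reports key_idx - 1, so add 1 before enqueuing. */
    *key_idx = (uint32_t)position + 1;
    return 0;
}
```

The bound is taken from the key-slot count rather than from num_buckets * RTE_HASH_BUCKET_ENTRIES, so positions that refer to the extra slots reserved for the lcore caches are accepted.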

-- 
You are receiving this mail because:
You are the assignee for the bug.