[dpdk-dev] [Bug 260] bugDPDK lock-free hash deletion

bugzilla at dpdk.org bugzilla at dpdk.org
Tue Apr 30 08:51:44 CEST 2019


https://bugs.dpdk.org/show_bug.cgi?id=260

            Bug ID: 260
           Summary: bugDPDK lock-free hash deletion
           Product: DPDK
           Version: 18.11
          Hardware: All
                OS: All
            Status: CONFIRMED
          Severity: normal
          Priority: Normal
         Component: other
          Assignee: dev at dpdk.org
          Reporter: zhongdahulinfan at 163.com
  Target Milestone: ---

lock-free版本的哈希表,在创建的时候指定了flag:
RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF | RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD

以使哈希表支持multi-writer,RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD标记使得每个lcore都可以使用local
cache分配key slot:
struct rte_hash *
rte_hash_create(const struct rte_hash_parameters *params)
{
...
    if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) {
        use_local_cache = 1;
        writer_takes_lock = 1;
    }
...
    /* Store all keys and leave the first entry as a dummy entry for
lookup_bulk */
    if (use_local_cache)
        /*
         * Increase number of slots by total number of indices
         * that can be stored in the lcore caches
         * except for the first cache
         */
        num_key_slots = params->entries + (RTE_MAX_LCORE - 1) *
                    (LCORE_CACHE_SIZE - 1) + 1;
    else
        num_key_slots = params->entries + 1;
...
}

这么做的好处是,每个写者可以从本地cache分配key slot,可减少cache miss,提升哈希插入的性能。

这里要先说一下rte_hash的key和data的存储的结构:

| dummy | pdata + key  | pdata + key | pdata + key  | pdata + key | pdata + key
 | pdata + key | pdata + key  | pdata + key |...
       0      |            1                      2                    3       
              4                    5                      6                    
7                     8      |
               |  <--------------                                              
             bucket                                                            
     ---------->   |

struct rte_hash里的成员key_store就是上述数组,存放key的内容和data指针,其中 index 0没被使用,有效数据从index 1
开始。该数组被划分成若干个bucket,每个bucket的大小为8。这么做是有原因的,rte_hash使用cuckoo
哈希实现,引入bucket解决哈希冲突,而非开链。以写为例,在写哈希表的时候,先哈希到primary
bucket,再循环遍历该bucket找到存储位置,如若有空位则插入,否则继续找secondary bucket。

结构体定义如下:

/* Structure that stores key-value pair */
struct rte_hash_key {
    union {
        uintptr_t idata;
        void *pdata;
    };
    /* Variable key size */
    char key[0];
};

/** Bucket structure */
struct rte_hash_bucket {
    uint16_t sig_current[RTE_HASH_BUCKET_ENTRIES];

    uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];

    uint8_t flag[RTE_HASH_BUCKET_ENTRIES];

    void *next;
} __rte_cache_aligned;


然后这些key index,用一个rte_ring来存储:

struct rte_hash *
rte_hash_create(const struct rte_hash_parameters *params)
{
...
    /* Populate free slots ring. Entry zero is reserved for key misses. */
    for (i = 1; i < num_key_slots; i++)
        rte_ring_sp_enqueue(r, (void *)((uintptr_t) i));
...
}

当使用lock-free哈希表时,删除一个表项的时候key和value采用延后回收内存的策略,使得multi-readers访问哈希表不需要加锁,大大减少多核应用程序的性能损耗。我们调用dpdk
API rte_hash_del_key_xxx删除表项时,只将表项从哈希表中摘除,返回一个key的存储位置position,就是上述所说的key
index,然后在所有引用该表项的readers/writers都退出引用后,再根据position回收key和value的存储空间。key的回收调用一下接口:

int __rte_experimental
rte_hash_free_key_with_position(const struct rte_hash *h,
                const int32_t position)
{
    RETURN_IF_TRUE(((h == NULL) || (position == EMPTY_SLOT)), -EINVAL);

    unsigned int lcore_id, n_slots;
    struct lcore_cache *cached_free_slots;
    const int32_t total_entries = h->num_buckets * RTE_HASH_BUCKET_ENTRIES;

    /* Out of bounds */
    if (position >= total_entries)
        return -EINVAL;

    if (h->use_local_cache) {
        lcore_id = rte_lcore_id();
        cached_free_slots = &h->local_free_slots[lcore_id];
        /* Cache full, need to free it. */
        if (cached_free_slots->len == LCORE_CACHE_SIZE) {
            /* Need to enqueue the free slots in global ring. */
            n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
                        cached_free_slots->objs,
                        LCORE_CACHE_SIZE, NULL);
            cached_free_slots->len -= n_slots;
        }
        /* Put index of new free slot in cache. */
        cached_free_slots->objs[cached_free_slots->len] =
                    (void *)((uintptr_t)position);
        cached_free_slots->len++;
    } else {
        rte_ring_sp_enqueue(h->free_slots,
                (void *)((uintptr_t)position));
    }

    return 0;
}

仔细看rte_hash的实现,不难发现上述函数有两个地方存在问题:
1、"Out of bounds" 的判断逻辑
这个 "Out of bounds" 的判断逻辑,在哈希表的use_local_cache标识没被置为的时候是成立的,key
index的数量恰好是哈希表entries的数量。但当use_local_cache为真时,它就不正确了。回看一下创建哈希表的函数rte_hash_create,其中key
slots的计算:

/* Store all keys and leave the first entry as a dummy entry for lookup_bulk */
    if (use_local_cache)
        /*
         * Increase number of slots by total number of indices
         * that can be stored in the lcore caches
         * except for the first cache
         */
        num_key_slots = params->entries + (RTE_MAX_LCORE - 1) *
                    (LCORE_CACHE_SIZE - 1) + 1;
    else
        num_key_slots = params->entries + 1;

这时候除了分配entries个key内存空间,还要给每个lcore分配LCORE_CACHE_SIZE数量的key空间,那么此时key的数量是会大于哈希表的total
entry的,所以 rte_hash_free_key_with_position的"Out of bounds"判断逻辑有误。
2、position参数问题
position是rte_hash_del_key_xxx()的返回值,为 (key_idx - 1)。前面说过key的index
0未被使用,从1开始有效。那么直接将position enqueue到free_slot队列是不正确的:
rte_ring_sp_enqueue(h->free_slots,
                (void *)((uintptr_t)position));
这会导致ring队列里可能存在多个值为0的position,从而损坏ring队列。以ring的size是4为例:
ring初始状态:
|1 | 2 | 3 | 4 |
dequeue完所有key_idx后,返回position为 0,1,2,3,再enqueue回ring队列
一趟dequeue enqueue过后:
| 0 | 1 | 2 | 3 |
dequeue完所有key_idx后,返回position为 0,1,2(其中key_idx 0是无效的,不被使用),再enqueue回ring队列
两趟dequeue enqueue过后:
| 0 | 0 | 1 | 2 |

对比非lock-free的key回收接口,可以发现,remove_entry()是直接将key_indx入队的,所以不存在问题:
static inline void
remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
{
    unsigned lcore_id, n_slots;
    struct lcore_cache *cached_free_slots;

    if (h->use_local_cache) {
        lcore_id = rte_lcore_id();
        cached_free_slots = &h->local_free_slots[lcore_id];
        /* Cache full, need to free it. */
        if (cached_free_slots->len == LCORE_CACHE_SIZE) {
            /* Need to enqueue the free slots in global ring. */
            n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
                        cached_free_slots->objs,
                        LCORE_CACHE_SIZE, NULL);
            cached_free_slots->len -= n_slots;
        }
        /* Put index of new free slot in cache. */
        cached_free_slots->objs[cached_free_slots->len] =
                (void *)((uintptr_t)bkt->key_idx[i]);
        cached_free_slots->len++;
    } else {
        rte_ring_sp_enqueue(h->free_slots,
                (void *)((uintptr_t)bkt->key_idx[i]));
    }
}

修复方法
1、修改"Out of bounds" 的判断逻辑
2、position在enqueue回free_slots队列之前加1

-- 
You are receiving this mail because:
You are the assignee for the bug.


More information about the dev mailing list