[dpdk-dev] [PATCH v2] mempool: improve cache search

Olivier MATZ olivier.matz at 6wind.com
Wed Jul 15 10:56:50 CEST 2015


Hi,

On 07/07/2015 07:17 PM, Zoltan Kiss wrote:
>
>
> On 02/07/15 18:07, Ananyev, Konstantin wrote:
>>
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zoltan Kiss
>>> Sent: Wednesday, July 01, 2015 10:04 AM
>>> To: dev at dpdk.org
>>> Subject: [dpdk-dev] [PATCH v2] mempool: improve cache search
>>>
>>> The current way has a few problems:
>>>
>>> - if cache->len < n, we copy our elements into the cache first, then
>>>    into obj_table, that's unnecessary
>>> - if n >= cache_size (or the backfill fails), and we can't fulfil the
>>>    request from the ring alone, we don't try to combine with the cache
>>> - if refill fails, we don't return anything, even if the ring has enough
>>>    for our request
>>>
>>> This patch rewrites it severely:
>>> - at the first part of the function we only try the cache if
>>> cache->len < n
>>> - otherwise take our elements straight from the ring
>>> - if that fails but we have something in the cache, try to combine them
>>> - the refill happens at the end, and its failure doesn't modify our
>>> return
>>>    value
>>>
>>> Signed-off-by: Zoltan Kiss <zoltan.kiss at linaro.org>
>>> ---
>>> v2:
>>> - fix subject
>>> - add unlikely for branch where request is fulfilled both from cache
>>> and ring
>>>
>>>   lib/librte_mempool/rte_mempool.h | 63
>>> +++++++++++++++++++++++++---------------
>>>   1 file changed, 39 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.h
>>> b/lib/librte_mempool/rte_mempool.h
>>> index 6d4ce9a..1e96f03 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void
>>> **obj_table,
>>>       unsigned lcore_id = rte_lcore_id();
>>>       uint32_t cache_size = mp->cache_size;
>>>
>>> -    /* cache is not enabled or single consumer */
>>> +    cache = &mp->local_cache[lcore_id];
>>> +    /* cache is not enabled or single consumer or not enough */
>>>       if (unlikely(cache_size == 0 || is_mc == 0 ||
>>> -             n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>>> +             cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>>           goto ring_dequeue;
>>>
>>> -    cache = &mp->local_cache[lcore_id];
>>>       cache_objs = cache->objs;
>>>
>>> -    /* Can this be satisfied from the cache? */
>>> -    if (cache->len < n) {
>>> -        /* No. Backfill the cache first, and then fill from it */
>>> -        uint32_t req = n + (cache_size - cache->len);
>>> -
>>> -        /* How many do we require i.e. number to fill the cache +
>>> the request */
>>> -        ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> &cache->objs[cache->len], req);
>>> -        if (unlikely(ret < 0)) {
>>> -            /*
>>> -             * In the offchance that we are buffer constrained,
>>> -             * where we are not able to allocate cache + n, go to
>>> -             * the ring directly. If that fails, we are truly out of
>>> -             * buffers.
>>> -             */
>>> -            goto ring_dequeue;
>>> -        }
>>> -
>>> -        cache->len += req;
>>> -    }
>>> -
>>>       /* Now fill in the response ... */
>>>       for (index = 0, len = cache->len - 1; index < n; ++index,
>>> len--, obj_table++)
>>>           *obj_table = cache_objs[len];
>>> @@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void
>>> **obj_table,
>>>
>>>       __MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> -    return 0;
>>> +    ret = 0;
>>> +    goto cache_refill;
>>>
>>>   ring_dequeue:
>>>   #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> @@ -994,11 +975,45 @@ ring_dequeue:
>>>       else
>>>           ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +    if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
>>> +        uint32_t req = n - cache->len;
>>> +
>>> +        ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>>> +        if (ret == 0) {
>>> +            cache_objs = cache->objs;
>>> +            obj_table += req;
>>> +            for (index = 0; index < cache->len;
>>> +                 ++index, ++obj_table)
>>> +                *obj_table = cache_objs[index];
>>> +            cache->len = 0;
>>> +        }
>>> +    }
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>       if (ret < 0)
>>>           __MEMPOOL_STAT_ADD(mp, get_fail, n);
>>>       else
>>>           __MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +cache_refill:
>>
>> Ok, so if I get things right: if the lcore runs out of entries in cache,
>> then on next __mempool_get_bulk() it has to do ring_dequeue() twice:
>> 1. to satisfy user request
>> 2. to refill the cache.
>> Right?
> Yes.
>
>> If that is so, then I think the current approach
>> (ring_dequeue() once to refill the cache, then copy entries from the
>> cache to the user)
>> is cheaper (faster) in many cases.
> But then you can't return anything if the refill fails, even if there
> would be enough in the ring (or ring+cache combined). Unless you retry
> with just n.
> __rte_ring_mc_do_dequeue is inlined, as far as I see the overhead of
> calling twice is:
> - check the number of entries in the ring, and atomic cmpset of
> cons.head again. This can loop if another dequeue preceded us while
> doing that subtraction, but as that's a very short interval, I think
> it's not very likely
> - an extra rte_compiler_barrier()
> - wait for preceding dequeues to finish, and set cons.tail to the new
> value. I think this can happen often when 'n' has a big variation, so
> the previous dequeue can be easily much bigger
> - statistics update
>
> I guess if there is no contention on the ring, the extra memcpy
> easily outweighs these. And my gut feeling says that contention around
> the two while loops should not be high, but I don't have hard facts.
> Another argument for doing two dequeues is that we can do a burst
> dequeue for the cache refill, which is better than only accepting the
> full amount.
>
> How about the following?
> If the cache can't satisfy the request, we do a dequeue from the ring to
> the cache for n + cache_size, but with rte_ring_mc_dequeue_burst. So it
> takes as many as it can, but doesn't fail if it can't take the whole.
> Then we copy from cache to obj_table, if there is enough.
> It makes sure we utilize as much as possible, with one ring dequeue.

Will it be possible to dequeue "n + cache_size" objects?
I think it would require allocating some space to store the object
pointers, right? I don't think it's a good idea to use a dynamic local
table (or alloca()) whose size depends on n.
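
To illustrate the storage concern, here is a minimal sketch. It is a toy
model, not the rte_mempool code: struct toy_cache, TOY_CACHE_CAPACITY and
the toy_* helpers are hypothetical names, and the capacity of 8 is an
arbitrary value chosen for the example. It only shows that a burst of up
to n + cache_size pointers written at objs[len] needs
len + n + cache_size slots, which a fixed-size array cannot guarantee
when n is chosen by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed capacity of the per-lcore pointer array. */
#define TOY_CACHE_CAPACITY 8

/* Toy stand-in for a per-lcore cache: a length plus a fixed array. */
struct toy_cache {
	unsigned len;                   /* pointers currently stored */
	void *objs[TOY_CACHE_CAPACITY]; /* fixed-size storage */
};

/*
 * Slots that must exist in objs[] if a burst dequeue of up to
 * n + cache_size pointers is written starting at objs[len].
 */
static unsigned
toy_slots_needed(const struct toy_cache *c, unsigned n, unsigned cache_size)
{
	return c->len + n + cache_size;
}

/* Return 1 if the worst-case burst fits in the fixed array, else 0. */
static int
toy_burst_fits(const struct toy_cache *c, unsigned n, unsigned cache_size)
{
	return toy_slots_needed(c, n, cache_size) <= TOY_CACHE_CAPACITY;
}

/*
 * One way to avoid dynamic storage: clamp the burst size to the room
 * left in the array. The request may then be only partly covered.
 */
static unsigned
toy_clamped_request(const struct toy_cache *c, unsigned n, unsigned cache_size)
{
	unsigned want = n + cache_size;
	unsigned room;

	assert(c->len <= TOY_CACHE_CAPACITY);
	room = TOY_CACHE_CAPACITY - c->len;
	return want < room ? want : room;
}
```

With len = 1 and cache_size = 4, a request of n = 2 needs 7 slots and
fits, while n = 6 needs 11 slots and does not. Clamping would limit the
second burst to 7 pointers, so either the request is capped or extra
storage sized by n has to come from somewhere.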



>
>
>
>
>> Especially when same pool is shared between multiple threads.
>> For example when thread is doing RX only (no TX).
>>
>>
>>> +    /* If previous dequeue was OK and we have less than n, start
>>> refill */
>>> +    if (ret == 0 && cache_size > 0 && cache->len < n) {
>>> +        uint32_t req = cache_size - cache->len;
>>
>>
>> It could be that n > cache_size.
>> In that case, there is probably no point in refilling the cache, as
>> you took entries from the ring
>> and the cache was left intact.
>
> Yes, it makes sense to add.
>>
>> Konstantin
>>
>>> +
>>> +        cache_objs = cache->objs;
>>> +        ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> +                           &cache->objs[cache->len],
>>> +                           req);
>>> +        if (likely(ret == 0))
>>> +            cache->len += req;
>>> +        else
>>> +            /* Don't spoil the return value */
>>> +            ret = 0;
>>> +    }
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>       return ret;
>>>   }
>>>
>>> --
>>> 1.9.1
>>


