[dpdk-dev] [RFC] mem: do not merge elems from different heap allocs

Burakov, Anatoly anatoly.burakov at intel.com
Tue Dec 4 17:49:34 CET 2018


On 29-Nov-18 7:13 PM, Jim Harris wrote:
> SPDK uses the rte_mem_event_callback_register API to
> create RDMA memory regions (MRs) for newly allocated regions
> of memory.  This is used in both the SPDK NVMe-oF target
> and the NVMe-oF host driver.
> 
> DPDK creates internal malloc_elem structures for these
> allocated regions.  As users malloc and free memory, DPDK
> will sometimes merge malloc_elems that originated from
> different allocations that were notified through the
> registered mem_event callback routine.  This results
> in subsequent allocations that can span across multiple
> RDMA MRs.  This requires SPDK to check each DPDK buffer to
> see if it crosses an MR boundary, and if so, would have to
> add considerable logic and complexity to describe that
> buffer before it can be accessed by the RNIC.  It is somewhat
> analagous to rte_malloc returning a buffer that is not
> IOVA-contiguous.
> 
> As a malloc_elem gets split and some of these elements
> get freed, it can also result in DPDK sending an
> RTE_MEM_EVENT_FREE notification for a subset of the
> original RTE_MEM_EVENT_ALLOC notification.  This is also
> problematic for RDMA memory regions, since unregistering
> the memory region is all-or-nothing.  It is not possible
> to unregister part of a memory region.
> 
> SPDK is currently working around this latter issue by
> setting the RTE_MEMSEG_FLAG_DO_NOT_FREE on each of the
> memory segments associated with an RTE_MEM_EVENT_ALLOC
> notification.  But the first issue with merged elements
> crossing MR boundaries is still possible and we have
> some rather simple allocation patterns that can
> trigger it.  So we would like to propose disabling the
> merging of malloc_elems that originate from different
> RTE_MEM_EVENT_ALLOC events.
> 
> This patch demonstrates how this merging could be
> avoided.  There are some allocation patterns (especially
> frequent multi-huge-page rte_mallocs and rte_frees) that
> could result in higher memory usage which need to be
> evaluated.  It's possible this behavior would need to be
> explicitly enabled through an init flag or a separate API,
> or implicitly based on whether any memory registration
> callbacks have been registered - implementation of one
> of those is deferred pending feedback on this RFC.
> 
> Signed-off-by: Jim Harris <james.r.harris at intel.com>

Hi Jim,

I can see the use case, and i don't there's a better way to do this than 
to fix it at the allocator stage.

You're correct in pointing out that some amount of memory will be wasted 
with this approach due to free space fragmentation. However, if we make 
this behavior non-default, that is fine.

One thing i'm a bit concerned about is that this change will push malloc 
element size into two cache lines any arch that is 64-bit and has a 
64-byte cache line. Malloc is not a performance API, so structure size 
is not important as long as it's cache-line aligned, but wasting 
additional 64 bytes per allocation on some architectures (including 
ours!) can be a concern.

Moreover, this extra overhead will be there regardless of whether this 
functionality is enabled or disabled (unless we're talking #ifdef-ery, 
which i'd rather not do). This overhead will even be there on FreeBSD, 
which doesn't support dynamic memory allocation in the first place. I 
don't see a good way to avoid this other than #ifdef-magic, but as i 
said, i'd really like to avoid having another config option (or worse, 
OS-specific code in common files).

I personally think additional cache line per allocation on x86_64 and 
similar archs isn't *that* big of a price to pay for this functionality 
(which i think is useful for certain use cases like yours!), but i'm 
adding Techboard folks to this thread in case i'm mistaken on that front :)

Provided we go forward with the proposal, i believe the best way to 
implement this would be through an EAL flag.

Also, some code comments below.

> ---
>   lib/librte_eal/common/malloc_elem.c | 14 ++++++++++----
>   lib/librte_eal/common/malloc_elem.h |  6 +++++-
>   lib/librte_eal/common/malloc_heap.c |  7 ++++++-
>   3 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
> index 9d3dcb6a9..785ecdadd 100644
> --- a/lib/librte_eal/common/malloc_elem.c
> +++ b/lib/librte_eal/common/malloc_elem.c
> @@ -110,7 +110,8 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
>    */
>   void
>   malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
> -		struct rte_memseg_list *msl, size_t size)
> +		struct rte_memseg_list *msl, size_t size,
> +		struct malloc_elem *orig_elem, size_t orig_size)
>   {
>   	elem->heap = heap;
>   	elem->msl = msl;
> @@ -120,6 +121,8 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
>   	elem->state = ELEM_FREE;
>   	elem->size = size;
>   	elem->pad = 0;
> +	elem->orig_elem = orig_elem;
> +	elem->orig_size = orig_size;
>   	set_header(elem);
>   	set_trailer(elem);
>   }
> @@ -278,7 +281,8 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
>   	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
>   	const size_t new_elem_size = elem->size - old_elem_size;
>   
> -	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
> +	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
> +			 elem->orig_elem, elem->orig_size);
>   	split_pt->prev = elem;
>   	split_pt->next = next_elem;
>   	if (next_elem)
> @@ -469,7 +473,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
>   	 * with it, need to remove from free list.
>   	 */
>   	if (elem->next != NULL && elem->next->state == ELEM_FREE &&
> -			next_elem_is_adjacent(elem)) {
> +			next_elem_is_adjacent(elem) &&
> +			elem->orig_elem == elem->next->orig_elem) {

Why not make this check part of adjacency test? IMO it would be much 
clearer that way.

>   		void *erase;
>   		size_t erase_len;
>   
> @@ -490,7 +495,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
>   	 * with it, need to remove from free list.
>   	 */
>   	if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
> -			prev_elem_is_adjacent(elem)) {
> +			prev_elem_is_adjacent(elem) &&
> +			elem->orig_elem == elem->prev->orig_elem) {

Same as above.

>   		struct malloc_elem *new_elem;
>   		void *erase;
>   		size_t erase_len;
> diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
> index e2bda4c02..207767c94 100644
> --- a/lib/librte_eal/common/malloc_elem.h
> +++ b/lib/librte_eal/common/malloc_elem.h
> @@ -32,6 +32,8 @@ struct malloc_elem {
>   	volatile enum elem_state state;
>   	uint32_t pad;
>   	size_t size;
> +	struct malloc_elem *orig_elem;
> +	size_t orig_size;
>   #ifdef RTE_MALLOC_DEBUG
>   	uint64_t header_cookie;         /* Cookie marking start of data */
>   	                                /* trailer cookie at start + size */

This changes struct size to two cache lines on x86_64, which breaks the 
malloc autotest. That's not the fault of this patchset (malloc autotest 
was kind of stupid in how it handled overhead anyway), but this can be 
fixed by detecting malloc overhead size at runtime. Something like the 
following:


static int
test_multi_alloc_statistics(void)
{
	<snip>
	struct rte_malloc_socket_stats pre_stats, post_stats;
	int overhead = 0;
	
	rte_malloc_get_socket_stats(socket, &pre_stats);

	/* allocate one cacheline */
	dummy = rte_zmalloc_socket(NULL, RTE_CACHE_LINE_SIZE, 0, socket);
	if (dummy == NULL)
		return -1;

	rte_malloc_get_socket_stats(socket, &post_stats);

	/* after subtracting cache line, remainder is overhead */
	overhead = post_stats.heap_allocsz_bytes - pre_stats.heap_allocsz_bytes;
	overhead -= RTE_CACHE_LINE_SIZE;

	rte_free(dummy);

	<proceed with test>
}

> @@ -116,7 +118,9 @@ void
>   malloc_elem_init(struct malloc_elem *elem,
>   		struct malloc_heap *heap,
>   		struct rte_memseg_list *msl,
> -		size_t size);
> +		size_t size,
> +		struct malloc_elem *orig_elem,
> +		size_t orig_size);
>   
>   void
>   malloc_elem_insert(struct malloc_elem *elem);
> diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
> index c6a6d4f6b..602528299 100644
> --- a/lib/librte_eal/common/malloc_heap.c
> +++ b/lib/librte_eal/common/malloc_heap.c
> @@ -94,7 +94,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
>   {
>   	struct malloc_elem *elem = start;
>   
> -	malloc_elem_init(elem, heap, msl, len);
> +	malloc_elem_init(elem, heap, msl, len, elem, len);
>   
>   	malloc_elem_insert(elem);
>   
> @@ -857,6 +857,11 @@ malloc_heap_free(struct malloc_elem *elem)
>   	if (elem->size < page_sz)
>   		goto free_unlock;
>   
> +	/* we can only free memory back to the system as it was
> +	 * originally allocated */
> +	if (elem->size != elem->orig_size)
> +		goto free_unlock;
> +
>   	/* probably, but let's make sure, as we may not be using up full page */
>   	start = elem;
>   	len = elem->size;
> 


-- 
Thanks,
Anatoly


More information about the dev mailing list