[dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation

Olivier Matz olivier.matz at 6wind.com
Thu Dec 26 16:45:24 CET 2019


Hi Bao-Long,

On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> Hi,
> 
> I'm not sure if this is a bug, but I've seen an inconsistency in the behavior 
> of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> same mempool size, the number of hugepages allocated changes from run to run.
> 
> Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> 
> 1. Reserve 16x1G hugepages on socket 0 
> 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> make && ./build/basicfwd 
> 3. At the same time, watch the number of hugepages allocated 
> "watch -n.1 ls /dev/hugepages"
> 4. Repeat step 2
> 
> If you can reproduce, you should see that for some runs, DPDK allocates 5
> hugepages, other times it allocates 6. When it allocates 6, if you watch the 
> output from step 3., you will see that DPDK first tries to allocate 5 hugepages, 
> then unmaps all 5, retries, and gets 6.

I cannot reproduce it under the same conditions as yours (with 16 hugepages
on socket 0), but I think I can see a similar issue:

If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
are used). If I reserve 5 hugepages, it takes more time,
taking/releasing hugepages several times, and it finally succeeds with 5
hugepages.

> For our use case, it's important that DPDK allocates the same number of 
> hugepages on every run so we can get reproducible results.

One possibility is to use the --legacy-mem EAL option. It will try to
reserve all hugepages first.
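
For example (the memory amount below is an assumption for your 16x1G
setup, adjust it to your pool size; --legacy-mem and --socket-mem are
standard EAL options):

```
./build/basicfwd --legacy-mem --socket-mem=6144,0
```

With --legacy-mem, the per-socket memory is preallocated at EAL init
instead of being mapped on demand, so the hugepage layout should be the
same on every run.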

> Studying the code, this seems to be the behavior of
> rte_mempool_populate_default(). If I understand correctly, if the first try fails
> to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> condition, and eventually winds up with 6 hugepages.

No, I think you don't have the IOVA-contiguous constraint in your
case. This is what I see:

a- reserve 5 hugepages on socket 0, and start your patched basicfwd
b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
c- the total element size (with header) is 2304 + 64 = 2368
d- in rte_mempool_op_calc_mem_size_helper(), it calculates
   obj_per_page = 453438    (453438 * 2368 = 1073741184)
   mem_size = 4966058495
e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
     mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
     align=64)
   For some reason, it fails: we can see the number of map'd hugepages
   increase in /dev/hugepages, then return to its original value.
   I don't think it should fail here.
f- then, it will try to allocate the biggest available contiguous zone. In
   my case, it is 1055291776 bytes (almost the whole single map'd hugepage).
   This is a second problem: if we call it again, it returns NULL, because
   it won't map another hugepage.
g- by luck, calling rte_mempool_populate_virt() allocates a small area
   (the mempool header), and this triggers the mapping of a new hugepage,
   which will be used in the next loop, back at step d with a smaller mem_size.

> Questions: 
> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory 
> is abundant? 

In my case, it looks like a bit less than 1G is free at the end of
the heap when we call rte_memzone_reserve_aligned(size=5G).
The allocator ends up mapping 5 pages (and fails), while only 4 are
needed.

Anatoly, do you have any idea? Shouldn't we take into account the amount
of free space at the end of the heap when expanding?

> 2. Why does the 2nd retry need N+1 hugepages?

When the first alloc fails, the mempool code tries to allocate in
several chunks which are not virtually contiguous. This is needed in
case the memory is fragmented.

> Some insights for Q1: From my experiments, seems like the IOVA of the first
> hugepage is not guaranteed to be at the start of the IOVA space (understandably).
> It could explain the retry when the IOVA of the first hugepage is near the end of 
> the IOVA space. But I have also seen situations where the 1st hugepage is near
> the beginning of the IOVA space and it still failed the 1st time.
> 
> Here's the code:
> #include <stdio.h>
> 
> #include <rte_eal.h>
> #include <rte_mbuf.h>
> 
> int
> main(int argc, char *argv[])
> {
> 	struct rte_mempool *mbuf_pool;
> 	unsigned mbuf_pool_size = 2097151;
> 
> 	int ret = rte_eal_init(argc, argv);
> 	if (ret < 0)
> 		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> 
> 	printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> 	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> 		256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> 
> 	printf("mbuf_pool %p\n", mbuf_pool);
> 
> 	return 0;
> }
> 
> Best regards,
> BL

Regards,
Olivier
