[dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation

Olivier Matz olivier.matz at 6wind.com
Thu Jan 9 14:32:34 CET 2020


On Tue, Jan 07, 2020 at 01:06:01PM +0000, Burakov, Anatoly wrote:
> On 27-Dec-19 11:11 AM, Olivier Matz wrote:
> > Hi Bao-Long,
> > 
> > On Fri, Dec 27, 2019 at 06:05:57PM +0800, Bao-Long Tran wrote:
> > > Hi Olivier,
> > > 
> > > > On 27 Dec 2019, at 4:11 PM, Olivier Matz <olivier.matz at 6wind.com> wrote:
> > > > 
> > > > On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> > > > > Hi Bao-Long,
> > > > > 
> > > > > On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I'm not sure if this is a bug, but I've seen an inconsistency in the behavior
> > > > > > of DPDK with regards to hugepage allocation for rte_mempool. Basically, for the
> > > > > > same mempool size, the number of hugepages allocated changes from run to run.
> > > > > > 
> > > > > > Here's how I reproduce with DPDK 19.11. IOVA=pa (default)
> > > > > > 
> > > > > > 1. Reserve 16x1G hugepages on socket 0
> > > > > > 2. Replace examples/skeleton/basicfwd.c with the code below, build and run
> > > > > > make && ./build/basicfwd
> > > > > > 3. At the same time, watch the number of hugepages allocated
> > > > > > "watch -n.1 ls /dev/hugepages"
> > > > > > 4. Repeat step 2
> > > > > > 
> > > > > > If you can reproduce, you should see that for some runs, DPDK allocates 5
> > > > > > hugepages, other times it allocates 6. When it allocates 6, if you watch the
> > > > > > output from step 3., you will see that DPDK first tries to allocate 5 hugepages,
> > > > > > then unmaps all 5, retries, and gets 6.
> > > > > 
> > > > > I cannot reproduce under the same conditions as yours (with 16 hugepages
> > > > > on socket 0), but I think I can see a similar issue:
> > > > > 
> > > > > If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> > > > > are used). If I reserve 5 hugepages, it takes more time,
> > > > > taking/releasing hugepages several times, and it finally succeeds with 5
> > > > > hugepages.
> > > 
> > > My apologies: I just checked again, I was using DPDK 19.05, not 19.11 or master.
> > > Let me try to see if I can repro my issue with 19.11. Sorry for the confusion.
> > > 
> > > I also saw your patch to reduce wasted memory (eba11e). Seems like it resolves
> > > the problem with the IOVA-contig constraint that I described in my first message.
> > > I'll look into it to confirm.
> > > 
> > > If I cannot repro my issue (different number of hugepages) with 19.11, from our
> > > side we can upgrade to 19.11 and that's all we need for now. But let me also try
> > > to repro the issue you described (multiple attempts to allocate hugepages).
> > 
> > OK, thanks.
> > 
> > Anyway, I think there is an issue on 19.11. And it is even worse with
> > 2M hugepages. Let's say we reserve 500x 2M hugepages, and try to
> > allocate a mempool of 5G:
> > 
> > 1/ mempool_populate tries to allocate in one virtually contiguous block,
> >     which maps all 500 hugepages, then fails, unmapping them
> > 2/ it tries to allocate the largest zone, which returns ~2MB.
> > 3/ this zone is added to the mempool, and for that, it allocates a
> >     mem_header struct, which triggers the mapping of a new page.
> > 4/ Back to 1... until it fails after 3 mins
> > 
> > The memzone allocation of "largest available area" does not have the
> > same semantics in every memory model (pre-mapped hugepages or
> > not). When using dynamic hugepage mapping, it won't map any additional
> > hugepage.
> > 
> > To solve the issue, we could either change it to allocate all available
> > hugepages, or change mempool populate to stop using the "largest
> > available area" allocation and do the search ourselves.
> 
> Yep, this is one of the things that is currently an unsolved problem in the
> allocator. I am not sure if any one behavior is "more correct" than the
> other, so I don't think allocating "all available" hugepages is more correct
> than not doing it.
> 
> Besides, there's no reliable way to get "biggest" chunk of memory, because
> while you might get *some* memory from 2M pages, there's no guarantee that
> the amount you may get from 1G pages isn't bigger. So, we either momentarily
> take over the entire users' memory and figure out what we need and what we
> don't, or we use the first available page size and hope that that's enough.
> 
> That said, there's an internal API to allocate "up to X" pages, so in
> principle, we could build this kind of infrastructure.

I tried to solve the issue in mempool, without using the memzone_alloc(size=0)
feature. See https://patches.dpdk.org/patch/64370/

> 
> > 
> > 
> > > 
> > > > > 
> > > > > > For our use case, it's important that DPDK allocate the same number of
> > > > > > hugepages on every run so we can get reproducible results.
> > > > > 
> > > > > One possibility is to use the --legacy-mem EAL option. It will try to
> > > > > reserve all hugepages first.
> > > > 
> > > > Passing --socket-mem=5120,0 also does the job.
> > > > 
> > > 
> > > > > > Studying the code, this seems to be the behavior of
> > > > > > rte_mempool_populate_default(). If I understand correctly, if the first try fails
> > > > > > to get 5 IOVA-contiguous pages, it retries, relaxing the IOVA-contiguous
> > > > > > condition, and eventually winds up with 6 hugepages.
> > > > > 
> > > > > No, I think you don't have the IOVA-contiguous constraint in your
> > > > > case. This is what I see:
> > > > > 
> > > > > a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> > > > > b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> > > > > c- the total element size (with header) is 2304 + 64 = 2368
> > > > > d- in rte_mempool_op_calc_mem_size_helper(), it calculates
> > > > >    obj_per_page = 453438    (453438 * 2368 = 1073741184)
> > > > >    mem_size = 4966058495
> > > > > e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
> > > > >    rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
> > > > >      mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
> > > > >      align=64)
> > > > >    For some reason, it fails: we can see that the number of mapped hugepages
> > > > >    increases in /dev/hugepages, then returns to its original value.
> > > > >    I don't think it should fail here.
> > > > > f- then, it will try to allocate the biggest available contiguous zone. In
> > > > >    my case, it is 1055291776 bytes (almost all of the single mapped hugepage).
> > > > >    This is a second problem: if we call it again, it returns NULL, because
> > > > >    it won't map another hugepage.
> > > > > g- by luck, calling rte_mempool_populate_virt() allocates a small area
> > > > >    (mempool header), and it triggers the mapping of a new hugepage, that
> > > > >    will be used in the next loop, back at step d with a smaller mem_size.
> > > > > 
> > > 
> > > > > > Questions:
> > > > > > 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage memory
> > > > > > is abundant?
> > > > > 
> > > > > In my case, it looks like we have a bit less than 1G free at
> > > > > the end of the heap, then we call rte_memzone_reserve_aligned(size=5G).
> > > > > The allocator ends up mapping 5 pages (and fails), while only 4 are
> > > > > needed.
> > > > > 
> > > > > Anatoly, do you have any idea? Shouldn't we take into account the amount
> > > > > of free space at the end of the heap when expanding?
> > > > > 
> > > > > > 2. Why does the 2nd retry need N+1 hugepages?
> > > > > 
> > > > > When the first alloc fails, the mempool code tries to allocate in
> > > > > several chunks which are not virtually contiguous. This is needed in
> > > > > case the memory is fragmented.
> > > > > 
> > > > > > Some insights for Q1: From my experiments, seems like the IOVA of the first
> > > > > > hugepage is not guaranteed to be at the start of the IOVA space (understandably).
> > > > > > It could explain the retry when the IOVA of the first hugepage is near the end of
> > > > > > the IOVA space. But I have also seen situations where the 1st hugepage is near
> > > > > > the beginning of the IOVA space and it still failed the 1st time.
> > > > > > 
> > > > > > Here's the code:
> > > > > > #include <stdio.h>
> > > > > > #include <stdlib.h>
> > > > > > 
> > > > > > #include <rte_eal.h>
> > > > > > #include <rte_mbuf.h>
> > > > > > 
> > > > > > int
> > > > > > main(int argc, char *argv[])
> > > > > > {
> > > > > > 	struct rte_mempool *mbuf_pool;
> > > > > > 	unsigned mbuf_pool_size = 2097151;
> > > > > > 
> > > > > > 	int ret = rte_eal_init(argc, argv);
> > > > > > 	if (ret < 0)
> > > > > > 		rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> > > > > > 
> > > > > > 	printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> > > > > > 	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> > > > > > 		256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> > > > > > 
> > > > > > 	printf("mbuf_pool %p\n", mbuf_pool);
> > > > > > 
> > > > > > 	return 0;
> > > > > > }
> > > > > > 
> > > > > > Best regards,
> > > > > > BL
> > > > > 
> > > > > Regards,
> > > > > Olivier
> > > 
> > > Thanks,
> > > BL
> > > 
> > 
> 
> 
> -- 
> Thanks,
> Anatoly

