[dpdk-dev] [PATCH 00/36] mempool: rework memory allocation

Wiles, Keith keith.wiles at intel.com
Thu Apr 14 15:50:46 CEST 2016


>This series is a rework of mempool. For those who don't want to read
>all the cover letter, here is a summary:
>
>- it is not possible to allocate large mempools if there is not enough
>  contiguous memory; this series solves this issue
>- introduce new APIs with fewer arguments: "create, populate, obj_init"
>- allow freeing a mempool
>- split code into smaller functions, which will ease the introduction
>  of ext_handler
>- remove test-pmd anonymous mempool creation
>- remove most of dom0-specific mempool code
>- open the door for an eal_memory rework: we probably don't need large
>  contiguous memory areas anymore, working with pages would be enough.
>
>This breaks the ABI, as indicated in the deprecation notice for 16.04.
>The API stays almost the same: no modification is needed in the example
>apps or in test-pmd. Only kni and the mellanox drivers are slightly
>modified.
>
>This patch applies on top of 16.04 + v5 of Keith's patch:
>"mempool: reduce rte_mempool structure size"

I have not digested this complete patch series yet, but this one popped out at me, as the External Memory Manager support is sitting in the wings for the 16.07 release. If this causes the EMM patch to be rewritten or updated, that seems like a problem to me. Does this patch series add the External Memory Manager support?
http://thread.gmane.org/gmane.comp.networking.dpdk.devel/32015/focus=35107


>
>Changes RFC -> v1:
>
>- remove the rte_deconst macro, and remove some const qualifier in
>  dump/audit functions
>- rework modifications in mellanox drivers to ensure the mempool is
>  virtually contiguous
>- fix mempool memory chunk iteration (bad pointer was used)
>- fix compilation on freebsd: replace MAP_LOCKED flag by mlock()
>- fix compilation on tilera (pointer arithmetic)
>- slightly rework and clean the mempool autotest
>- fix mempool autotest on bsd
>- more validation (especially the mellanox drivers and kni, which were
>  not tested in the RFC)
>- passed autotests (x86_64-native-linuxapp-gcc and x86_64-native-bsdapp-gcc)
>- rebase on head, reorder the patches a bit and fix minor split issues
>
>
>Description of the initial issue
>--------------------------------
>
>The allocation of an mbuf pool can fail even if there is enough memory.
>The problem is related to the way the memory is allocated and used in
>dpdk. It is particularly annoying with mbuf pools, but it can also
>occur in other use cases that allocate a large amount of memory.
>
>- rte_malloc() allocates physically contiguous memory, which is needed
>  for mempools, but unnecessary most of the time.
>
>  Allocating a large physically contiguous zone is often impossible
>  because the system provides hugepages which may not be contiguous.
>
>- rte_mempool_create() (and therefore rte_pktmbuf_pool_create())
>  requires a physically contiguous zone.
>
>- rte_mempool_xmem_create() does not solve the issue as it still
>  needs the memory to be virtually contiguous, and there is no
>  way in dpdk to allocate virtually contiguous memory that is
>  not also physically contiguous.
>
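>As a minimal C illustration of the failure mode (a sketch, not taken
>from the patches; the pool size is an example), the standard 16.04
>call below needs one physically contiguous zone for the whole pool:
>
>  #include <stdio.h>
>  #include <rte_errno.h>
>  #include <rte_lcore.h>
>  #include <rte_mbuf.h>
>
>  static struct rte_mempool *
>  create_big_pool(void)
>  {
>          /* 200000 mbufs of ~2KB => a single ~450MB contiguous
>           * zone is required before this series. */
>          struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool",
>                  200000, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
>                  rte_socket_id());
>          if (mp == NULL)
>                  /* fails with ENOMEM when hugepage memory is
>                   * fragmented, even if the total free amount
>                   * would be sufficient */
>                  printf("cannot create pool: %s\n",
>                          rte_strerror(rte_errno));
>          return mp;
>  }
>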
>How to reproduce the issue
>--------------------------
>
>- start dpdk with some 2MB hugepages (the issue can also occur with 1GB)
>- allocate a large mempool
>- even if there is enough memory, the allocation can fail
>
>Example:
>
>  git clone http://dpdk.org/git/dpdk
>  cd dpdk
>  make config T=x86_64-native-linuxapp-gcc
>  make -j32
>  mkdir -p /mnt/huge
>  mount -t hugetlbfs nodev /mnt/huge
>  echo 256 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>
>  # we try to allocate a mempool whose size is ~450MB; it fails
>  ./build/app/testpmd -l 2,4 -- --total-num-mbufs=200000 -i
>
>The EAL logs "EAL: Virtual area found at..." show that there are
>several zones, but all of them smaller than 450MB.
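>
>For instance, the zone sizes can be checked by filtering the startup
>logs (a quick sketch, not part of the original report):
>
>  ./build/app/testpmd -l 2,4 -- --total-num-mbufs=200000 -i 2>&1 | \
>      grep "Virtual area"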
>
>Workarounds:
>
>- Use 1GB hugepages: it sometimes works, but for very large
>  pools (millions of mbufs) the same issue arises. Moreover, it
>  consumes at least 1GB of memory, which can be a lot in some cases.
>
>- Reboot the machine or allocate hugepages at boot time: this increases
>  the chances of getting more contiguous memory, but does not
>  completely solve the issue
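>
>For reference, a typical way to do the boot-time reservation (kernel
>command line; the sizes below are just examples):
>
>  # e.g. in /etc/default/grub, then update grub and reboot
>  GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=4"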
>
>Solutions
>---------
>
>Below is a list of proposed solutions. I implemented a quick and dirty
>PoC of solution 1, but it does not work in all conditions and it is
>really an ugly hack.  This series implements solution 4, which looks
>the best to me and does not prevent further enhancements to dpdk
>memory in the future (solution 3, for instance).
>
>Solution 1: in application
>--------------------------
>
>- allocate several hugepages using rte_malloc() or rte_memzone_reserve()
>  (only keeping complete hugepages)
>- parse memsegs and /proc/maps to check which files map these pages
>- mmap the files in a contiguous virtual area
>- use rte_mempool_xmem_create() (see the sketch below)
>
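>A minimal sketch of the last step, assuming steps 1-3 have already
>remapped pg_num hugepages into one contiguous virtual area vaddr and
>filled paddr[] with their physical addresses (the helper name and the
>mbuf element size are illustrative, not from the patches):
>
>  #include <rte_lcore.h>
>  #include <rte_mbuf.h>
>  #include <rte_mempool.h>
>
>  static struct rte_mempool *
>  pool_from_remapped_pages(void *vaddr, const phys_addr_t paddr[],
>          uint32_t pg_num)
>  {
>          /* 21 = log2(2MB), the hugepage size assumed here */
>          return rte_mempool_xmem_create("mbuf_pool", 200000,
>                  sizeof(struct rte_mbuf) + RTE_MBUF_DEFAULT_BUF_SIZE,
>                  256, sizeof(struct rte_pktmbuf_pool_private),
>                  rte_pktmbuf_pool_init, NULL, rte_pktmbuf_init, NULL,
>                  rte_socket_id(), 0, vaddr, paddr, pg_num, 21);
>  }
>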
>Cons:
>
>- 1a. parsing the memsegs of rte config in the application does not
>  use a public API, and can be broken if internal dpdk code changes
>- 1b. some memory is lost due to malloc headers. Also, if the memory is
>  very fragmented (ex: all 2MB pages are physically separated), it does
>  not work at all because we cannot get any complete page. It is not
>  possible to use a lower level allocator since commit fafcc11985a.
>- 1c. we cannot use rte_pktmbuf_pool_create(), so we need to use the
>  mempool api and do part of the job manually
>- 1d. it breaks secondary processes as the virtual addresses won't be
>  mmap'd at the same place in the secondary process
>- 1e. it only fixes the issue for the mbuf pool of the application,
>  internal pools in dpdk libraries are not modified
>- 1f. this is a pure linux solution (rte_map files)
>- 1g. The application has to be aware of the RTE_EAL_SINGLE_SEGMENTS
>  option that changes the way hugepages are mapped. By the way, it's
>  strange to have such a compile-time option; we should probably have
>  only one behavior that works all the time.
>
>Solution 2: in dpdk memory allocator
>------------------------------------
>
>- do the same as solution 1 in a new function rte_malloc_non_contig():
>  allocate several chunks and mmap them into a contiguous virtual area
>- a flag has to be added in malloc header to do the proper cleanup in
>  rte_free() (free all the chunks, munmap the memory)
>- introduce a new rte_mem_get_physmap(*physmap, addr, len) that returns
>  the virt2phys mapping of a virtual area in dpdk
>- add a mempool flag MEMPOOL_F_NON_PHYS_CONTIG to use
>  rte_malloc_non_contig() to allocate the area storing the objects
>
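>For clarity, the proposed additions as hypothetical C prototypes (none
>of these exist yet; the names come from the items above and the flag
>value is a placeholder):
>
>  #include <rte_memory.h> /* phys_addr_t */
>
>  /* hypothetical, not yet implemented */
>  void *rte_malloc_non_contig(const char *type, size_t size,
>          unsigned align);
>  int rte_mem_get_physmap(phys_addr_t *physmap, void *addr, size_t len);
>  #define MEMPOOL_F_NON_PHYS_CONTIG 0x0020 /* placeholder value */
>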
>Cons:
>
>- 2a. same as 1d: it breaks secondary processes if the mempool flag is
>  used.
>- 2b. same as 1b: some memory is lost due to malloc headers, and it
>  cannot work if memory is too fragmented.
>- 2c. rte_malloc_virt2phy() cannot be used on these zones. It would
>  return the physical address of the first page. It would be better to
>  return an error in this case.
>- 2d. need to check how to implement this on bsd (TBD)
>
>Solution 3: in dpdk eal memory
>------------------------------
>
>- Rework the way hugepages are mmap'd in dpdk: instead of having several
>  rte_map* files, just mmap one file per node. It may drastically
>  simplify EAL memory management in dpdk.
>- An API should be added to retrieve the physical mapping of a virtual
>  area (ex: rte_mem_get_physmap(*physmap, addr, len))
>- rte_malloc() and rte_memzone_reserve() won't allocate physically
>  contiguous memory anymore (TBD)
>- Update mempool to always use the rte_mempool_xmem_create() version
>
>Cons:
>
>- 3a. a lot of rework in eal memory; it will induce some behavior
>  changes and maybe api changes
>- 3b. possible conflicts with xen_dom0 mempool
>
>Solution 4: in mempool
>----------------------
>
>- Introduce a new API to fill a mempool with zones that are not
>  virtually contiguous. This requires adding new functions to create
>  and populate a mempool. Example (TBD):
>
>  - rte_mempool_create_empty(name, n, elt_size, cache_size, priv_size)
>  - rte_mempool_populate(mp, addr, len): add virtual memory for objects
>  - rte_mempool_obj_iter(mp, obj_cb, arg): call a cb for each object
>
>- update rte_mempool_create() to allocate objects in several memory
>  chunks by default if no large enough physically contiguous zone
>  is available.
>
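>A sketch of how the new API could be used, following the TBD
>prototypes above (the exact signatures may differ in the patches;
>my_obj_init and the chunk arrays are hypothetical):
>
>  static int
>  build_pool(void *chunk_addr[], size_t chunk_len[],
>          unsigned nb_chunks, unsigned elt_size,
>          rte_mempool_obj_cb_t *my_obj_init)
>  {
>          struct rte_mempool *mp;
>          unsigned i;
>
>          mp = rte_mempool_create_empty("pool", 200000, elt_size,
>                  256, 0);
>          if (mp == NULL)
>                  return -1;
>
>          /* the chunks need not be contiguous with each other */
>          for (i = 0; i < nb_chunks; i++)
>                  if (rte_mempool_populate(mp, chunk_addr[i],
>                                  chunk_len[i]) < 0)
>                          return -1;
>
>          /* initialize each object once all memory is attached */
>          rte_mempool_obj_iter(mp, my_obj_init, NULL);
>          return 0;
>  }
>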
>Tests done
>----------
>
>Compilation
>~~~~~~~~~~~
>
>The following targets:
>
> x86_64-native-linuxapp-gcc
> i686-native-linuxapp-gcc
> x86_x32-native-linuxapp-gcc
> x86_64-native-linuxapp-clang
> x86_64-native-bsdapp-gcc
> ppc_64-power8-linuxapp-gcc
> tile-tilegx-linuxapp-gcc (only the mempool files, the target does not compile)
>
>Libraries with and without debug, in static and shared mode + examples.
>
>autotests
>~~~~~~~~~
>
>Passed all autotests on x86_64-native-linuxapp-gcc (including kni) and
>mempool-related autotests on x86_64-native-bsdapp-gcc.
>
>test-pmd
>~~~~~~~~
>
># now starts fine; it was failing before when memory was too fragmented
>./x86_64-native-linuxapp-gcc/app/testpmd -l 0,2,4 -n 4 -- -i --port-topology=chained
>
># still ok
>./x86_64-native-linuxapp-gcc/app/testpmd -l 0,2,4 -n 4 -m 256 -- -i --port-topology=chained --mp-anon
>set fwd txonly
>start
>stop
>
># fails, but was failing before too. The problem is that the physical
># addresses are not properly set when using --no-huge. The mempool phys
># addrs are now correct, but the zones allocated through
># memzone_reserve() are still wrong. This could be fixed in a future
># series.
>./x86_64-native-linuxapp-gcc/app/testpmd -l 0,2,4 -n 4 -m 256 --no-huge -- -i --port-topology=chained
>set fwd txonly
>start
>stop
>
>
>Olivier Matz (36):
>  mempool: fix comments and style
>  mempool: replace elt_size by total_elt_size
>  mempool: uninline function to check cookies
>  mempool: use sizeof to get the size of header and trailer
>  mempool: rename mempool_obj_ctor_t as mempool_obj_cb_t
>  mempool: update library version
>  mempool: list objects when added in the mempool
>  mempool: remove const attribute in mempool_walk
>  mempool: remove const qualifier in dump and audit
>  mempool: use the list to iterate the mempool elements
>  mempool: use the list to audit all elements
>  mempool: use the list to initialize mempool objects
>  mempool: create the internal ring in a specific function
>  mempool: store physaddr in mempool objects
>  mempool: remove MEMPOOL_IS_CONTIG()
>  mempool: store memory chunks in a list
>  mempool: new function to iterate the memory chunks
>  mempool: simplify xmem_usage
>  mempool: introduce a free callback for memory chunks
>  mempool: make page size optional when getting xmem size
>  mempool: default allocation in several memory chunks
>  eal: lock memory when using no-huge
>  mempool: support no-hugepage mode
>  mempool: replace mempool physaddr by a memzone pointer
>  mempool: introduce a function to free a mempool
>  mempool: introduce a function to create an empty mempool
>  eal/xen: return machine address without knowing memseg id
>  mempool: rework support of xen dom0
>  mempool: create the internal ring when populating
>  mempool: populate a mempool with anonymous memory
>  mempool: make mempool populate and free api public
>  test-pmd: remove specific anon mempool code
>  mem: avoid memzone/mempool/ring name truncation
>  mempool: new flag when phys contig mem is not needed
>  app/test: rework mempool test
>  mempool: update copyright
>
> app/test-pmd/Makefile                        |    4 -
> app/test-pmd/mempool_anon.c                  |  201 -----
> app/test-pmd/mempool_osdep.h                 |   54 --
> app/test-pmd/testpmd.c                       |   23 +-
> app/test/test_mempool.c                      |  243 +++---
> doc/guides/rel_notes/release_16_04.rst       |    2 +-
> drivers/net/mlx4/mlx4.c                      |  140 ++--
> drivers/net/mlx5/mlx5_rxtx.c                 |  140 ++--
> drivers/net/mlx5/mlx5_rxtx.h                 |    4 +-
> drivers/net/xenvirt/rte_eth_xenvirt.h        |    2 +-
> drivers/net/xenvirt/rte_mempool_gntalloc.c   |    4 +-
> lib/librte_eal/common/eal_common_log.c       |    2 +-
> lib/librte_eal/common/eal_common_memzone.c   |   10 +-
> lib/librte_eal/common/include/rte_memory.h   |   11 +-
> lib/librte_eal/linuxapp/eal/eal_memory.c     |    2 +-
> lib/librte_eal/linuxapp/eal/eal_xen_memory.c |   17 +-
> lib/librte_kni/rte_kni.c                     |   12 +-
> lib/librte_mempool/Makefile                  |    5 +-
> lib/librte_mempool/rte_dom0_mempool.c        |  133 ----
> lib/librte_mempool/rte_mempool.c             | 1042 +++++++++++++++++---------
> lib/librte_mempool/rte_mempool.h             |  594 +++++++--------
> lib/librte_mempool/rte_mempool_version.map   |   18 +-
> lib/librte_ring/rte_ring.c                   |   16 +-
> 23 files changed, 1377 insertions(+), 1302 deletions(-)
> delete mode 100644 app/test-pmd/mempool_anon.c
> delete mode 100644 app/test-pmd/mempool_osdep.h
> delete mode 100644 lib/librte_mempool/rte_dom0_mempool.c
>
>-- 
>2.1.4
>
>


Regards,
Keith
