[dpdk-dev] [PATCH v2 1/5] mem: add --single-file to create single mem-backed file
Yuanhan Liu
yuanhan.liu at linux.intel.com
Tue Mar 8 03:44:37 CET 2016
On Tue, Mar 08, 2016 at 09:55:10AM +0800, Tan, Jianfeng wrote:
> Hi Yuanhan,
>
> On 3/7/2016 9:13 PM, Yuanhan Liu wrote:
> >CC'ed the EAL hugepage maintainer, which is something you should do
> >when sending a patch.
>
> Thanks for doing this.
>
> >
> >On Fri, Feb 05, 2016 at 07:20:24PM +0800, Jianfeng Tan wrote:
> >>Originally, there are two cons in using hugepages: a. it needs root
> >>privilege to touch /proc/self/pagemap, which is a prerequisite for
> >>allocating physically contiguous memsegs; b. possibly too many
> >>hugepage files are created, especially when used with 2M hugepages.
> >>
> >>Virtual devices don't care about the physical contiguity of the
> >>allocated hugepages at all. The --single-file option provides a way
> >>to allocate all hugepages into a single mem-backed file.
> >>
> >>Known issue:
> >>a. the single-file option relies on the kernel to allocate
> >>numa-affinitive memory.
> >>b. possible ABI break: originally, --no-huge uses anonymous memory
> >>instead of a file-backed mapping to create memory.
> >>
> >>Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> >>Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
> >...
> >>@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
> >> "be specified together with --"OPT_NO_HUGE"\n");
> >> return -1;
> >> }
> >>+ if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
> >>+ RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
> >>+ "be specified together with --"OPT_SOCKET_MEM"\n");
> >>+ return -1;
> >>+ }
> >>+ if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
> >>+ RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
> >>+ "be specified together with --"OPT_SINGLE_FILE"\n");
> >>+ return -1;
> >>+ }
> >The two limitations don't make sense to me.
>
> For the force_sockets option, my original thought on the --single-file
> option was that we don't sort those pages (sorting requires
> root/cap_sys_admin) and don't even look up numa information, because
> the file may contain both sockets' memory.
>
> For the hugepage_unlink option: those hugepage files get closed at the
> end of memory initialization, so if we also unlink them, we cannot
> share them with other processes (say, a backend).
Yeah, I know where the two limitations come from: your implementation. I
was just wondering whether both of them are __truly__ limitations. I
mean, can we get rid of them somehow?
For the --socket-mem option, if we can't handle it well, or if we simply
ignore the socket_id of the allocated huge pages, then yes, the
limitation is a true one.
But for the second one, no, we should be able to make it work well
together. One extra action is that you should not invoke "close(fd)"
on those huge page files. Then you can get all the information, as I
stated in a reply to your 2nd patch.
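To illustrate, here is a rough, untested sketch (the helper name and the
unix socket setup are made up, not existing EAL code): once the fd stays
open, the primary process can hand it to a backend over a unix domain
socket with SCM_RIGHTS, and the backend can then mmap() the very same
huge page file:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Untested sketch: pass an open huge page file fd to a backend
     * process over a connected unix domain socket, so the backend
     * can mmap() the same memory. This only works if EAL does NOT
     * close(fd) at the end of memory initialization. */
    static int
    share_hugefile_fd(int unix_sock, int hugefile_fd)
    {
        struct msghdr msg;
        struct cmsghdr *cmsg;
        char control[CMSG_SPACE(sizeof(int))];
        char dummy = '\0';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

        memset(&msg, 0, sizeof(msg));
        memset(control, 0, sizeof(control));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &hugefile_fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }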
> >
> >>diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>index 6008533..68ef49a 100644
> >>--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
> >> /* get pointer to global configuration */
> >> mcfg = rte_eal_get_configuration()->mem_config;
> >>- /* hugetlbfs can be disabled */
> >>- if (internal_config.no_hugetlbfs) {
> >>- addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
> >>- MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> >>+ /* when hugetlbfs is disabled or single-file option is specified */
> >>+ if (internal_config.no_hugetlbfs || internal_config.single_file) {
> >>+ int fd;
> >>+ uint64_t pagesize;
> >>+ unsigned socket_id = rte_socket_id();
> >>+ char filepath[MAX_HUGEPAGE_PATH];
> >>+
> >>+ if (internal_config.no_hugetlbfs) {
> >>+ eal_get_hugefile_path(filepath, sizeof(filepath),
> >>+ "/dev/shm", 0);
> >>+ pagesize = RTE_PGSIZE_4K;
> >>+ } else {
> >>+ struct hugepage_info *hpi;
> >>+
> >>+ hpi = &internal_config.hugepage_info[0];
> >>+ eal_get_hugefile_path(filepath, sizeof(filepath),
> >>+ hpi->hugedir, 0);
> >>+ pagesize = hpi->hugepage_sz;
> >>+ }
> >>+ fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
> >>+ if (fd < 0) {
> >>+ RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
> >>+ __func__, filepath, strerror(errno));
> >>+ return -1;
> >>+ }
> >>+
> >>+ if (ftruncate(fd, internal_config.memory) < 0) {
> >>+ RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
> >>+ filepath, strerror(errno));
> >>+ return -1;
> >>+ }
> >>+
> >>+ addr = mmap(NULL, internal_config.memory,
> >>+ PROT_READ | PROT_WRITE,
> >>+ MAP_SHARED | MAP_POPULATE, fd, 0);
> >> if (addr == MAP_FAILED) {
> >>- RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
> >>- strerror(errno));
> >>+ RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
> >>+ __func__, strerror(errno));
> >> return -1;
> >> }
> >> mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
> >> mcfg->memseg[0].addr = addr;
> >>- mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
> >>+ mcfg->memseg[0].hugepage_sz = pagesize;
> >> mcfg->memseg[0].len = internal_config.memory;
> >>- mcfg->memseg[0].socket_id = 0;
> >>+ mcfg->memseg[0].socket_id = socket_id;
> >I saw quite a few issues:
> >
> >- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
> > mounted at /dev/hugepages and /mnt, respectively.
> >
> > Here we then get 5G of internal_config.memory, and your code will
> > try to mmap 5G on the first mount point (/dev/hugepages) due to the
> > hardcoded logic in your code:
> >
> > hpi = &internal_config.hugepage_info[0];
> > eal_get_hugefile_path(filepath, sizeof(filepath),
> > hpi->hugedir, 0);
> >
> > But it has only 4G in total; therefore, it will fail.
>
> As mentioned above, this case is not part of the original design of --single-file.
But it's such a common case, isn't it?
> >
> >- As you stated, socket_id is hardcoded, which could be wrong.
>
> We rely on the OS to allocate hugepages, and cannot promise that the
> physical hugepages in the big hugepage file are from the same socket.
>
> >
> >- As stated in above, the option limitation doesn't seem right to me.
> >
> > I mean, --single-file should be able to work with the --socket-mem
> > option semantically.
>
> If we'd like to work well with the --socket-mem option, we need to use
> syscalls like set_mempolicy() and mbind(). That would be a bigger
> change than the current one. I don't know if it's acceptable?
Yes, if that's the right way to go. But as you also stated, I doubt we
really need to handle the numa affinity here, as it's complex.
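For reference, here is roughly what the mbind() route would look like
(an untested sketch; it assumes libnuma's <numaif.h>, and it must run
before the pages are faulted in, i.e. before MAP_POPULATE or first
touch):

    #include <stddef.h>
    #include <numaif.h>    /* mbind(), MPOL_BIND; link with -lnuma */

    /* Untested sketch: bind an mmap()'ed region to one NUMA node
     * before the backing huge pages are faulted in, so the kernel
     * allocates them from that socket. Call it after mmap() without
     * MAP_POPULATE, then touch the pages. */
    static int
    bind_mem_to_node(void *addr, size_t len, unsigned long node)
    {
        unsigned long nodemask = 1UL << node;

        /* maxnode is a bit count, hence the "* 8" */
        return mbind(addr, len, MPOL_BIND, &nodemask,
                     sizeof(nodemask) * 8, 0);
    }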
> >
> >
> >And I have been thinking about how to deal with those issues properly,
> >and a __very immature__ solution came to my mind (which could simply
> >not work), but anyway, here it is FYI: we go through the same process
> >used for normal huge page initialization for the --single-file option
> >as well, but take different actions, or no action at all, at some
> >stages when that option is given. That is a bit similar to the way
> >RTE_EAL_SINGLE_FILE_SEGMENTS is handled.
> >
> >And we create one hugepage file for each node, each page size. For a
> >system like mine above (2 nodes), it may populate files like following:
> >
> >- 1G x 2 on node0
> >- 1G x 2 on node1
> >- 2M x 256 on node0
> >- 2M x 256 on node1
> >
> >That could normally fit your case. Though 4 nodes looks like the
> >maximum node number, the --socket-mem option may relieve the limit a
> >bit.
> >
> >And if we "could" not care the socket_id being set correctly, we could
> >just simply allocate one file for each hugepage size. That would work
> >well for your container enabling.
>
> This way seems a good option at first sight. Let's compare this new way with
> original design.
>
> The original design just covers the simplest scenario:
> a. just one hugetlbfs (the new way can support multiple hugetlbfs
> mount points)
> b. does not require root privilege (the new way can achieve this by
> using the above-mentioned mbind() or set_mempolicy() syscalls)
> c. no sorting (both ways are OK)
> d. performance: from the perspective of virtio for container, our main
> concern is the performance of address translation in the vhost. The
> vhost now uses an O(n) linear comparison to translate addresses (this
> can be optimized to O(log n) using a segment tree, or even better a
> cache; sorry, that's another problem; see the sketch after this list),
> so we should maintain as few files as possible. (The new way can
> achieve this when used with --socket-mem, --huge-dir.)
> e. numa awareness is not required (and it's complex). (The new way can
> solve this, though without guarantees.)
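[For illustration, a rough, untested sketch of the O(log n) lookup
mentioned in (d); the region layout and field names are hypothetical,
not the actual vhost structures:]

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical region table, sorted by guest_phys_addr. */
    struct mem_region {
        uint64_t guest_phys_addr;
        uint64_t size;
        uint64_t host_user_addr;
    };

    /* Untested sketch of an O(log n) guest->host translation by
     * binary search, replacing the O(n) linear scan. */
    static uint64_t
    gpa_to_vva(const struct mem_region *regions, size_t nregions,
               uint64_t gpa)
    {
        size_t lo = 0, hi = nregions;

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            const struct mem_region *r = &regions[mid];

            if (gpa < r->guest_phys_addr)
                hi = mid;
            else if (gpa >= r->guest_phys_addr + r->size)
                lo = mid + 1;
            else
                return r->host_user_addr +
                       (gpa - r->guest_phys_addr);
        }
        return 0; /* not found */
    }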
>
> In all, this new way seems great to me.
>
> Another thing: if we "go through the same process to handle normal huge
> page initialization", my concern is that RTE_EAL_SINGLE_FILE_SEGMENTS
> goes that way to maximize code reuse, whereas the new way has little
> code in common with the original one, and mixing these options together
> leads to bad readability. What do you think?
Indeed. I've already found that the code is a bit hard to read, due to the
many "#ifdef ... #else ... #endif" blocks, for RTE_EAL_SINGLE_FILE_SEGMENTS
as well as for some special archs.
Therefore, I would suggest doing it as below: add another option based
on the SINGLE_FILE_SEGMENTS implementation.
I mean, SINGLE_FILE_SEGMENTS already tries to generate as few files as
possible. If we add another option, say --no-sort (or
--no-phys-continuity), we could add just a few lines of code to have it
generate one file for each huge page size (if we don't consider the numa
affinity).
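A rough sketch of what those few lines could look like (untested; it
reuses internal_config and eal_get_hugefile_path(), and num_pages[0] is
assumed to hold the total page count before any per-socket attribution):

    /* Untested sketch, in the context of eal_memory.c: with the new
     * option, create exactly one file per configured huge page size
     * (instead of one per page), ignoring numa affinity. */
    static int
    create_one_file_per_pagesize(void)
    {
        unsigned i;

        for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
            struct hugepage_info *hpi = &internal_config.hugepage_info[i];
            uint64_t size = hpi->hugepage_sz * hpi->num_pages[0];
            char filepath[MAX_HUGEPAGE_PATH];
            int fd;

            eal_get_hugefile_path(filepath, sizeof(filepath),
                                  hpi->hugedir, i);
            fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
            if (fd < 0)
                return -1;
            if (ftruncate(fd, size) < 0) {
                close(fd);
                return -1;
            }
            /* mmap() it and fill mcfg->memseg[i] as the patch above
             * does; keep fd open so the file can be shared with a
             * backend. */
        }
        return 0;
    }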
> >
> >BTW, since we already have the SINGLE_FILE_SEGMENTS (config) option,
> >adding another option named --single-file looks really confusing to me.
> >
> >To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option and
> >add another option, say --no-sort (I confess this name sucks, but you
> >get my point). With that, we could make sure to create as few huge
> >page files as possible, to fit your case.
>
> This is great advice. So what do you think of --converged, or
> --no-scattered-mem, or any better idea?
TBH, none of them looks great to me either. But I have no better
option. Well, --no-phys-continuity looks like the best one to me so
far :)
--yliu
>
> Thanks for valuable input.
>
> Jianfeng
>
> >
> > --yliu