[dpdk-dev] [RFC 4/5] virtio/container: adjust memory initialization process

Tan, Jianfeng jianfeng.tan at intel.com
Sun Nov 8 12:18:12 CET 2015



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Saturday, November 7, 2015 12:22 AM
> To: Tan, Jianfeng; dev at dpdk.org
> Cc: nakajima.yoshihiro at lab.ntt.co.jp; zhbzg at huawei.com; mst at redhat.com;
> gaoxiaoqiu at huawei.com; oscar.zhangbo at huawei.com;
> ann.zhuangyanying at huawei.com; zhoujingbin at huawei.com;
> guohongzhen at huawei.com
> Subject: RE: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory
> initialization process
> 
> Hi,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Jianfeng Tan
> > Sent: Thursday, November 05, 2015 6:31 PM
> > To: dev at dpdk.org
> > Cc: nakajima.yoshihiro at lab.ntt.co.jp; zhbzg at huawei.com;
> > mst at redhat.com; gaoxiaoqiu at huawei.com; oscar.zhangbo at huawei.com;
> > ann.zhuangyanying at huawei.com; zhoujingbin at huawei.com;
> > guohongzhen at huawei.com
> > Subject: [dpdk-dev] [RFC 4/5] virtio/container: adjust memory
> > initialization process
> >
> > When using virtio for container, we should specify --no-huge so that
> > in memory initialization, shm_open() is used to alloc memory from
> > tmpfs filesystem /dev/shm/.
> >
> > Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> > Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
> > ---
......
> > +int
> > +rte_memseg_info_get(int index, int *pfd, uint64_t *psize, void
> > +**paddr) {
> > +	struct rte_mem_config *mcfg;
> > +	mcfg = rte_eal_get_configuration()->mem_config;
> > +
> > +	*pfd = mcfg->memseg[index].fd;
> > +	*psize = (uint64_t)mcfg->memseg[index].len;
> > +	*paddr = (void *)(uint64_t)mcfg->memseg[index].addr;
> > +	return 0;
> > +}
> 
> Wonder who will use that function?
> Can't see any references to that function in that patch or next.

This function is used in patch 1/5, when the virtio front end needs to send VHOST_USER_SET_MEM_TABLE to the back end.

> > +
> >  /*
> >   * Get physical address of any mapped virtual address in the current
> process.
> >   */
> > @@ -1044,6 +1059,42 @@ calc_num_pages_per_socket(uint64_t *
> memory,
> >  	return total_num_pages;
> >  }
> >
> > +static void *
> > +rte_eal_shm_create(int *pfd)
> > +{
> > +	int ret, fd;
> > +	char filepath[256];
> > +	void *vaddr;
> > +	uint64_t size = internal_config.memory;
> > +
> > +	sprintf(filepath, "/%s_cvio", internal_config.hugefile_prefix);
> > +
> > +	fd = shm_open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
> > +	if (fd < 0) {
> > +		rte_panic("shm_open %s failed: %s\n", filepath,
> strerror(errno));
> > +	}
> > +	ret = flock(fd, LOCK_EX);
> > +	if (ret < 0) {
> > +		close(fd);
> > +		rte_panic("flock %s failed: %s\n", filepath, strerror(errno));
> > +	}
> > +
> > +	ret = ftruncate(fd, size);
> > +	if (ret < 0) {
> > +		rte_panic("ftruncate failed: %s\n", strerror(errno));
> > +	}
> > +	/* flag: MAP_HUGETLB */
> 
> Any explanation what that comment means here?
> Do you plan to use MAP_HUGETLB in the call below or ...?

Yes, it's a todo item. shm_open() just uses the tmpfs mounted at /dev/shm, so I wonder whether we can use this flag to make
sure the OS allocates hugepages here, if the user would like to use hugepages.

>>
......
> > @@ -1081,8 +1134,8 @@ rte_eal_hugepage_init(void)
> >
> >  	/* hugetlbfs can be disabled */
> >  	if (internal_config.no_hugetlbfs) {
> > -		addr = mmap(NULL, internal_config.memory, PROT_READ |
> PROT_WRITE,
> > -				MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> > +		int fd;
> > +		addr = rte_eal_shm_create(&fd);
> 
> Why do you remove ability to map(dev/zero) here?
> Probably not everyone plan to use --no-hugepages only inside containers.

From my understanding, the mmap() here just allocates some memory, which is initialized to all zeros; I don't see the
relationship with /dev/zero. rte_eal_shm_create() does the same job as the original code, plus it produces an fd referring to
this chunk of memory. This fd is indispensable in the vhost-user protocol, where VHOST_USER_SET_MEM_TABLE passes it via sendmsg().

> 
> 
> >  		if (addr == MAP_FAILED) {
> >  			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
> __func__,
> >  					strerror(errno));
> > @@ -1093,6 +1146,7 @@ rte_eal_hugepage_init(void)
> >  		mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
> >  		mcfg->memseg[0].len = internal_config.memory;
> >  		mcfg->memseg[0].socket_id = 0;
> > +		mcfg->memseg[0].fd = fd;
> >  		return 0;
> >  	}
> >
> > diff --git a/lib/librte_mempool/rte_mempool.c
> > b/lib/librte_mempool/rte_mempool.c
> > index e57cbbd..8f8852b 100644
> > --- a/lib/librte_mempool/rte_mempool.c
> > +++ b/lib/librte_mempool/rte_mempool.c
> > @@ -453,13 +453,6 @@ rte_mempool_xmem_create(const char *name,
> unsigned n, unsigned elt_size,
> >  		rte_errno = EINVAL;
> >  		return NULL;
> >  	}
> > -
> > -	/* check that we have both VA and PA */
> > -	if (vaddr != NULL && paddr == NULL) {
> > -		rte_errno = EINVAL;
> > -		return NULL;
> > -	}
> > -
> >  	/* Check that pg_num and pg_shift parameters are valid. */
> >  	if (pg_num < RTE_DIM(mp->elt_pa) || pg_shift >
> MEMPOOL_PG_SHIFT_MAX) {
> >  		rte_errno = EINVAL;
> > @@ -596,8 +589,15 @@ rte_mempool_xmem_create(const char *name,
> > unsigned n, unsigned elt_size,
> >
> >  	/* mempool elements in a separate chunk of memory. */
> >  	} else {
> > +		/* when VA is specified, PA should be specified? */
> > +		if (rte_eal_has_hugepages()) {
> > +			if (paddr == NULL) {
> > +				rte_errno = EINVAL;
> > +				return NULL;
> > +			}
> > +			memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0])
> * pg_num);
> > +		}
> >  		mp->elt_va_start = (uintptr_t)vaddr;
> > -		memcpy(mp->elt_pa, paddr, sizeof (mp->elt_pa[0]) *
> pg_num);
> 
> Could you explain the reason for that change?
> Specially why mempool over external memory now only allowed for
> hugepages config?
> Konstantin

Oops, you're right! This change was previously made to create a mbuf mempool at a given vaddr without supplying any
paddr[]. Now we don't need to care about either vaddr or paddr[], so I should have reverted the change in this file.

> 
> >  	}
> >
> >  	mp->elt_va_end = mp->elt_va_start;
> > --
> > 2.1.4


