[dpdk-dev] [RFC v5] /net: memory interface (memif)

Ferruh Yigit ferruh.yigit at intel.com
Mon Mar 25 21:58:36 CET 2019


On 3/22/2019 11:57 AM, Jakub Grajciar wrote:
> Memory interface (memif), provides high performance
> packet transfer over shared memory.
> 
> Signed-off-by: Jakub Grajciar <jgrajcia at cisco.com>

<...>

> @@ -0,0 +1,200 @@
> +..  SPDX-License-Identifier: BSD-3-Clause
> +    Copyright(c) 2018-2019 Cisco Systems, Inc.
> +
> +======================
> +Memif Poll Mode Driver
> +======================
> +Shared memory packet interface (memif) PMD allows for DPDK and any other client
> +using memif (DPDK, VPP, libmemif) to communicate using shared memory. Memif is
> +Linux only.
> +
> +The created device transmits packets in a raw format. It can be used with
> +Ethernet mode, IP mode, or Punt/Inject. At this moment, only Ethernet mode is
> +supported in DPDK memif implementation.
> +
> +Memif works in two roles: master and slave. Slave connects to master over an
> +existing socket. It is also a producer of shared memory file and initializes
> +the shared memory. Master creates the socket and listens for any slave
> +connection requests. The socket may already exist on the system. Be sure to
> +remove any such sockets, if you are creating a master interface, or you will
> +see an "Address already in use" error. Function ``rte_pmd_memif_remove()``,

Can it be possible to remove this existing socket on successfully termination of
the dpdk application with 'master' role?
Otherwise each time to run a dpdk app, requires to delete the socket file first.

> +which removes memif interface, will also remove a listener socket, if it is
> +not being used by any other interface.
> +
> +The method to enable one or more interfaces is to use the
> +``--vdev=net_memif0`` option on the DPDK application command line. Each
> +``--vdev=net_memif1`` option given will create an interface named net_memif0,
> +net_memif1, and so on. Memif uses unix domain socket to transmit control
> +messages. Each memif has a unique id per socket. If you are connecting multiple
> +interfaces using same socket, be sure to specify unique ids ``id=0``, ``id=1``,
> +etc. Note that if you assign a socket to a master interface it becomes a
> +listener socket. Listener socket can not be used by a slave interface on same
> +client.

When I connect two slaves, id=0 & id=1 to the master, both master and second
slave crashed as soon as second slave connected, is this a know issue? Can you
please check this?

master: ./build/app/testpmd -w0:0.0 -l20,21 --vdev net_memif0,role=master -- -i
slave1: ./build/app/testpmd -w0:0.0 -l20,22 --file-prefix=slave --vdev
net_memif0,role=slave,id=0 -- -i
slave2: ./build/app/testpmd -w0:0.0 -l20,23 --file-prefix=slave2 --vdev
net_memif0,role=slave,id=1 -- -i

<...>

> +Example: testpmd and testpmd
> +----------------------------
> +In this example we run two instances of testpmd application and transmit packets over memif.

How this play with multi process support? When I run a secondary app when memif
PMD is enabled in primary, secondary process crashes.
Can you please check and ensure at least nothing crashes when a secondary app run?

> +
> +First create ``master`` interface::
> +
> +    #./testpmd -l 0-1 --proc-type=primary --file-prefix=pmd1 --vdev=net_memif,role=master -- -i
> +
> +Now create ``slave`` interface (master must be already running so the slave will connect)::
> +
> +    #./testpmd -l 2-3 --proc-type=primary --file-prefix=pmd2 --vdev=net_memif -- -i
> +
> +Set forwarding mode in one of the instances to 'rx only' and the other to 'tx only'::
> +
> +    testpmd> set fwd rxonly
> +    testpmd> start
> +
> +    testpmd> set fwd txonly
> +    testpmd> start

Would it be useful to add loopback option to the PMD, for testing/debug ?

Also I am getting low performance numbers above, comparing the ring pmd for
example, is there any performance target for the pmd?
Same forwarding core for both testpmds, this is terrible, either something is
wrong or I am doing something wrong, it is ~40Kpps
Different forwarding cores for each testpmd, still low, !18Mpps

<...>

> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +CFLAGS += -DALLOW_EXPERIMENTAL_API

Can you please add here which experimental APIs are called (as a comment), this
help to keep trace of them in long term?

<...>

> +/**
> + * Buffer descriptor.
> + */
> +typedef struct __rte_packed {
> +	uint16_t flags;				/**< flags */
> +#define MEMIF_DESC_FLAG_NEXT 1			/**< is chained buffer */

Is this define used?

<...>

> +
> +static struct rte_vdev_driver pmd_memif_drv;

Same comment from previous review, is this forward deceleration required?

<...>

> +static memif_ring_t *
> +memif_get_ring(struct pmd_internals *pmd, memif_ring_type_t type, uint16_t ring_num)
> +{
> +	/* rings only in region 0 */
> +	void *p = pmd->regions[0].addr;
> +	int ring_size = sizeof(memif_ring_t) + sizeof(memif_desc_t) *
> +	    (1 << pmd->run.log2_ring_size);
> +
> +	p = (uint8_t *)p + (ring_num + type * pmd->run.num_s2m_rings) * ring_size;

According above code, I guess layout is as following [1], can you please correct
if it is wrong?

Can you please put this information somewhere, possibly the function comment
that allocates it, so that everyone doesn't need to figure out.


[1]

region 0:
+-----------------------+
| S2M rings | M2S rings |
+-----------------------+

S2M OR M2S Rings:
+-----------------------------------------+
| ring 0 | ring 1 | ring num_s2m_rings - 1|
+-----------------------------------------+

ring 0:
+-----------------------------------------------------+
| Ring Header | (1 << pmd->run.log2_ring_size) * desc |
+-----------------------------------------------------+


region 1:
+-------------------------------------------------------------------------+
| packet buffer 0 | . | pb ((1 << pmd->run.log2_ring_size)*(s2m + m2s))-1 |
+-------------------------------------------------------------------------+
Is there any order on packet buffers?

<...>

> +	while (n_slots && n_rx_pkts < nb_pkts) {
> +		mbuf_head = rte_pktmbuf_alloc(mq->mempool);
> +		if (unlikely(mbuf_head == NULL))
> +			goto no_free_bufs;
> +		mbuf = mbuf_head;
> +		mbuf->port = mq->in_port;
> +
> + next_slot:
> +		s0 = cur_slot & mask;
> +		d0 = &ring->desc[s0];
> +
> +		src_len = d0->length;
> +		dst_off = 0;
> +		src_off = 0;
> +
> +		do {
> +			dst_len = mbuf_size - dst_off;
> +			if (dst_len == 0) {
> +				dst_off = 0;
> +				dst_len = mbuf_size + RTE_PKTMBUF_HEADROOM;
> +
> +				mbuf = rte_pktmbuf_alloc(mq->mempool);
> +				if (unlikely(mbuf == NULL))
> +					goto no_free_bufs;
> +				mbuf->port = mq->in_port;
> +				rte_pktmbuf_chain(mbuf_head, mbuf);
> +			}
> +			cp_len = RTE_MIN(dst_len, src_len);
> +
> +			rte_pktmbuf_pkt_len(mbuf) =
> +			    rte_pktmbuf_data_len(mbuf) += cp_len;

Can you please make this two lines to prevent confusion?

Also shouldn't need to add 'cp_len' to 'mbuf_head->pkt_len'?
'rte_pktmbuf_chain' updates 'mbuf_head' but at that stage 'mbuf->pkt_len' is not
set to 'cp_len'...

<...>

> +void
> +memif_free_regions(struct pmd_internals *pmd)
> +{
> +	int i;
> +	struct memif_region *r;
> +
> +	for (i = 0; i < pmd->regions_num; i++) {
> +		r = pmd->regions + i;
> +		if (r == NULL)
> +			return;

'r' can be NULL only when 'pmd->region' is NULL and 'i == 0', so is it better to
check 'pmd->region' is NULL check before the loop?

> +		if (r->addr == NULL)
> +			return;
> +		munmap(r->addr, r->region_size);
> +		if (r->fd > 0) {
> +			close(r->fd);
> +			r->fd = -1;
> +		}
> +	}
> +	rte_free(pmd->regions);
> +}
> +
> +static int
> +memif_alloc_regions(struct pmd_internals *pmd, uint8_t brn)
> +{
> +	struct memif_region *r;
> +	char shm_name[32];
> +	int i;
> +	int ret = 0;
> +
> +	r = rte_zmalloc("memif_region", sizeof(struct memif_region) * (brn + 1), 0);
> +	if (r == NULL) {
> +		MIF_LOG(ERR, "%s: Failed to allocate regions.",
> +			rte_vdev_device_name(pmd->vdev));
> +		return -ENOMEM;
> +	}
> +
> +	pmd->regions = r;
> +	pmd->regions_num = brn + 1;
> +
> +	/*
> +	 * Create shm for every region. Region 0 is reserved for descriptors.
> +	 * Other regions contain buffers.
> +	 */
> +	for (i = 0; i < (brn + 1); i++) {
> +		r = &pmd->regions[i];
> +
> +		r->buffer_offset = (i == 0) ? (pmd->run.num_s2m_rings +
> +					       pmd->run.num_m2s_rings) *
> +		    (sizeof(memif_ring_t) +
> +		     sizeof(memif_desc_t) * (1 << pmd->run.log2_ring_size)) : 0;

For complex operations can you please prefer regular if check against ternary,
with the help of the formatting, it is hard to follow this.

what exactly "buffer_offset" is? For 'region 0' it calculates the size of the
size of the 'region 0', otherwise 0. This is offset starting from where? And it
seems only used for below size assignment.

> +		r->region_size = (i == 0) ? r->buffer_offset :
> +		    (uint32_t)(pmd->run.buffer_size *

I guess 'buffer_size' is buffer size per packet, to make this clear, what do you
think to rename it 'packet_buffer_size' ?

> +				(1 << pmd->run.log2_ring_size) *
> +				(pmd->run.num_s2m_rings +
> +				 pmd->run.num_m2s_rings));

There is an illusion of packet buffers can be in multiple regions (above comment
implies it) but this logic assumes all packet buffer are in same region other
than region 0, which gives us 'region 1' and this is already hardcoded in a few
places. Is there a benefit of assuming there will be more regions, will it make
simple to accept 'region 0' for rings and 'region 1' is for packet buffer?

> +
> +		memset(shm_name, 0, sizeof(char) * 32);
> +		sprintf(shm_name, "memif region %d", i);

snprintf please, and can you please add a define for shm_name size?

<...>

> +static void
> +memif_init_rings(struct rte_eth_dev *dev)
> +{
> +	struct pmd_internals *pmd = dev->data->dev_private;
> +	memif_ring_t *ring;
> +	int i, j;
> +	uint16_t slot;
> +
> +	for (i = 0; i < pmd->run.num_s2m_rings; i++) {
> +		ring = memif_get_ring(pmd, MEMIF_RING_S2M, i);
> +		ring->head = 0;
> +		ring->tail = 0;
> +		ring->cookie = MEMIF_COOKIE;
> +		ring->flags = 0;
> +		for (j = 0; j < (1 << pmd->run.log2_ring_size); j++) {
> +			slot = i * (1 << pmd->run.log2_ring_size) + j;
> +			ring->desc[j].region = 1;

Why 'region' 1 is hardcoded for buffer, can it be possible to use multiple
regions for buffers?

<...>

> +int
> +memif_connect(struct rte_eth_dev *dev)
> +{
> +	struct pmd_internals *pmd = dev->data->dev_private;
> +	struct memif_region *mr;
> +	struct memif_queue *mq;
> +	int i;
> +
> +	for (i = 0; i < pmd->regions_num; i++) {
> +		mr = pmd->regions + i;
> +		if (mr != NULL) {
> +			if (mr->addr == NULL) {
> +				if (mr->fd < 0)
> +					return -1;
> +				mr->addr = mmap(NULL, mr->region_size,
> +						PROT_READ | PROT_WRITE,
> +						MAP_SHARED, mr->fd, 0);
> +				if (mr->addr == NULL)
> +					return -1;
> +			}
> +		}
> +	}

For one master work with multiple slaves, there should be an array of regions,
right, one for for each slave.
Also same concern is valid for Rx/Tx queues, master and slave uses same
dev->data->tx_queues / dev->data->rx_queue, but crossover. I think a more
complex logic required for multiple slave support.

If multiple slave is not supported, is 'id' devarg still valid, or is it used at
all?

<...>

> +static int
> +memif_create(struct rte_vdev_device *vdev, enum memif_role_t role,
> +	     memif_interface_id_t id, uint32_t flags,
> +	     const char *socket_filename,
> +	     memif_log2_ring_size_t log2_ring_size,
> +	     uint16_t buffer_size, const char *secret,
> +	     struct ether_addr *eth_addr)
> +{
> +	int ret = 0;
> +	struct rte_eth_dev *eth_dev;
> +	struct rte_eth_dev_data *data;
> +	struct pmd_internals *pmd;
> +	const unsigned int numa_node = vdev->device.numa_node;
> +	const char *name = rte_vdev_device_name(vdev);
> +
> +	if (flags & ETH_MEMIF_FLAG_ZERO_COPY) {
> +		MIF_LOG(ERR, "Zero-copy not supported.");
> +		return -1;
> +	}

What is the plan for the zero copy?
Can you please put the status & plan to the documentation?

<...>

> +struct memif_queue {
> +	struct rte_mempool *mempool;		/**< mempool for RX packets */
> +	uint16_t in_port;			/**< port id */
> +
> +	struct pmd_internals *pmd;		/**< device internals */
> +
> +	struct rte_intr_handle intr_handle;	/**< interrupt handle */
> +
> +	/* ring info */
> +	memif_ring_type_t type;			/**< ring type */
> +	memif_ring_t *ring;			/**< pointer to ring */
> +	memif_log2_ring_size_t log2_ring_size;	/**< log2 of ring size */
> +
> +	memif_region_index_t region;		/**< shared memory region index */
> +	memif_region_offset_t offset;		/**< offset at which the queue begins */

Offset from where, it is start of the region but I think better to detail in the
comment.

> +
> +	uint16_t last_head;			/**< last ring head */
> +	uint16_t last_tail;			/**< last ring tail */

ring already has head, tail variables, what is the difference in queue ones?


More information about the dev mailing list