[dpdk-dev] [RFC v5] /net: memory interface (memif)

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Fri May 3 06:27:37 CEST 2019


> On 3/22/2019 11:57 AM, Jakub Grajciar wrote:
> > Memory interface (memif), provides high performance packet transfer
> > over shared memory.
> >
> > Signed-off-by: Jakub Grajciar <jgrajcia at cisco.com>
> 
> <...>
> 
> > @@ -0,0 +1,200 @@
> > +..  SPDX-License-Identifier: BSD-3-Clause
> > +    Copyright(c) 2018-2019 Cisco Systems, Inc.
> > +
> > +======================
> > +Memif Poll Mode Driver
> > +======================
> > +Shared memory packet interface (memif) PMD allows for DPDK and any
> > +other client using memif (DPDK, VPP, libmemif) to communicate using
> > +shared memory. Memif is Linux only.
> > +
> > +The created device transmits packets in a raw format. It can be used
> > +with Ethernet mode, IP mode, or Punt/Inject. At this moment, only
> > +Ethernet mode is supported in DPDK memif implementation.
> > +
> > +Memif works in two roles: master and slave. Slave connects to master
> > +over an existing socket. It is also the producer of the shared memory
> > +file and initializes the shared memory. Master creates the socket and
> > +listens for any slave connection requests. The socket may already
> > +exist on the system. Be sure to remove any such sockets, if you are
> > +creating a master interface, or you will see an "Address already in
> > +use" error. Function ``rte_pmd_memif_remove()``,
> 
> Would it be possible to remove this existing socket on successful
> termination of the dpdk application with the 'master' role?
> Otherwise every run of a dpdk app requires deleting the socket file first.
> 
> 
>     net_memif is the same case as net_virtio_user, I'd use the same
>     workaround: testpmd.c:pmd_test_exit(). This, however, is only valid
>     for testpmd.
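Until the driver cleans up after itself, the application-side workaround is straightforward; a minimal sketch (the helper name and the way the socket path is obtained are illustrative, not part of the patch):

    #include <sys/stat.h>
    #include <unistd.h>

    /* Remove a stale listener socket before creating a 'master' interface.
     * Only unlink the path if it actually is a unix socket, to avoid
     * deleting an unrelated file by accident. */
    static void
    cleanup_stale_socket(const char *path)
    {
            struct stat st;

            if (stat(path, &st) == 0 && S_ISSOCK(st.st_mode))
                    unlink(path);
    }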
> 
> 
> > +which removes a memif interface, will also remove the listener socket,
> > +if it is not being used by any other interface.
> > +
> > +The method to enable one or more interfaces is to use the
> > +``--vdev=net_memif0`` option on the DPDK application command line.
> > +Each ``--vdev=net_memif1`` option given will create an interface
> > +named net_memif0, net_memif1, and so on. Memif uses unix domain
> > +socket to transmit control messages. Each memif has a unique id per
> > +socket. If you are connecting multiple interfaces using same socket,
> > +be sure to specify unique ids ``id=0``, ``id=1``, etc. Note that if
> > +you assign a socket to a master interface it becomes a listener
> > +socket. Listener socket can not be used by a slave interface on same client.
> 
> When I connect two slaves, id=0 & id=1, to the master, both the master and
> the second slave crash as soon as the second slave connects. Is this a
> known issue? Can you please check this?
> 
> master: ./build/app/testpmd -w0:0.0 -l20,21 --vdev net_memif0,role=master -- -i
> slave1: ./build/app/testpmd -w0:0.0 -l20,22 --file-prefix=slave --vdev net_memif0,role=slave,id=0 -- -i
> slave2: ./build/app/testpmd -w0:0.0 -l20,23 --file-prefix=slave2 --vdev net_memif0,role=slave,id=1 -- -i
> 
> <...>
> 
> 
>     Each interface can be connected to one peer interface at a time; I'll
>     add more details to the documentation.
> 
> 
> > +Example: testpmd and testpmd
> > +----------------------------
> > +In this example we run two instances of the testpmd application and
> > +transmit packets over memif.
> 
> How does this play with multi-process support? When I run a secondary app
> while the memif PMD is enabled in the primary, the secondary process
> crashes. Can you please check and ensure at least nothing crashes when a
> secondary app runs?
> 
> 
>     For now I'd disable multi-process support, so we can get the patch
>     applied, then provide multi-process support in a separate patch.
> 
> 
> > +
> > +First create ``master`` interface::
> > +
> > +    #./testpmd -l 0-1 --proc-type=primary --file-prefix=pmd1 --vdev=net_memif,role=master -- -i
> > +
> > +Now create ``slave`` interface (master must be already running so the
> > +slave will connect)::
> > +
> > +    #./testpmd -l 2-3 --proc-type=primary --file-prefix=pmd2 --vdev=net_memif -- -i
> > +
> > +Set forwarding mode in one of the instances to 'rx only' and the other
> > +to 'tx only'::
> > +
> > +    testpmd> set fwd rxonly
> > +    testpmd> start
> > +
> > +    testpmd> set fwd txonly
> > +    testpmd> start
> 
> Would it be useful to add a loopback option to the PMD, for testing/debug?
> 
> Also I am getting low performance numbers above, compared to the ring pmd
> for example; is there any performance target for the pmd?
> With the same forwarding core for both testpmds the result is terrible,
> either something is wrong or I am doing something wrong: it is ~40Kpps.
> With different forwarding cores for each testpmd it is still low, ~18Mpps.
> 
> <...>
> 
> 
>     The difference between ring pmd and memif pmd is that while memif
>     transfers packets, ring transfers whole buffers. (Also ring can not
>     be used process to process.) That means, it does not have to
>     alloc/free buffers.
>     I did a simple test where I modified the code so the tx function
>     will only free given buffers and rx allocates new buffers. On my
>     machine, this can only handle 45-50Mpps.
> 
>     With that in mind, I believe that 23Mpps is fine performance. No
>     performance target is defined, the goal is to be as fast as possible.
Use of C11 atomics has proven to provide better performance on weakly ordered architectures (at least on Arm). IMO, C11 atomics should be used at least to implement the fast-path functions. This ensures optimal performance on all architectures supported by DPDK.
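For illustration, a minimal sketch of what that could look like for the ring index accesses (function names are illustrative, not the driver's actual code):

    #include <stdint.h>

    /* Load a ring index published by the peer with acquire semantics, so
     * the descriptor reads that follow are not reordered before it. */
    static inline uint16_t
    ring_index_load(const uint16_t *idx)
    {
            return __atomic_load_n(idx, __ATOMIC_ACQUIRE);
    }

    /* Publish an updated ring index with release semantics, so all
     * descriptor writes are visible to the peer before the new index. */
    static inline void
    ring_index_store(uint16_t *idx, uint16_t val)
    {
            __atomic_store_n(idx, val, __ATOMIC_RELEASE);
    }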

> 
>     The cause of the issue, where traffic drops to 40Kpps while using
>     the same core for both applications, is that within the timeslice
>     given to the testpmd process, the memif driver fills its queues, but
>     keeps polling, not giving the other process a chance to receive the
>     packets. The same applies to the rx side, where an empty queue is
>     polled over and over again. In such a configuration, interrupt
>     rx-mode should be used instead of polling. Or the application could
>     suspend the port.
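For reference, the interrupt rx-mode alternative maps to the generic ethdev Rx interrupt API; a rough sketch of the pattern (assuming the PMD wires up Rx interrupt support; port/queue handling here is illustrative):

    #include <rte_ethdev.h>
    #include <rte_interrupts.h>

    /* Instead of busy-polling an empty queue, arm the Rx interrupt and
     * sleep until the peer signals that packets are available. */
    static void
    wait_for_rx(uint16_t port_id, uint16_t queue_id)
    {
            struct rte_epoll_event event;

            rte_eth_dev_rx_intr_ctl_q(port_id, queue_id, RTE_EPOLL_PER_THREAD,
                                      RTE_INTR_EVENT_ADD, NULL);
            rte_eth_dev_rx_intr_enable(port_id, queue_id);
            /* Block (up to 100 ms) until the queue becomes non-empty. */
            rte_epoll_wait(RTE_EPOLL_PER_THREAD, &event, 1, 100);
            rte_eth_dev_rx_intr_disable(port_id, queue_id);
    }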
> 
> 
> > +/**
> > + * Buffer descriptor.
> > + */
> > +typedef struct __rte_packed {
> > +     uint16_t flags;                         /**< flags */
> > +#define MEMIF_DESC_FLAG_NEXT 1                       /**< is chained buffer */
> 
> Is this define used?
> 
> <...>
> 
> 
>      Yes, see eth_memif_tx() and eth_memif_rx()
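As a side note for readers, a generic illustration of how a chained-buffer flag like this is typically consumed on the rx side; this is not the driver's actual control flow, just the idea behind the 'next_slot' label in the hunk below:

    /* A descriptor whose flags carry MEMIF_DESC_FLAG_NEXT continues the
     * same packet in the following ring slot, so the rx routine keeps
     * consuming slots until the flag is clear. */
    do {
            d0 = &ring->desc[cur_slot & mask];
            /* ... copy d0's data into the mbuf chain ... */
            cur_slot++;
    } while (d0->flags & MEMIF_DESC_FLAG_NEXT);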
> 
> 
> > +     while (n_slots && n_rx_pkts < nb_pkts) {
> > +             mbuf_head = rte_pktmbuf_alloc(mq->mempool);
> > +             if (unlikely(mbuf_head == NULL))
> > +                     goto no_free_bufs;
> > +             mbuf = mbuf_head;
> > +             mbuf->port = mq->in_port;
> > +
> > + next_slot:
> > +             s0 = cur_slot & mask;
> > +             d0 = &ring->desc[s0];
> > +
> > +             src_len = d0->length;
> > +             dst_off = 0;
> > +             src_off = 0;
> > +
> > +             do {
> > +                     dst_len = mbuf_size - dst_off;
> > +                     if (dst_len == 0) {
> > +                             dst_off = 0;
> > +                             dst_len = mbuf_size + RTE_PKTMBUF_HEADROOM;
> > +
> > +                             mbuf = rte_pktmbuf_alloc(mq->mempool);
> > +                             if (unlikely(mbuf == NULL))
> > +                                     goto no_free_bufs;
> > +                             mbuf->port = mq->in_port;
> > +                             rte_pktmbuf_chain(mbuf_head, mbuf);
> > +                     }
> > +                     cp_len = RTE_MIN(dst_len, src_len);
> > +
> > +                     rte_pktmbuf_pkt_len(mbuf) =
> > +                         rte_pktmbuf_data_len(mbuf) += cp_len;
> 
> Can you please make this two lines to prevent confusion?
> 
> Also, shouldn't 'cp_len' be added to 'mbuf_head->pkt_len'?
> 'rte_pktmbuf_chain' updates 'mbuf_head', but at that stage 'mbuf->pkt_len'
> is not set to 'cp_len'...
> 
> <...>
> 
> 
>     Fix in next patch.
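Presumably the fix splits the statement and accounts the chained segment's bytes on the head mbuf; a sketch of the idea, reusing the variables from the hunk above (not the actual next revision):

    /* per-segment length */
    rte_pktmbuf_data_len(mbuf) += cp_len;
    rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf);
    /* the total packet length is tracked on the first segment only */
    if (mbuf != mbuf_head)
            rte_pktmbuf_pkt_len(mbuf_head) += cp_len;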
> 
> 
> > +             if (r->addr == NULL)
> > +                     return;
> > +             munmap(r->addr, r->region_size);
> > +             if (r->fd > 0) {
> > +                     close(r->fd);
> > +                     r->fd = -1;
> > +             }
> > +     }
> > +     rte_free(pmd->regions);
> > +}
> > +
> > +static int
> > +memif_alloc_regions(struct pmd_internals *pmd, uint8_t brn) {
> > +     struct memif_region *r;
> > +     char shm_name[32];
> > +     int i;
> > +     int ret = 0;
> > +
> > +     r = rte_zmalloc("memif_region", sizeof(struct memif_region) * (brn + 1), 0);
> > +     if (r == NULL) {
> > +             MIF_LOG(ERR, "%s: Failed to allocate regions.",
> > +                     rte_vdev_device_name(pmd->vdev));
> > +             return -ENOMEM;
> > +     }
> > +
> > +     pmd->regions = r;
> > +     pmd->regions_num = brn + 1;
> > +
> > +     /*
> > +      * Create shm for every region. Region 0 is reserved for descriptors.
> > +      * Other regions contain buffers.
> > +      */
> > +     for (i = 0; i < (brn + 1); i++) {
> > +             r = &pmd->regions[i];
> > +
> > +             r->buffer_offset = (i == 0) ? (pmd->run.num_s2m_rings +
> > +                                            pmd->run.num_m2s_rings) *
> > +                 (sizeof(memif_ring_t) +
> > +                  sizeof(memif_desc_t) * (1 << pmd->run.log2_ring_size)) : 0;
> 
> What exactly is "buffer_offset"? For 'region 0' it calculates the size of
> 'region 0', otherwise 0. This is an offset starting from where? And it
> seems only used for the size assignment below.
> 
> 
>     It is possible to use a single region for one connection
>     (non-zero-copy) to reduce the number of files opened. It will be
>     implemented in the next patch. 'buffer_offset' is the offset from the
>     start of the shared memory file to the first packet buffer.
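In other words, the rings and their descriptor arrays come first and the packet buffers follow; visualizing the value computed in the hunk above:

    /*
     * Layout behind 'buffer_offset' for region 0:
     *
     *   offset 0 ............ (num_s2m_rings + num_m2s_rings) rings, each
     *                         a memif_ring_t followed by
     *                         (1 << log2_ring_size) memif_desc_t entries
     *   buffer_offset ....... where the first packet buffer starts once
     *                         buffers share the region (the single-region
     *                         layout mentioned above)
     */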
> 
> 
> > +                             (1 << pmd->run.log2_ring_size) *
> > +                             (pmd->run.num_s2m_rings +
> > +                              pmd->run.num_m2s_rings));
> 
> There is an illusion that packet buffers can be in multiple regions (the
> comment above implies it), but this logic assumes all packet buffers are
> in the same region other than region 0, which gives us 'region 1', and
> this is already hardcoded in a few places. Is there a benefit to assuming
> there will be more regions? Would it not be simpler to accept 'region 0'
> for rings and 'region 1' for packet buffers?
> 
> 
>     Multiple regions are required in case of a zero-copy slave. The
>     zero-copy will expose DPDK shared memory, where each
>     memseg/memseg_list will be represented by a memif_region.
> 
> 
> > +int
> > +memif_connect(struct rte_eth_dev *dev) {
> > +     struct pmd_internals *pmd = dev->data->dev_private;
> > +     struct memif_region *mr;
> > +     struct memif_queue *mq;
> > +     int i;
> > +
> > +     for (i = 0; i < pmd->regions_num; i++) {
> > +             mr = pmd->regions + i;
> > +             if (mr != NULL) {
> > +                     if (mr->addr == NULL) {
> > +                             if (mr->fd < 0)
> > +                                     return -1;
> > +                             mr->addr = mmap(NULL, mr->region_size,
> > +                                             PROT_READ | PROT_WRITE,
> > +                                             MAP_SHARED, mr->fd, 0);
> > +                             if (mr->addr == NULL)
> > +                                     return -1;
> > +                     }
> > +             }
> > +     }
> 
> If multiple slaves are not supported, is the 'id' devarg still valid, or is it used at all?
> 
> 
>     'id' is used to identify the peer interface. An interface with id=0
>     will only connect to an interface with id=0, and so on...
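So a matching master/slave pair would look like this (illustrative command lines using the same devargs as earlier in the thread):

    master: ./testpmd ... --vdev net_memif0,role=master,id=3 -- -i
    slave:  ./testpmd ... --vdev net_memif0,role=slave,id=3 -- -i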
> 
> 
> <...>
> 
> > +static int
> > +memif_create(struct rte_vdev_device *vdev, enum memif_role_t role,
> > +          memif_interface_id_t id, uint32_t flags,
> > +          const char *socket_filename,
> > +          memif_log2_ring_size_t log2_ring_size,
> > +          uint16_t buffer_size, const char *secret,
> > +          struct ether_addr *eth_addr) {
> > +     int ret = 0;
> > +     struct rte_eth_dev *eth_dev;
> > +     struct rte_eth_dev_data *data;
> > +     struct pmd_internals *pmd;
> > +     const unsigned int numa_node = vdev->device.numa_node;
> > +     const char *name = rte_vdev_device_name(vdev);
> > +
> > +     if (flags & ETH_MEMIF_FLAG_ZERO_COPY) {
> > +             MIF_LOG(ERR, "Zero-copy not supported.");
> > +             return -1;
> > +     }
> 
> What is the plan for the zero-copy?
> Can you please put the status & plan in the documentation?
> 
> <...>
> 
> 
>     I don't think that putting the status in the docs is needed. The
>     zero-copy slave is in the testing stage and I should be able to
>     submit a patch once we get this one applied.
> 
> 
> > +
> > +     uint16_t last_head;                     /**< last ring head */
> > +     uint16_t last_tail;                     /**< last ring tail */
> 
> The ring already has head and tail variables; what is the difference with the queue ones?

