[dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
Ferruh Yigit
ferruh.yigit at intel.com
Wed Apr 3 18:29:25 CEST 2019
On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
> From: Kiran Kumar K <kirankumark at marvell.com>
>
> With current KNI implementation kernel module will work only in
> IOVA=PA mode. This patch will add support for kernel module to work
> with IOVA=VA mode.
Thanks Kiran for removing the limitation, I have a few questions, can you please
help me understand.
And when this patch is ready, the restriction in 'linux/eal/eal.c', in
'rte_eal_init' should be removed, perhaps with this patch. I assume you already
doing it to be able to test this patch.
>
> The idea is to maintain a mapping in KNI module between user pages and
> kernel pages and in fast path perform a lookup in this table and get
> the kernel virtual address for corresponding user virtual address.
>
> In IOVA=VA mode, the memory allocated to the pool is physically
> and virtually contiguous. We will take advantage of this and create a
> mapping in the kernel.In kernel we need mapping for queues
> (tx_q, rx_q,... slow path) and mbuf memory (fast path).
Is it?
As far as I know mempool can have multiple chunks and they can be both virtually
and physically separated.
And even for a single chunk, that will be virtually continuous, but will it be
physically continuous?
>
> At the KNI init time, in slow path we will create a mapping for the
> queues and mbuf using get_user_pages similar to af_xdp. Using pool
> memory base address, we will create a page map table for the mbuf,
> which we will use in the fast path for kernel page translation.
>
> At KNI init time, we will pass the base address of the pool and size of
> the pool to kernel. In kernel, using get_user_pages API, we will get
> the pages with size PAGE_SIZE and store the mapping and start address
> of user space in a table.
>
> In fast path for any user address perform PAGE_SHIFT
> (user_addr >> PAGE_SHIFT) and subtract the start address from this value,
> we will get the index of the kernel page with in the page map table.
> Adding offset to this kernel page address, we will get the kernel address
> for this user virtual address.
>
> For example user pool base address is X, and size is S that we passed to
> kernel. In kernel we will create a mapping for this using get_user_pages.
> Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2) ....]
> and user start page will be U (we will get it from X >> PAGE_SHIFT).
>
> For any user address Z we will get the index of the page map table using
> ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to this
> address will give kernel virtual address.
>
> Signed-off-by: Kiran Kumar K <kirankumark at marvell.com>
<...>
> +int
> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
> +{
> + unsigned int gup_flags = FOLL_WRITE;
> + long npgs;
> + int err;
> +
> + /* Get at least one page */
> + if (size < PAGE_SIZE)
> + size = PAGE_SIZE;
> +
> + /* Compute number of user pages based on page size */
> + mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
> +
> + /* Allocate memory for the pages */
> + mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
> + GFP_KERNEL | __GFP_NOWARN);
> + if (!mem->pgs) {
> + pr_err("%s: -ENOMEM\n", __func__);
> + return -ENOMEM;
> + }
> +
> + down_write(¤t->mm->mmap_sem);
> +
> + /* Get the user pages from the user address*/
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
> + npgs = get_user_pages((u64)address, mem->npgs,
> + gup_flags, &mem->pgs[0], NULL);
> +#else
> + npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
> + gup_flags, 0, &mem->pgs[0], NULL);
> +#endif
> + up_write(¤t->mm->mmap_sem);
This should work even memory is physically not continuous, right? Where exactly
physically continuous requirement is coming from?
<...>
> +
> +/* Get the kernel address from the user address using
> + * page map table. Will be used only in IOVA=VA mode
> + */
> +static inline void*
> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
> +{
> + uint32_t index;
> + /* User page - start user page will give the index
> + * with in the page map table
> + */
> + index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
> +
> + /* Add the offset to the page address */
> + return (kni->va_info.page_map[index].addr +
> + (usr_addr & kni->va_info.page_mask));
> +
> +}
> +
> /* physical address to kernel virtual address */
> static void *
> pa2kva(void *pa)
> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
> return;
>
> for (i = 0; i < num_rx; i++) {
> - kva = pa2kva(kni->pa[i]);
> + if (likely(kni->iova_mode == 1))
> + kva = get_kva((u64)(kni->pa[i]), kni);
kni->pa[] now has iova addresses, for 'get_kva()' to work shouldn't
'va_info.start_page' calculated from 'mempool_memhdr->iova' instead of
'mempool_memhdr->addr'
If this is working I must be missing something but not able to find what it is.
<...>
> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
> kni->group_id = conf->group_id;
> kni->mbuf_size = conf->mbuf_size;
>
> + dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
> + if (dev_info.iova_mode) {
> + struct rte_mempool_memhdr *hdr;
> + uint64_t pool_size = 0;
> +
> + /* In each pool header chunk, we will maintain the
> + * base address of the pool. This chunk is physically and
> + * virtually contiguous.
> + * This approach will work, only if the allocated pool
> + * memory is contiguous, else it won't work
> + */
> + hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
> + dev_info.mbuf_va = (void *)(hdr->addr);
> +
> + /* Traverse the list and get the total size of the pool */
> + STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
> + pool_size += hdr->len;
> + }
This code is aware that there may be multiple chunks, but assumes they are all
continuous, I don't know if this assumption is correct.
Also I guess there is another assumption that there will be single pktmbuf_pool
in the application which passed into kni?
What if there are multiple pktmbuf_pool, like one for each PMD, will this work?
Now some mbufs in kni Rx fifo will come from different pktmbuf_pool which we
don't know their pages, so won't able to get their kernel virtual address.
More information about the dev
mailing list