[dpdk-dev] [PATCH] dmadev: introduce DMA device library

fengchengwen fengchengwen at huawei.com
Thu Jul 8 05:11:51 CEST 2021


On 2021/7/7 19:01, Bruce Richardson wrote:
> On Wed, Jul 07, 2021 at 04:04:16PM +0530, Jerin Jacob wrote:
>> On Wed, Jul 7, 2021 at 2:05 PM Bruce Richardson
>> <bruce.richardson at intel.com> wrote:
>>>
>>> On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
>>>> On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson
>>>> <bruce.richardson at intel.com> wrote:
>>>>>
>>>>> On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
>>>>>>
>>>>>> On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson
>>>>>> <bruce.richardson at intel.com> wrote:
>>>>>>>
>>>>>>> On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
>>>>>>>> On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng <fengchengwen at huawei.com> wrote:
>>>>>>>>>
>>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
>>>>>>>>> device.
>>>>> <snip>
>>>>>>>
>>>>>>> +1 and the terminology with regards to queues and channels. With our ioat
>>>>>>> hardware, each HW queue was called a channel for instance.
>>>>>>
>>>>>> Looks like <dmadev> <> <channel> can cover all the use cases; if the
>>>>>> HW has more than
>>>>>> one queue, each can be exposed as a separate dmadev.
>>>>>>
>>>>>
>>>>> Fine for me.
>>>>>
>>>>> However, just to confirm that Morten's suggestion of using a
>>>>> (device-specific void *) channel pointer rather than dev_id + channel_id
>>>>> pair of parameters won't work for you? You can't store a pointer or dev
>>>>> index in the channel struct in the driver?
>>>>
>>>> Yes, that will work. To confirm, the suggestion is to use a void *
>>>> object instead of channel_id;
>>>> that will avoid one more indirection (index -> pointer).
>>>>
>>>
>>> The proposal was to use it in place of "dev_id + channel_id", i.e.
>>>
>>> copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)
>>>
>>> Where the channel pointer implicitly indicates the device too. However, I
>>> realise now that this would be something completely transparent to the
>>> driver as it would all have to be implemented in the dmadev level, and
>>> lead to lots of duplication of function pointers, etc. Therefore, let's
>>> just go with original scheme. :-(
>>
>> Yes. Just go with the original scheme.
>>
> +1
> 
>>>
>>>>
>>>>>
>>>>>>
>>>
>>> <snip>
>>>
>>>>>> Got it. In order to save space in the first cache line for the fast path (saving 8B
>>>>>> for the pointer) and to avoid
>>>>>> the function overhead, can we use one bit of the op function's flags to
>>>>>> enable the fence?
>>>>>>
>>>>>
>>>>> The original ioat implementation did exactly that. However, I then
>>>>> discovered that because a fence logically belongs between two operations,
>>>>> does the fence flag on an operation mean "don't do any jobs after this
>>>>> until this job has completed" or does it mean "don't start this job until
>>>>> all previous jobs have completed". [Or theoretically does it mean both :-)]
>>>>> Naturally, some hardware does it the former way (i.e. fence flag goes on
>>>>> last op before fence), while other hardware the latter way (i.e. fence flag
>>>>> goes on first op after the fence). Therefore, since fencing is about
>>>>> ordering *between* two (sets of) jobs, I decided that it should do exactly
>>>>> that and go between two jobs, so there is no ambiguity!
>>>>>
>>>>> However, I'm happy enough to switch to having a fence flag, but I think if
>>>>> we do that, it should be put in the "first job after fence" case, because
>>>>> it is always easier to modify a previously written job if we need to, than
>>>>> to save the flag for a future one.
>>>>>
>>>>> Alternatively, if we keep the fence as a separate function, I'm happy
>>>>> enough for it not to be on the same cacheline as the "hot" operations,
>>>>> since fencing will always introduce a small penalty anyway.
>>>>
>>>> Ack.
>>>> You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE( If
>>>> there any use case for this or it makes sense for your HW)
>>>>
>>>>
>>>> For us, the fence is a NOP as we have an implicit fence between each
>>>> HW job descriptor.
>>>>
>>>
>>> I actually still think that having a separate fence function in the "ops"
>>> section is the best approach here. It's unambiguous as to whether it's
>>> fence-before or fence-after, and if we have it in the ops, it doesn't use a
>>> "fast-path" slot.
>>>
>>> However, if we *really* want to use a flag instead, I don't see the value
>>> in having two flags, it will be really confusing.  Instead, if we do go
>>> with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating
>>> that the fence occurs before the job in question.
>>
>> IMO, we need to use a flag, and the name can be RTE_DMA_PRE_FENCE,
>> due to the overhead of the driver implementation where the fence request
>> can be a NOP, and
>> to save first cache line occupancy.
>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>> <snip>
>>>>>>>> Since we have additional function call overhead in all the
>>>>>>>> applications for this scheme, I would like to understand
>>>>>>>> the use of doing this way vs enq does the doorbell implicitly from
>>>>>>>> driver/application PoV?
>>>>>>>>
>>>>>>>
>>>>>>> In our benchmarks it's just faster. When we tested it, the overhead of the
>>>>>>> function calls was noticeably less than the cost of building up the
>>>>>>> parameter array(s) for passing the jobs in as a burst. [We don't see this
>>>>>>> cost with things like NIC I/O since DPDK tends to already have the mbuf
>>>>>>> fully populated before the TX call anyway.]
>>>>>>
>>>>>> OK. I agree with stack population.
>>>>>>
>>>>>> My question was more about doing an implicit doorbell update on enqueue. Is the doorbell write
>>>>>> costly in other HW compared to a function call? In our HW, it is just a write of
>>>>>> the number of instructions submitted to a register.
>>>>>>
>>>>>> Also, we need to again access the internal PMD memory structure to find
>>>>>> where to write etc if it is a separate function.
>>>>>>
>>>>>
>>>>> The cost varies depending on a number of factors - even writing to a single
>>>>> HW register can be very slow if that register is mapped as device
>>>>> (uncacheable) memory, since (AFAIK) it will act as a full fence and wait
>>>>
>>>> I don't know. At least in our case, writes are write-back, so the core does not need
>>>> to wait (if there is no read operation).
>>>>
>>>>> for the write to go all the way to hardware. For more modern HW, the cost
>>>>> can be lighter. However, any cost of HW writes is going to be the same
>>>>> whether its a separate function call or not.
>>>>>
>>>>> However, the main thing about the doorbell update is that it's a
>>>>> once-per-burst thing, rather than a once-per-job. Therefore, even if you
>>>>> have to re-read the struct memory (which is likely still somewhere in your
>>>>> cores' cache), any extra small cost of doing so is to be amortized over the
>>>>> cost of a whole burst of copies.
>>>>
>>>> The Linux kernel has the xmit_more flag in the skb to address a similar thing,
>>>> i.e. the enq job flags can have one more bit field to say whether to update the ring doorbell or not,
>>>> rather than having yet another function call overhead. IMO, it is the best of both worlds.
>>>>
>>>
>>> It's just more conditionals and branches all through the code. Inside the
>>> user application, the user has to check whether to set the flag or not (or
>>> special-case the last transaction outside the loop), and within the driver,
>>> there has to be a branch whether or not to call the doorbell function. The
>>> code on both sides is far simpler and more readable if the doorbell
>>> function is exactly that - a separate function.
>>
>> I disagree. The reason is:
>>
>> We will have two classes of applications
>>
>> a) issue a dma copy request as and when it has data (I think this is the
>> prime use case); for those,
>> I think it is considerable overhead to have two function invocations
>> per transfer, i.e.
>> rte_dma_copy() and rte_dma_perform().
>>
>> b) do the dma copy when the data has reached a logical state, like copying an
>> IP frame from Ethernet packets or so.
>> In that case, the application will have logic to detect when to
>> perform it, so at the end of
>> that, the rte_dma_copy() flags can be updated to fire the doorbell.
>>
>> IMO, we are comparing a branch (the flag is already in a register) against
>> a set of instructions for
>> 1) the function pointer call overhead
>> 2) the need to access the channel context again in another function.
>>
>> IMO, a single branch is most optimal from performance PoV.
>>
> Ok, let's try it and see how it goes.

Test results show:
1) The Kunpeng platform (ARMv8) benefits very little from the doorbell-in-flags scheme.
2) The Xeon E5-2690 v2 (x86) platform benefits from the separate perform function.
3) Both platforms benefit from the doorbell-in-flags scheme only when burst < 5.

The performance gain only appears for small bursts (< 5). Given the extensive use of
bursts in DPDK applications, and that users are already accustomed to the concept, I do
not recommend putting the 'doorbell' in the flags.
It may also confuse users about when the doorbell operation actually happens.
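
For reference, the three cases measured below correspond roughly to the following
application-side patterns. The names used here (rte_dmadev_copy()/rte_dmadev_perform()
and an RTE_DMA_FLAG_DOORBELL bit) are only placeholders to illustrate the comparison,
not the final API:

/* Illustrative sketch only: placeholder names, not the proposed dmadev API. */

/* 1) perform_after_multiple_enqueue: separate doorbell function. */
for (i = 0; i < burst; i++)
        rte_dmadev_copy(dev_id, vchan, src[i], dst[i], len[i], 0);
rte_dmadev_perform(dev_id, vchan); /* one doorbell per burst */

/* 2) doorbell_for_every_enqueue: doorbell flag set on every job. */
for (i = 0; i < burst; i++)
        rte_dmadev_copy(dev_id, vchan, src[i], dst[i], len[i], RTE_DMA_FLAG_DOORBELL);

/* 3) last_enqueue_issue_doorbell: doorbell flag only on the last job. */
for (i = 0; i < burst - 1; i++)
        rte_dmadev_copy(dev_id, vchan, src[i], dst[i], len[i], 0);
rte_dmadev_copy(dev_id, vchan, src[i], dst[i], len[i], RTE_DMA_FLAG_DOORBELL);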

Kunpeng platform test result:
    [root at SZ tmp]# ./a1 1
    burst = 1
    perform_after_multiple_enqueue: burst:1 cost:0s.554422us
    doorbell_for_every_enqueue: burst:1 cost:0s.450927us
    last_enqueue_issue_doorbell: burst:1 cost:0s.450479us
    [root at SZ tmp]#
    [root at SZ tmp]# ./a1 2
    burst = 2
    perform_after_multiple_enqueue: burst:2 cost:0s.900884us
    doorbell_for_every_enqueue: burst:2 cost:0s.866732us
    last_enqueue_issue_doorbell: burst:2 cost:0s.732469us
    [root at SZ tmp]# ./a1 5
    burst = 5
    perform_after_multiple_enqueue: burst:5 cost:1s.732410us
    doorbell_for_every_enqueue: burst:5 cost:2s.115479us
    last_enqueue_issue_doorbell: burst:5 cost:1s.759349us
    [root at SZ tmp]# ./a1 10
    burst = 10
    perform_after_multiple_enqueue: burst:10 cost:3s.490716us
    doorbell_for_every_enqueue: burst:10 cost:4s.194691us
    last_enqueue_issue_doorbell: burst:10 cost:3s.331825us
    [root at SZ tmp]# ./a1 30
    burst = 30
    perform_after_multiple_enqueue: burst:30 cost:9s.61761us
    doorbell_for_every_enqueue: burst:30 cost:12s.517082us
    last_enqueue_issue_doorbell: burst:30 cost:9s.614802us

X86 platform test result:
    fengchengwen at SZ:~/tmp$ ./a1 1
    burst = 1
    perform_after_multiple_enqueue: burst:1 cost:0s.406331us
    doorbell_for_every_enqueue: burst:1 cost:0s.331109us
    last_enqueue_issue_doorbell: burst:1 cost:0s.381782us
    fengchengwen at SZ:~/tmp$ ./a1 2
    burst = 2
    perform_after_multiple_enqueue: burst:2 cost:0s.569024us
    doorbell_for_every_enqueue: burst:2 cost:0s.643449us
    last_enqueue_issue_doorbell: burst:2 cost:0s.486639us
    fengchengwen at SZ:~/tmp$ ./a1 5
    burst = 5
    perform_after_multiple_enqueue: burst:5 cost:1s.166384us
    doorbell_for_every_enqueue: burst:5 cost:1s.602369us
    last_enqueue_issue_doorbell: burst:5 cost:1s.209392us
    fengchengwen at SZ:~/tmp$ ./a1 10
    burst = 10
    perform_after_multiple_enqueue: burst:10 cost:2s.229901us
    doorbell_for_every_enqueue: burst:10 cost:3s.754802us
    last_enqueue_issue_doorbell: burst:10 cost:2s.328705us
    fengchengwen at SZ:~/tmp$
    fengchengwen at SZ:~/tmp$ ./a1 30
    burst = 30
    perform_after_multiple_enqueue: burst:30 cost:6s.132817us
    doorbell_for_every_enqueue: burst:30 cost:9s.944619us
    last_enqueue_issue_doorbell: burst:30 cost:7s.73551us


test-code:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>
#include <sys/time.h>

struct dmadev;

unsigned int dev_reg[10240];
volatile unsigned int *ring;
volatile unsigned int *doorbell;

void init_global(void)
{
        ring = &dev_reg[100];
        doorbell = &dev_reg[10000];
}

/* write memory barrier before ringing the doorbell (ARMv8 variant) */
#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
/* alternative definition: compiler barrier only */
//#define rte_wmb() asm volatile ("" : : : "memory")

typedef int (*enqueue_t)(struct dmadev *dev, int vchan, void *src, void *dst, int len, int flags);
typedef void (*perform_t)(struct dmadev *dev, int vchan);

struct dmadev {
        enqueue_t enqueue;
        perform_t perform;
        char rsv[512];
};

int hisi_dma_enqueue(struct dmadev *dev, int vchan, void *src, void *dst, int len, int flags)
{
        *ring = 1;
        return 0;
}

int hisi_dma_enqueue_doorbell(struct dmadev *dev, int vchan, void *src, void *dst, int len, int flags)
{
        *ring = 1;
        if (flags == 1) {
                rte_wmb();
                *doorbell = 1;
        }
        return 0;
}

void hisi_dma_perform(struct dmadev *dev, int vchan)
{
        rte_wmb();
        *doorbell = 1;
}

struct dmadev devlist[64];

void init_devlist(bool enq_doorbell)
{
        int i;
        for (i = 0; i < 64; i++) {
                devlist[i].enqueue = enq_doorbell ? hisi_dma_enqueue_doorbell : hisi_dma_enqueue;
                devlist[i].perform = hisi_dma_perform;
        }
}

static inline int dma_enqueue(int dev_id, int vchan, void *src, void *dst, int len, int flags)
{
        struct dmadev *dev = &devlist[dev_id];
        return dev->enqueue(dev, vchan, src, dst, len, flags);
}

static inline void dma_perform(int dev_id, int vchan)
{
        struct dmadev *dev = &devlist[dev_id];
        dev->perform(dev, vchan);
}

#define MAX_LOOPS       90000000

void test_for_perform_after_multiple_enqueue(int burst)
{
        struct timeval start, end, delta;
        unsigned int i, j;
        init_devlist(false);
        gettimeofday(&start, NULL);
        for (i = 0; i < MAX_LOOPS; i++) {
                for (j = 0; j < burst; j++)
                        (void)dma_enqueue(10, 0, NULL, NULL, 0, 0);
                dma_perform(10, 0);
        }
        gettimeofday(&end, NULL);
        timersub(&end, &start, &delta);
        printf("perform_after_multiple_enqueue: burst:%d cost:%us.%uus \n", burst, delta.tv_sec, delta.tv_usec);
}

void test_for_doorbell_for_every_enqueue(int burst)
{
        struct timeval start, end, delta;
        unsigned int i, j;
        init_devlist(true);
        gettimeofday(&start, NULL);
        for (i = 0; i < MAX_LOOPS; i++) {
                for (j = 0; j < burst; j++)
                        (void)dma_enqueue(10, 0, NULL, NULL, 0, 1);
        }
        gettimeofday(&end, NULL);
        timersub(&end, &start, &delta);
        printf("doorbell_for_every_enqueue: burst:%d cost:%us.%uus \n", burst, delta.tv_sec, delta.tv_usec);
}

void test_for_last_enqueue_issue_doorbell(int burst)
{
        struct timeval start, end, delta;
        unsigned int i, j;
        init_devlist(true);
        gettimeofday(&start, NULL);
        for (i = 0; i < MAX_LOOPS; i++) {
                for (j = 0; j < burst - 1; j++)
                        (void)dma_enqueue(10, 0, NULL, NULL, 0, 0);
                dma_enqueue(10, 0, NULL, NULL, 0, 1);
        }
        gettimeofday(&end, NULL);
        timersub(&end, &start, &delta);
        printf("last_enqueue_issue_doorbell: burst:%d cost:%us.%uus \n", burst, delta.tv_sec, delta.tv_usec);
}

int main(int argc, char *argv[])
{
        if (argc < 2) {
                printf("please input burst parameter!\n");
                return -1;
        }
        init_global();
        int burst = atoi(argv[1]);
        printf("burst = %d \n", burst);
        test_for_perform_after_multiple_enqueue(burst);
        test_for_doorbell_for_every_enqueue(burst);
        test_for_last_enqueue_issue_doorbell(burst);
        return 0;
}

> 
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>> <snip>
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>>>>> + *
>>>>>>>>> + * Returns the number of operations that failed to complete.
>>>>>>>>> + * NOTE: This API was used when rte_dmadev_completed has_error was set.
>>>>>>>>> + *
>>>>>>>>> + * @param dev_id
>>>>>>>>> + *   The identifier of the device.
>>>>>>>>> + * @param vq_id
>>>>>>>>> + *   The identifier of virt queue.
>>>>>>>>> + * @param nb_status
>>>>>>>>> + *   Indicates the size of status array.
>>>>>>>>> + * @param[out] status
>>>>>>>>> + *   The error code of operations that failed to complete.
>>>>>>>>> + * @param[out] cookie
>>>>>>>>> + *   The last failed completed operation's cookie.
>>>>>>>>> + *
>>>>>>>>> + * @return
>>>>>>>>> + *   The number of operations that failed to complete.
>>>>>>>>> + *
>>>>>>>>> + * NOTE: The caller must ensure that the input parameter is valid and the
>>>>>>>>> + *       corresponding device supports the operation.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +static inline uint16_t
>>>>>>>>> +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
>>>>>>>>> +                           const uint16_t nb_status, uint32_t *status,
>>>>>>>>> +                           dma_cookie_t *cookie)
>>>>>>>>
>>>>>>>> IMO, it is better to move cookie/ring_idx to 3. Why would it
>>>>>>>> return an array of errors, since it is called after
>>>>>>>> rte_dmadev_completed() has set has_error? Is it better to change it to
>>>>>>>>
>>>>>>>> rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id,
>>>>>>>> dma_cookie_t *cookie, uint32_t *status)
>>>>>>>>
>>>>>>>> I also think we may need to set status as a bitmask, enumerate
>>>>>>>> all the combinations of error codes across drivers, and return a
>>>>>>>> string from the driver like the existing rte_flow_error.
>>>>>>>>
>>>>>>>> See
>>>>>>>> struct rte_flow_error {
>>>>>>>>         enum rte_flow_error_type type; /**< Cause field and error types. */
>>>>>>>>         const void *cause; /**< Object responsible for the error. */
>>>>>>>>         const char *message; /**< Human-readable error message. */
>>>>>>>> };
>>>>>>>>
>>>>>>>
>>>>>>> I think we need a multi-return value API here, as we may add
>>>>>>> operations in future which have non-error status values to return.
>>>>>>> The obvious case is DMA engines which support "compare" operations.
>>>>>>> In that case a successful compare (as in there were no DMA or HW
>>>>>>> errors) can return "equal" or "not-equal" as statuses. For general
>>>>>>> "copy" operations, the faster completion op can be used to just
>>>>>>> return successful values (and only call this status version on
>>>>>>> error), while apps using those compare ops or a mixture of copy and
>>>>>>> compare ops, would always use the slower one that returns status
>>>>>>> values for each and every op..
>>>>>>>
>>>>>>> The ioat APIs used 32-bit integer values for this status array so
>>>>>>> as to allow e.g. 16-bits for error code and 16-bits for future
>>>>>>> status values. For most operations there should be a fairly small
>>>>>>> set of things that can go wrong, i.e. bad source address, bad
>>>>>>> destination address or invalid length.  Within that we may have a
>>>>>>> couple of specifics for why an address is bad, but even so I don't
>>>>>>> think we need to start having multiple bit combinations.
>>>>>>
>>>>>> OK. What is the purpose of the error status? Is it for the application
>>>>>> to print, or does the application need to take any action based on
>>>>>> specific errors?
>>>>>
>>>>> It's largely for information purposes, but in the case of SVA/SVM
>>>>> errors could occur due to the memory not being pinned, i.e. a page
>>>>> fault, in some cases. If that happens, then it's up the app to either
>>>>> touch the memory and retry the copy, or to do a SW memcpy as a
>>>>> fallback.
>>>>>
>>>>> In other error cases, I think it's good to tell the application if it's
>>>>> passing around bad data, or data that is beyond the scope of hardware,
>>>>> e.g.  a copy that is beyond what can be done in a single transaction
>>>>> for a HW instance. Given that there are always things that can go
>>>>> wrong, I think we need some error reporting mechanism.
>>>>>
>>>>>> If the former is the scope, then we need to define standard enum
>>>>>> values for the errors, right? I.e. uint32_t *status needs to change to
>>>>>> enum rte_dma_error or so.
>>>>>>
>>>>> Sure. Perhaps an error/status structure is another option, where we
>>>>> explicitly separate error info from status info.
>>>>
>>>> Agree. Better to have a structure with fields like,
>>>>
>>>> 1) enum rte_dma_error_type
>>>> 2) memory to store an informative message on finer aspects of the error,
>>>> like the address that caused the issue etc. (which will be
>>>> driver-specific information).
>>>>
>>> The only issue I have with that is that once we have driver specific
>>> information it is of little use to the application, since it can't know
>>> anything about it except maybe log it.  I'd much rather have a set of error
>>> codes telling user that "source address is wrong", "dest address is wrong",
>>> and a generic "an address is wrong" in case driver/HW cannot distinguish
>>> source of error. Can we see how far we get with just error codes before we
>>> start into passing string messages around and all the memory management
>>> headaches that implies.
>>
>> Works for me. It should be "enum rte_dma_error_type" then, which has standard
>> error types. That is missing in the spec now.
>>
> +1
> .
> 
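
As a starting point for that, one possible shape for such a standard error enum could be
something like the sketch below. This is purely illustrative, based on the error cases
discussed above (bad source/destination address, length beyond HW limits, page fault with
SVA/SVM); the names and the set of values are not part of the patch and would need to be
defined in the spec:

/* Illustrative sketch only; actual values to be defined in the dmadev spec. */
enum rte_dma_error_type {
        RTE_DMA_ERR_NONE = 0,     /* operation completed successfully */
        RTE_DMA_ERR_SRC_ADDR,     /* source address is wrong */
        RTE_DMA_ERR_DST_ADDR,     /* destination address is wrong */
        RTE_DMA_ERR_ADDR,         /* an address is wrong, HW cannot tell which */
        RTE_DMA_ERR_LENGTH,       /* length beyond what HW can do in one transaction */
        RTE_DMA_ERR_PAGE_FAULT,   /* page fault, e.g. SVA/SVM memory not pinned */
        RTE_DMA_ERR_UNKNOWN,      /* unspecified hardware error */
};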

