[dpdk-dev] [PATCH] dmadev: introduce DMA device library

Jerin Jacob jerinjacobk at gmail.com
Wed Jul 7 12:34:16 CEST 2021


On Wed, Jul 7, 2021 at 2:05 PM Bruce Richardson
<bruce.richardson at intel.com> wrote:
>
> On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
> > On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson
> > <bruce.richardson at intel.com> wrote:
> > >
> > > On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
> > > >
> > > > On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson
> > > > <bruce.richardson at intel.com> wrote:
> > > > >
> > > > > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > > > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng <fengchengwen at huawei.com> wrote:
> > > > > > >
> > > > > > > This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > > device.
> > > <snip>
> > > > >
> > > > > +1 and the terminology with regards to queues and channels. With our ioat
> > > > > hardware, each HW queue was called a channel for instance.
> > > >
> > > > Looks like <dmadev> <> <channel> can cover all the use cases, if the
> > > > HW has more than
> > > > 1 queues it can be exposed as separate dmadev dev.
> > > >
> > >
> > > Fine for me.
> > >
> > > However, just to confirm that Morten's suggestion of using a
> > > (device-specific void *) channel pointer rather than dev_id + channel_id
> > > pair of parameters won't work for you? You can't store a pointer or dev
> > > index in the channel struct in the driver?
> >
> > Yes. That will work. To confirm, the suggestion is to use, void *
> > object instead of channel_id,
> > That will avoid one more indirection.(index -> pointer)
> >
>
> The proposal was to use it in place of "dev_id + channel_id", i.e.
>
> copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)
>
> Where the channel pointer implicitly indicates the device too. However, I
> realise now that this would be something completely transparent to the
> driver as it would all have to be implemented in the dmadev level, and
> lead to lots of duplication of function pointers, etc. Therefore, let's
> just go with original scheme. :-(

Yes. Just go with the original scheme.

>
> >
> > >
> > > >
>
> <snip>
>
> > > > Got it. In order to save space if first CL size for fastpath(Saving 8B
> > > > for the pointer) and to avoid
> > > > function overhead, Can we use one bit of flags of op function to
> > > > enable the fence?
> > > >
> > >
> > > The original ioat implementation did exactly that. However, I then
> > > discovered that because a fence logically belongs between two operations,
> > > does the fence flag on an operation mean "don't do any jobs after this
> > > until this job has completed" or does it mean "don't start this job until
> > > all previous jobs have completed". [Or theoretically does it mean both :-)]
> > > Naturally, some hardware does it the former way (i.e. fence flag goes on
> > > last op before fence), while other hardware the latter way (i.e. fence flag
> > > goes on first op after the fence). Therefore, since fencing is about
> > > ordering *between* two (sets of) jobs, I decided that it should do exactly
> > > that and go between two jobs, so there is no ambiguity!
> > >
> > > However, I'm happy enough to switch to having a fence flag, but I think if
> > > we do that, it should be put in the "first job after fence" case, because
> > > it is always easier to modify a previously written job if we need to, than
> > > to save the flag for a future one.
> > >
> > > Alternatively, if we keep the fence as a separate function, I'm happy
> > > enough for it not to be on the same cacheline as the "hot" operations,
> > > since fencing will always introduce a small penalty anyway.
> >
> > Ack.
> > You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE( If
> > there any use case for this or it makes sense for your HW)
> >
> >
> > For us, Fence is NOP for us as we have an implicit fence between each
> > HW job descriptor.
> >
>
> I actually still think that having a separate fence function in the "ops"
> section is the best approach here. It's unabiguous as to whether it's
> fence-before or fence-after, and if we have it in the ops, it doesn't use a
> "fast-path" slot.
>
> However, if we *really* want to use a flag instead, I don't see the value
> in having two flags, it will be really confusing.  Instead, if we do go
> with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating
> that the fence occurs before the job in question.

IMO, We need to use flags and the name can be RTE_DMA_PRE_FENCE
due to overhead of the driver implementation where the fence request
can be NOP and
to save the first cache line occupancy.

>
> >
> > >
> > > > >
> > > > > >
> > > <snip>
> > > > > > Since we have additional function call overhead in all the
> > > > > > applications for this scheme, I would like to understand
> > > > > > the use of doing this way vs enq does the doorbell implicitly from
> > > > > > driver/application PoV?
> > > > > >
> > > > >
> > > > > In our benchmarks it's just faster. When we tested it, the overhead of the
> > > > > function calls was noticably less than the cost of building up the
> > > > > parameter array(s) for passing the jobs in as a burst. [We don't see this
> > > > > cost with things like NIC I/O since DPDK tends to already have the mbuf
> > > > > fully populated before the TX call anyway.]
> > > >
> > > > OK. I agree with stack population.
> > > >
> > > > My question was more on doing implicit doorbell update enq. Is doorbell write
> > > > costly in other HW compare to a function call? In our HW, it is just write of
> > > > the number of instructions written in a register.
> > > >
> > > > Also, we need to again access the internal PMD memory structure to find
> > > > where to write etc if it is a separate function.
> > > >
> > >
> > > The cost varies depending on a number of factors - even writing to a single
> > > HW register can be very slow if that register is mapped as device
> > > (uncacheable) memory, since (AFAIK) it will act as a full fence and wait
> >
> > I don't know, At least in our case, writes are write-back. so core does not need
> > to wait.(If there is no read operation).
> >
> > > for the write to go all the way to hardware. For more modern HW, the cost
> > > can be lighter. However, any cost of HW writes is going to be the same
> > > whether its a separate function call or not.
> > >
> > > However, the main thing about the doorbell update is that it's a
> > > once-per-burst thing, rather than a once-per-job. Therefore, even if you
> > > have to re-read the struct memory (which is likely still somewhere in your
> > > cores' cache), any extra small cost of doing so is to be amortized over the
> > > cost of a whole burst of copies.
> >
> > Linux kernel has xmit_more flag in skb to address similar thing.
> > i.e enq job flag can have one more bit field to say update ring bell or not?
> > Rather having yet another function overhead.IMO, it is the best of both worlds.
> >
>
> It's just more conditionals and branches all through the code. Inside the
> user application, the user has to check whether to set the flag or not (or
> special-case the last transaction outside the loop), and within the driver,
> there has to be a branch whether or not to call the doorbell function. The
> code on both sides is far simpler and more readable if the doorbell
> function is exactly that - a separate function.

I disagree. The reason is:

We will have two classes of applications

a) do dma copy request as and when it has data(I think, this is the
prime use case), for those,
I think, it is considerable overhead to have two function invocation
per transfer i.e
rte_dma_copy() and rte_dma_perform()

b) do dma copy when the data is reached to a logical state,  like copy
IP frame from Ethernet packets or so,
In that case, the application will have  a LOGIC to detect when to
perform it so on the end of
that rte_dma_copy() flag can be updated to fire the doorbell.

IMO, We are comparing against a branch(flag is already in register) vs
a set of instructions for
1) function pointer overhead
2) Need to use the channel context again back in another function.

IMO, a single branch is most optimal from performance PoV.


>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > <snip>
> > > > > > > + +/** + * @warning + * @b EXPERIMENTAL: this API may change
> > > > > > > without prior notice.  + * + * Returns the number of operations
> > > > > > > that failed to complete.  + * NOTE: This API was used when
> > > > > > > rte_dmadev_completed has_error was set.  + * + * @param dev_id
> > > > > > > + *   The identifier of the device.  + * @param vq_id + *   The
> > > > > > > identifier of virt queue.
> > > > > > (> + * @param nb_status
> > > > > > > + *   Indicates the size  of status array.  + * @param[out]
> > > > > > > status + *   The error code of operations that failed to
> > > > > > > complete.  + * @param[out] cookie + *   The last failed
> > > > > > > completed operation's cookie.  + * + * @return + *   The number
> > > > > > > of operations that failed to complete.  + * + * NOTE: The
> > > > > > > caller must ensure that the input parameter is valid and the +
> > > > > > > *       corresponding device supports the operation.  + */
> > > > > > > +__rte_experimental +static inline uint16_t
> > > > > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id, +
> > > > > > > const uint16_t nb_status, uint32_t *status, +
> > > > > > > dma_cookie_t *cookie)
> > > > > >
> > > > > > IMO, it is better to move cookie/rind_idx at 3.  Why it would
> > > > > > return any array of errors? since it called after
> > > > > > rte_dmadev_completed() has has_error. Is it better to change
> > > > > >
> > > > > > rte_dmadev_error_status((uint16_t dev_id, uint16_t vq_id,
> > > > > > dma_cookie_t *cookie,  uint32_t *status)
> > > > > >
> > > > > > I also think, we may need to set status as bitmask and enumerate
> > > > > > all the combination of error codes of all the driver and return
> > > > > > string from driver existing rte_flow_error
> > > > > >
> > > > > > See struct rte_flow_error { enum rte_flow_error_type type; /**<
> > > > > > Cause field and error types. */ const void *cause; /**< Object
> > > > > > responsible for the error. */ const char *message; /**<
> > > > > > Human-readable error message. */ };
> > > > > >
> > > > >
> > > > > I think we need a multi-return value API here, as we may add
> > > > > operations in future which have non-error status values to return.
> > > > > The obvious case is DMA engines which support "compare" operations.
> > > > > In that case a successful compare (as in there were no DMA or HW
> > > > > errors) can return "equal" or "not-equal" as statuses. For general
> > > > > "copy" operations, the faster completion op can be used to just
> > > > > return successful values (and only call this status version on
> > > > > error), while apps using those compare ops or a mixture of copy and
> > > > > compare ops, would always use the slower one that returns status
> > > > > values for each and every op..
> > > > >
> > > > > The ioat APIs used 32-bit integer values for this status array so
> > > > > as to allow e.g. 16-bits for error code and 16-bits for future
> > > > > status values. For most operations there should be a fairly small
> > > > > set of things that can go wrong, i.e. bad source address, bad
> > > > > destination address or invalid length.  Within that we may have a
> > > > > couple of specifics for why an address is bad, but even so I don't
> > > > > think we need to start having multiple bit combinations.
> > > >
> > > > OK. What is the purpose of errors status? Is it for application
> > > > printing it or Does the application need to take any action based on
> > > > specific error requests?
> > >
> > > It's largely for information purposes, but in the case of SVA/SVM
> > > errors could occur due to the memory not being pinned, i.e. a page
> > > fault, in some cases. If that happens, then it's up the app to either
> > > touch the memory and retry the copy, or to do a SW memcpy as a
> > > fallback.
> > >
> > > In other error cases, I think it's good to tell the application if it's
> > > passing around bad data, or data that is beyond the scope of hardware,
> > > e.g.  a copy that is beyond what can be done in a single transaction
> > > for a HW instance. Given that there are always things that can go
> > > wrong, I think we need some error reporting mechanism.
> > >
> > > > If the former is scope, then we need to define the standard enum
> > > > value for the error right?  ie. uint32_t *status needs to change to
> > > > enum rte_dma_error or so.
> > > >
> > > Sure. Perhaps an error/status structure either is an option, where we
> > > explicitly call out error info from status info.
> >
> > Agree. Better to have a structure with filed like,
> >
> > 1)  enum rte_dma_error_type 2)  memory to store, informative message on
> > fine aspects of error.  LIke address caused issue etc.(Which will be
> > driver-specific information).
> >
> The only issue I have with that is that once we have driver specific
> information it is of little use to the application, since it can't know
> anything about it excepy maybe log it.  I'd much rather have a set of error
> codes telling user that "source address is wrong", "dest address is wrong",
> and a generic "an address is wrong" in case driver/HW cannot distinguish
> source of error. Can we see how far we get with just error codes before we
> start into passing string messages around and all the memory management
> headaches that implies.

Works for me. It should be "enum rte_dma_error_type" then, which has a standard
error type. Which is missing in the spec now.

>
> /Bruce


More information about the dev mailing list