[dpdk-dev] [PATCH] dmadev: introduce DMA device library

Bruce Richardson bruce.richardson at intel.com
Fri Jul 9 11:14:35 CEST 2021


On Fri, Jul 09, 2021 at 12:05:40AM +0530, Jerin Jacob wrote:
> On Thu, Jul 8, 2021 at 8:41 AM fengchengwen <fengchengwen at huawei.com> wrote:
> >
> 
> > >>>
> > >>> It's just more conditionals and branches all through the code. Inside the
> > >>> user application, the user has to check whether to set the flag or not (or
> > >>> special-case the last transaction outside the loop), and within the driver,
> > >>> there has to be a branch whether or not to call the doorbell function. The
> > >>> code on both sides is far simpler and more readable if the doorbell
> > >>> function is exactly that - a separate function.
> > >>
> > >> I disagree. The reason is:
> > >>
> > >> We will have two classes of applications
> > >>
> > >> a) do a DMA copy request as and when it has data (I think this is the
> > >> prime use case). For those, I think it is considerable overhead to have
> > >> two function invocations per transfer, i.e.
> > >> rte_dma_copy() and rte_dma_perform()
> > >>
> > >> b) do a DMA copy when the data has reached a logical state, like copying
> > >> IP frames from Ethernet packets or so.
> > >> In that case, the application will have LOGIC to detect when to
> > >> perform it, so at the end of that the rte_dma_copy() flag can be
> > >> updated to fire the doorbell.
> > >>
> > >> IMO, We are comparing against a branch(flag is already in register) vs
> > >> a set of instructions for
> > >> 1) function pointer overhead
> > >> 2) Need to use the channel context again back in another function.
> > >>
> > >> IMO, a single branch is most optimal from performance PoV.
> > >>
> > > Ok, let's try it and see how it goes.
> >
> > Test results show:
> > 1) The Kunpeng platform (ARMv8) benefits very little from the doorbell in flags
> > 2) The Xeon E5-2690 v2 (x86) benefits from the separate function
> > 3) Both platforms benefit from the doorbell in flags if burst < 5
> >
> > There is a performance gain in small bursts (<5). Given the extensive use of bursts
> > in DPDK applications and users are accustomed to the concept, I do not recommend
> > using the 'doorbell' in flags.
> 
> There is NO concept change between one option and the other. Only the
> arguments are different.
> Also, the _perform() scheme is not used anywhere in DPDK. I
> 
> Regarding performance, I have added dummy instructions to simulate the real
> workload[1]; now burst also has some gain on both x86 and arm64[3].
> 
> I have modified your application[2] into a DPDK test application to use
> CPU isolation etc.
> So there is a gain with the flag scheme, and the code is checked in to GitHub[2].
> 
<snip>

The benchmark numbers all seem very close between the two schemes. My team
has test ioat and idxd drivers more or less ported internally to the last
dmadev draft library, and sample apps handling traffic using those. I'll
therefore attempt to get these numbers with real traffic on real drivers,
to double-check that they match these microbenchmarks.

Assuming that perf is the same, how do we resolve this? Some thoughts:
* As I understand it, the main objection to the separate doorbell function
  is the use of 8 bytes in a fastpath slot. Therefore I will also attempt to
  benchmark having the doorbell function not on the same cacheline and check
  the perf impact, if any.
* If there is no impact on perf from having the doorbell function inside
  the regular "ops" rather than on the fastpath cacheline, there is no reason
  we can't implement both schemes. The user can then choose themselves
  whether to doorbell using a flag on the last item, or to doorbell
  explicitly using a function call.
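
To make the layout question concrete, here is a rough sketch of what "doorbell
in the regular ops rather than on the fastpath cacheline" could look like. All
names (mock_dev, mock_ops, mock_doorbell, MOCK_CACHE_LINE) are made up for
illustration and are not the dmadev draft structures; a 64-byte cache line is
assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_CACHE_LINE 64 /* assumed cache line size */

struct mock_dev;
typedef int (*mock_copy_fn)(struct mock_dev *dev, uint32_t flags);
typedef int (*mock_doorbell_fn)(struct mock_dev *dev);

/* Hypothetical "regular ops" table, kept off the fastpath cache line. */
struct mock_ops {
    mock_doorbell_fn doorbell;
};

struct mock_dev {
    /* Hot fields: share the first cache line. */
    _Alignas(MOCK_CACHE_LINE) mock_copy_fn copy;
    void *chan_ctx;
    /* Colder fields start on the next cache line. */
    _Alignas(MOCK_CACHE_LINE) const struct mock_ops *ops;
    uint32_t doorbells; /* bookkeeping for this sketch only */
};

/* The doorbell pointer costs no slot in the first cache line. */
_Static_assert(offsetof(struct mock_dev, ops) >= MOCK_CACHE_LINE,
               "ops pointer must not share the fastpath cache line");

/* Stand-in for a driver doorbell; a real one would write a device register. */
static int mock_hw_doorbell(struct mock_dev *dev)
{
    dev->doorbells++;
    return 0;
}

static const struct mock_ops mock_default_ops = { .doorbell = mock_hw_doorbell };

/* Explicit doorbell: one extra load (dev->ops) before the indirect call. */
static inline int mock_doorbell(struct mock_dev *dev)
{
    return dev->ops->doorbell(dev);
}
```

The trade-off being benchmarked is the extra dev->ops load on each explicit
doorbell against the 8 bytes saved on the hot line; the sketch says nothing
about which one wins on real hardware.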

Of the two schemes, and assuming they are equal, I do have a preference for
the separate function one, primarily from a code readability point of view.
Other than that, I have no strong opinions.
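
For reference, the readability difference I mean looks roughly like the
following from the application side. This is only a sketch against a mock
channel: mock_dma_copy(), mock_dma_perform() and MOCK_FLAG_DOORBELL are
hypothetical stand-ins, not the signatures from the draft.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_FLAG_DOORBELL (1u << 0) /* hypothetical flag bit */

/* Mock channel: counts work instead of touching hardware. */
struct mock_dma_chan {
    uint32_t enqueued;  /* descriptors written */
    uint32_t submitted; /* descriptors made visible by a doorbell */
    uint32_t doorbells; /* number of doorbell rings */
};

/* Separate doorbell function: submit everything enqueued so far. */
static inline void mock_dma_perform(struct mock_dma_chan *c)
{
    c->submitted = c->enqueued;
    c->doorbells++;
}

/* Enqueue one copy; with the flag set, also ring the doorbell. */
static inline void mock_dma_copy(struct mock_dma_chan *c, uint32_t flags)
{
    c->enqueued++;
    if (flags & MOCK_FLAG_DOORBELL) /* the branch inside the driver */
        mock_dma_perform(c);
}

/* Scheme 1: doorbell carried as a flag on the last item of the burst. */
static void burst_with_flag(struct mock_dma_chan *c, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        /* The application has to special-case the last transaction. */
        uint32_t flags = (i == n - 1) ? MOCK_FLAG_DOORBELL : 0;
        mock_dma_copy(c, flags);
    }
}

/* Scheme 2: plain enqueue loop, then one explicit doorbell call. */
static void burst_with_call(struct mock_dma_chan *c, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        mock_dma_copy(c, 0);
    mock_dma_perform(c);
}
```

Both functions end up with the same number of submitted descriptors and a
single doorbell for a non-empty burst; the difference is only where the
conditional lives.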

/Bruce

