[dpdk-dev] [PATCH v12 0/6] add Tx preparation
    Ananyev, Konstantin 
    konstantin.ananyev at intel.com
       
    Fri Dec  2 02:00:55 CET 2016
    
    
  
Hi Adrien,
> 
> Hi Konstantin,
> 
> On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> [...]
> > > Something is definitely needed here, and only PMDs can provide it. I think
> > > applications should not have to clear checksum fields or initialize them to
> > > some magic value, same goes for any other offload or hardware limitation
> > > that needs to be worked around.
> > >
> > > tx_prep() is one possible answer to this issue, however as mentioned in the
> > > original patch it can be very expensive if exposed by the PMD.
> > >
> > > Another issue I'm more concerned about is the way limitations are managed
> > > (struct rte_eth_desc_lim). While not officially tied to tx_prep(), this
> > > structure contains new fields that are only relevant to a few devices, and I
> > > fear it will keep growing with each new hardware quirk to manage, breaking
> > > ABIs in the process.
> >
> > Well, if some new HW capability/limitation would arise and we'd like to support
> > it in DPDK, then yes we probably would need to think how to incorporate it here.
> > Do you have anything particular in mind here?
> 
> Nothing in particular, so for the sake of the argument, let's suppose that I
> would like to add a field to expose some limitation that only applies to my
> PMD during TX but looks generic enough to make sense, e.g. maximum packet
> size when VLAN tagging is requested.
Hmm, I didn't hear about such limitations so far, but if it is real case -
sure, feel free to submit the patch.   
> PMDs are free to set that field to some
> special value (say, 0) if they do not care.
> 
> Since that field exists however, conscious applications should check its
> value for each packet that needs to be transmitted. This extra code causes a
> slowdown just by sitting in the data path. Since it is not the only field in
> that structure, the performance impact can be significant.
> 
> Even though this code is inside applications, it remains unfair to PMDs for
> which these tests are irrelevant. This problem is identified and addressed
> by tx_prepare().
I suppose the question is why do we need:
uint16_t nb_seg_max;
uint16_t nb_mtu_seg_max;
as we now have tx_prepare(), right?
For two reasons:
1. Some people might feel that tx_prepare() is not good (smart/fast) enough
for them and would prefer to do necessary preparations for TX offloads themselves.
2. Even if people do use tx_prepare() they still should take this information into accout.
As an example ixgbe can't TX packets with then 40 segments.
Obviously ixbge_tx_prep() performs that check and returns an error.
But it wouldn't try to merge/reallocate mbufs for you.
User still has to do it himself, or just prevent creating such long chains somehow.
> 
> Thanks to tx_prepare(), these checks are moved back into PMDs where they
> belong. PMDs that do not need them do not have to provide support for
> tx_prepare() and do not suffer any performance impact as result;
> applications only have to make sure tx_prepare() is always called at some
> point before tx_burst().
> 
> Once you reach this stage, you've effectively made tx_prepare() mandatory
> before tx_burst(). If some bug occurs, then perhaps you forgot to call
> tx_prepare(), you just need to add it. The total cost for doing TX is
> therefore tx_prepare() + tx_burst().
> 
> I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare() will
> remain optional for long. Sure, PMDs that do not implement it do not care,
> I'm focusing on applications, for which the performance impact of calling
> tx_prepare() followed by tx_burst() is higher than a single tx_burst()
> performing all the necessary preparation at once.
> 
> [...]
> > > Following the same logic, why can't such a thing be made part of the TX
> > > burst function as well (through a direct call to rte_phdr_cksum_fix()
> > > whenever necessary). From an application standpoint, what are the advantages
> > > of having to:
> > >
> > >  if (tx_prep()) // iterate and update mbufs as needed
> > >      tx_burst(); // iterate and send
> > >
> > > Compared to:
> > >
> > >  tx_burst(); // iterate, update as needed and send
> >
> > I think that was discussed extensively quite a lot previously here:
> > As Thomas already replied - main motivation is to allow user
> > to execute them on different stages of packet TX pipeline,
> > and probably on different cores.
> > I think that provides better flexibility to the user to when/where
> > do these preparations and hopefully would lead to better performance.
> 
> And I agree, I think this use case is valid but does not warrant such a high
> penalty when your application does not need that much flexibility. Simple
> (yet conscious) applications need the highest performance. Complex ones as
> you described already suffer quite a bit from IPCs and won't mind a couple
> of extra CPU cycles right?
It would mean an extra cache-miss for every packet, so I think performance hit
would be quite significant. 
About the 'simple' case when tx_prep() and tx_burst() are called on the same core,
Why do you believe that:
tx_prep(); tx_burst(); would be much slower than tx_burst() {tx_prep(), ...}?
tx_prep() itself is quite expensive, let say for Intel HW it includes:
- read mbuf fileds (2 cache-lines),
- read packet header (1/2 cache-lines)
- calculate pseudo-header csum
 - update packet header 
Comparing to that price of extra function call seems neglectable
(if we TX packets in bursts of course). 
> 
> Yes they will, therefore we need a method that satisfies both cases.
> 
> As a possible solution, a special mbuf flag could be added to each mbuf
> having gone through tx_prepare(). That way, tx_burst() could skip some
> checks and things it would otherwise have done.
That's an interesting idea, but it has one drawback:
As I understand, it means that from now on if user doing preparations on his own,
he had to setup this flag, otherwise tx_burst() would do extra unnecessary work.
So any existing applications that using TX offloads and do preparation by themselves
would have to be modified to avoid performance loss.
> 
> Another possibility, telling the PMD first that you always intend to use
> tx_prepare() and getting a simpler/faster tx_burst() callback as a result.
That what we have right now (at least for Intel HW):  
it is a user responsibility to do the necessary preparations/checks before calling tx_burst().  
With tx_prepare() we just remove from user the headache to implement tx_prepare() on his own.
Now he can use a 'proper' PMD provided function.
My vote still would be for that model.
Konstantin
    
    
More information about the dev
mailing list