[dpdk-dev] Making space in mbuf data-structure
    Richardson, Bruce 
    bruce.richardson at intel.com
       
    Tue Jul 15 20:07:06 CEST 2014
    
    
  
> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, July 15, 2014 2:31 AM
> To: Richardson, Bruce; dev at dpdk.org
> Subject: RE: Making space in mbuf data-structure
> 
> Hi Bruce,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Richardson, Bruce
> > Sent: Friday, July 04, 2014 12:39 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] Making space in mbuf data-structure
> >
> > Hi all,
> >
> > At this stage it's been well recognised that we need to make more space in the
> mbuf data structure for new fields. We in Intel have
> > had some discussion on this and this email is to outline what our current
> thinking and approach on this issue is, and look for additional
> > suggestions and feedback.
> >
> > Firstly, we believe that there is no possible way that we can ever fit all the
> fields we need to fit into a 64-byte mbuf, and so we need to
> > start looking at a 128-byte mbuf instead. Examples of new fields that need to
> fit in, include - 32-64 bits for additional offload flags, 32-
> > bits more for filter information for support for the new filters in the i40e
> driver, an additional 2-4 bytes for storing info on a second
> > vlan tag, 4-bytes for storing a sequence number to enable packet reordering in
> future, as well as potentially a number of other fields
> > or splitting out fields that are superimposed over each other right now, e.g. for
> the qos scheduler. We also want to allow space for use
> > by other non-Intel NIC drivers that may be open-sourced to dpdk.org in the
> future too, where they support fields and offloads that
> > our hardware doesn't.
> >
> > If we accept the fact of a 2-cache-line mbuf, then the issue becomes how to
> rework things so that we spread our fields over the two
> > cache lines while causing the lowest slow-down possible. The general
> approach that we are looking to take is to focus the first cache
> > line on fields that are updated on RX , so that receive only deals with one
> cache line. The second cache line can be used for application
> > data and information that will only be used on the TX leg. This would allow us
> to work on the first cache line in RX as now, and have the
> > second cache line being prefetched in the background so that it is available
> when necessary. Hardware prefetches should help us out
> > here. We also may move rarely used, or slow-path RX fields e.g. such as those
> for chained mbufs with jumbo frames, to the second
> > cache line, depending upon the performance impact and bytes savings
> achieved.
> >
> > With this approach, we still need to make space in the first cache line for
> information for the new or expanded receive offloads. First
> > off the blocks is to look at moving the mempool pointer into the second cache
> line. This will free-up 8 bytes in cache line  0, with a field
> > that is only used when cleaning up after TX. A prototype patch for this change
> is given below, for your review and comment. Initial
> > quick tests with testpmd (testpmd -c 600 -n 4 -- --mbcache=250 --
> txqflags=0xf01 --burst=32 --txrst=32 --txfreet=32 --rxfreet=32 --
> > total-num-mbufs=65536), and l3fwd (l3fwd -c 400 -n 4 -- -p F -P --
> config="(0,0,10),(1,0,10),(2,0,10),(3,0,10)") showed only a slight
> > performance decrease with testpmd and equal or slightly better performance
> with l3fwd. These would both have been using the
> > vector PMD - I haven't looked to change the fallback TX implementation yet
> for this change, since it's not as performance optimized
> > and hence cycle-count sensitive.
> 
> Regarding code changes itself:
> If I understand your code changes correctly:
> ixgbe_tx_free_bufs() will use rte_mempool_put_bulk() *only* if all mbufs in the
> bulk belong to the same mempool.
> If we have let say 2 groups of mbufs from 2 different mempools - it wouldn't be
> able to use put_bulk.
> While, as I remember,  current implementation is able to use put_bulk() in such
> case.
> I think, it is possible to change you code to do a better grouping.
Given two sets of mbufs from two separate pool this proposed change will use put bulk for the first group encountered, but not the second. We can look at enhancing it to work with different groups, but this version works well where all or most mbufs in a TX queue come from a common queue. In the case of having every second packet or every third packet coming from a different mempool (for the two or three mempool case), this version should also perform better as it doesn't rely on having long runs of packets from the same mempool.
However, this is very much WIP, so we'll see if we can come up with a better solution if we think one is necessary. (Most apps seem to use just one mempool per numa socket at most, though, so I don't think we should over-optimise for multiple mempools).
> 
> >
> > Beyond this change, I'm also investigating potentially moving the "next"
> pointer to the second cache line, but it's looking harder to
> > move without serious impact, so we'll have to see there what can be done.
> Other fields I'll examine in due course too, and send on
> > any findings I may have.
> 
> Could you explain it a bit?
> I always had an impression that moving next pointer to the 2-nd cache line
> would have less impact then moving mempool pointer.
> My thinking was: next is used only in scatter versions of RX/TX which are slower
> than optimised RX/TX anyway.
> So few extra cycles wouldn't be that noticeable.
> Though I admit I never did any measurements for that case.
I was busy with a number of things last week, so work on moving the next pointer slowed somewhat, but I think I have something reasonable now. The trouble with moving the next pointer was that it is being explicitly set to NULL in all our RX code paths, and the TX code path would read but not clear it. Once I examined all our TX and RX code paths - of which there are more than necessary, I feel, I was able to change things so that the NULL pointer is set on free rather than on allocation. The final issue was then to ensure that the rx_scattered_pkts function didn't suffer undue slowdown due to the second cache line, and after some experimentation, an extra prefetch seems to have helped here.
At this point, I think I have a prototype with both pool and next pointers on a second cache line where both vector rx/tx and scalar (full-featured) code paths are performing well, with sub 5% slowdown. However, more testing is needed with a wider range of apps and packet types to confirm this.
> 
> About space savings for the first cache line:
> - I still think that Oliver's suggestion of replacing data pointer (8 bytes) with data
> offset (2 bytes) makes a lot of sense.
Yes, I would tend to agree. I'm going to take this and run additional testing on this change and see how it fits in with everything else. Some concerns were expressed about potential performance impacts of such a change, so I'm going to try and follow those up so that we can get a data-driven answer to this. If no performance impact can be seen (and none has been observed so far, to the best of my knowledge), I'm all in favour of the change.
Also Olivier's change to merge pktmbuf into the main mbuf itself (and eliminate ctrlmbuf) is something that I think we want to keep. It allows us much greater freedom to move fields about to fit things in well, with good alignment.
Regards,
/Bruce
    
    
More information about the dev
mailing list