[dpdk-dev] Making space in mbuf data-structure

Olivier MATZ olivier.matz at 6wind.com
Mon Jul 7 12:19:14 CEST 2014


Hello Bruce,

Thank you for reviving this discussion now that 1.7 is released.

First, I would like to reference my previous patch series that first
reworks the mbuf to gain 9 bytes [1]. The v1 of the patch was discussed
at [2].

Now, let's list what I would find useful to have in this mbuf rework:

- larger size for ol_flags: this is at least needed for TSO, and as
   the field is completely full today, I expect other features to need
   the extra room too.
- add other offload fields: l4_len and mss, required for TSO (see the
   sketch after this list).
- remove ctrl_mbuf: it could be replaced by a packet mbuf. This would
   simplify the mbuf structure and save room in the mbuf.
- a new vlan tag: I suppose this could be useful in use-cases where
   vlans are stacked.
- splitting out superimposed fields: needed if two features can be
   used at the same time.
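
To make the first two items more concrete, here is a rough sketch of
the offload-related fields I have in mind. This is purely illustrative:
the field names, sizes and grouping are my assumptions, not a proposal
for the final layout.

  /* Illustrative sketch only, not the actual rte_mbuf definition. */
  #include <stdint.h>

  struct mbuf_offload_sketch {
          uint64_t ol_flags;       /* enlarged offload flags (room for
                                    * TSO and future features) */
          uint16_t l4_len;         /* L4 header length, needed for TSO */
          uint16_t mss;            /* maximum segment size, needed for TSO */
          uint16_t vlan_tci;       /* existing vlan tag */
          uint16_t vlan_tci_outer; /* second tag for stacked vlans */
  };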

On the other hand, I'm not convinced by this:

- new filters in the i40e driver: I don't think the mbuf is the
   right place for driver-specific flags. If a new driver brings a
   feature requiring a flag in the mbuf, we should take care that
   the flag is not bound to this particular driver and would match
   the same feature in another driver.
- sequence number: I'm not sure I understand the use-case; maybe this
   could stay in mbuf metadata in the reordering module.

> Firstly, we believe that there is no possible way that we can ever fit
> all the fields we need to fit into a 64-byte mbuf, and so we need to
> start looking at a 128-byte mbuf instead.

The TSO patches show that it is possible to keep a 64-byte mbuf (of
course, it depends on what we want to add in the mbuf). I'm not
fundamentally against having a 128-byte mbuf. But:

- it should not be a reason for just adding things without reworking
   what could be enhanced
- it should not be a reason for not optimizing the current mbuf
   structure
- if we can do the same with a 64-byte mbuf, we need to carefully
   compare the solutions, as fetching a second cache line is not
   free in all situations. The 64-byte solution I'm proposing
   in [1] may cost a bit more in CPU cycles but avoids an additional
   cache prefetch (or miss). In some situations (I'm thinking about
   use-cases where we are memory-bound, e.g. an application processing
   a lot of data), it is better to lose a few CPU cycles than to add
   a memory access (see the sketch after this list).
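
To illustrate where this extra cost shows up, here is a minimal
rx-loop sketch with an explicit prefetch of the mbuf's second cache
line. The burst size and app_process_packet() are placeholders, and
the hard-coded 64 stands for the cache line size.

  /* Minimal sketch: prefetching the 2nd cache line of each mbuf in an
   * rx loop. app_process_packet() is a placeholder. */
  #include <rte_ethdev.h>
  #include <rte_mbuf.h>
  #include <rte_prefetch.h>

  #define BURST_SIZE 32

  void app_process_packet(struct rte_mbuf *m); /* placeholder */

  void
  rx_loop(uint8_t port, uint16_t queue)
  {
          struct rte_mbuf *pkts[BURST_SIZE];
          uint16_t nb_rx, i;

          nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
          for (i = 0; i < nb_rx; i++) {
                  /* Touch the 2nd cache line early; with a 64-byte
                   * mbuf this prefetch (or the miss it hides) simply
                   * goes away. 64 = cache line size. */
                  rte_prefetch0((char *)pkts[i] + 64);
                  app_process_packet(pkts[i]);
          }
  }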

> First off the blocks is to look at moving the mempool pointer into
> the second cache line
> [...]
> Beyond this change, I'm also investigating potentially moving the "next"
> pointer to the second cache line, but it's looking harder to move
> without serious impact

I think we can easily find DPDK applications that would use the "next"
field of the mbuf on the rx side, as it is the standard way of chaining
packets. For instance: IP reassembly, TCP/UDP socket queues, or any
other protocol that needs a reassembly queue. This is at least what
we do in the 6WINDGate fast path stack, and I suppose other network
stack implementations would do something similar, so we should probably
avoid moving this field to the 2nd cache line.
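
As an illustration, here is a minimal sketch of the kind of chaining an
rx-side reassembly queue does through the "next" field. It uses the
current (1.7) layout where the packet fields live under the "pkt"
member; struct reass_ctx and its members are hypothetical.

  /* Minimal sketch: appending a received fragment to a per-flow chain
   * via the "next" field (1.7 layout). reass_ctx is hypothetical. */
  #include <rte_mbuf.h>

  struct reass_ctx {
          struct rte_mbuf *head;  /* first fragment of the chain */
          struct rte_mbuf *tail;  /* last fragment of the chain */
  };

  void
  reass_enqueue(struct reass_ctx *ctx, struct rte_mbuf *frag)
  {
          if (ctx->head == NULL) {
                  ctx->head = frag;
          } else {
                  /* this rx-side access to "next" would touch the 2nd
                   * cache line if the field were moved there */
                  ctx->tail->pkt.next = frag;
                  ctx->head->pkt.nb_segs++;
                  ctx->head->pkt.pkt_len += frag->pkt.data_len;
          }
          ctx->tail = frag;
  }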

One more issue I foresee: with slower CPUs like the Atom, having 2
cache lines will add more cost than on a Xeon. I'm wondering if it
makes sense to have a compile-time option to select either limited
features with one cache line or full features with two cache lines.
I don't know if it's a good idea because it would make the code more
complex, but we could consider it. I think we don't target binary
compatibility today, do we?
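
A hypothetical way to express such an option is sketched below.
RTE_MBUF_SINGLE_CACHE_LINE is an invented config name, and the split
of fields between the two cache lines is only illustrative.

  /* Sketch of a hypothetical build-time switch; the config name and
   * the choice of fields kept on each cache line are illustrative. */
  #include <stdint.h>
  #include <rte_memory.h>

  struct rte_mbuf_sketch {
          /* 1st cache line: fields needed on the fast rx/tx path */
          void *buf_addr;
          uint64_t ol_flags;
          /* ... */
  #ifndef RTE_MBUF_SINGLE_CACHE_LINE
          /* 2nd cache line: full-feature fields (extra offloads, ...) */
          uint16_t l4_len;
          uint16_t mss;
          /* ... */
  #endif
  } __rte_cache_aligned;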

From a functional point of view, we could check that my TSO patch can
be adapted to your proposal, so that we can challenge and merge both
approaches.

As this change would impact the core of DPDK, I think it would be
interesting to list some representative use-cases in order to evaluate
the cost of each solution. This will also help for future modifications,
and could be included in a sort of non-regression test?

Regards,
Olivier

[1] http://dpdk.org/ml/archives/dev/2014-May/002537.html
[2] http://dpdk.org/ml/archives/dev/2014-May/002322.html


