[dpdk-dev] mbuf changes

Morten Brørup mb at smartsharesystems.com
Tue Oct 25 14:49:38 CEST 2016


Comments inline.

Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: Shreyansh Jain [mailto:shreyansh.jain at nxp.com]
> Sent: Tuesday, October 25, 2016 1:54 PM
> To: Bruce Richardson
> Cc: Wiles, Keith; Morten Brørup; dev at dpdk.org; Olivier Matz
> Subject: Re: mbuf changes
> 
> On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote:
> > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> >>
> >>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> <mb at smartsharesystems.com> wrote:
> >>>
> >>> First of all: Thanks for a great DPDK Userspace 2016!
> >>>
> >>>
> >>>
> >>> Continuing the Userspace discussion about Olivier Matz’s proposed
> mbuf changes...
> >
> > Thanks for keeping the discussion going!
> >>>
> >>>
> >>>
> >>> 1.
> >>>
> >>> Stephen Hemminger had a noteworthy general comment about keeping
> metadata for the NIC in the appropriate section of the mbuf: Metadata
> generated by the NIC’s RX handler belongs in the first cache line, and
> metadata required by the NIC’s TX handler belongs in the second cache
> line. This also means that touching the second cache line on ingress
> should be avoided if possible; and Bruce Richardson mentioned that for
> this reason m->next was zeroed on free().
> >>>
> > Thinking about it, I suspect there are more fields we can reset on
> free
> > to save time on alloc. Refcnt, as discussed below is one of them, but
> so
> > too could be the nb_segs field and possibly others.
> >
> >>>
> >>>
> >>> 2.
> >>>
> >>> There seemed to be consensus that the size of m->refcnt should
> match the size of m->port because a packet could be duplicated on all
> physical ports for L3 multicast and L2 flooding.
> >>>
> >>> Furthermore, although a single physical machine (i.e. a single
> server) with 255 physical ports probably doesn’t exist, it might
> contain more than 255 virtual machines with a virtual port each, so it
> makes sense extending these mbuf fields from 8 to 16 bits.
> >>
> >> I thought we also talked about removing the m->port from the mbuf as
> it is not really needed.
> >>
> > Yes, this was mentioned, and also the option of moving the port value
> to
> > the second cacheline, but it appears that NXP are using the port
> value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> 
> I am not sure where NXP's NIC came into picture on this, but now that
> it
> is highlighted, this field is required for libevent implementation [1].
> 
> A scheduler sending an event, which can be a packet, would only have
> information of a flow_id. From this matching it back to a port, without
> mbuf->port, would be very difficult (costly). There may be way around
> this but at least in current proposal I think port would be important
> to
> have - even if in second cache line.
> 

Since it's a natural part of the RX handler, it rightfully belongs in the first cache line.

Admitting that I haven't read the referenced proposal in detail, I would like to reiterate that perhaps the Flow Director fields in the mbuf could be used instead of the port field.

> But, off the top of my head, as of now it is not being used for any
> specific purpose in NXP's PMD implementation.
> 
> Even the SoC patches don't necessarily rely on it except using it
> because it is available.
> 
> @Bruce: where did you get the NXP context here from?
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048592.html
> 
> >
> >>>
> >>>
> >>>
> >>> 3.
> >>>
> >>> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port
> to the second cache line, which then generated questions from the
> audience about the real life purpose of m->port, and if m->port could
> be removed from the mbuf structure.
> >>>
> >>>
> >>>
> >>> 4.
> >>>
> >>> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0
> on first allocation. This is based on the assumption that other mbuf
> fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is
> cheaper than setting it to 1.
> >>>
> >>> Furthermore (regardless of m->refcnt offset), I suggested that it
> is not required to modify m->refcnt when allocating and freeing the
> mbuf, thus saving one write operation on both alloc() and free().
> However, this assumes that m->refcnt debugging, e.g. underrun
> detection, is not required.
> >
> > I don't think it really matters what sentinal value is used for the
> > refcnt because it can't be blindly assigned on free like other
> fields.
> > However, I think 0 as first reference value becomes more awkward
> > than 1, because we need to deal with underflow. Consider the
> situation
> > where we have two references to the mbuf, so refcnt is 1, and both
> are
> > freed at the same time. Since the refcnt is not-zero, then both cores
> > will do an atomic decrement simultaneously giving a refcnt of -1. We
> can
> > then set this back to zero before freeing, however, I'd still prefer
> to
> > have refcnt be an accurate value so that it always stays positive,
> and
> > we can still set it to "one" on free to avoid having to set on alloc.
> >
> > Also, if we set refcnt on free rather than alloc, it does set itself
> up
> > as a good candidate for moving to the second cacheline. Fast-path
> > processing does not normally update the value.
> >
> >>>
> >>>
> >>>
> >>> 5.
> >>>
> >>> And here’s something new to think about:
> >>>
> >>> m->next already reveals if there are more segments to a packet.
> Which purpose does m->nb_segs serve that is not already covered by m-
> >next?
> >
> > It is duplicate info, but nb_segs can be used to check the validity
> of
> > the next pointer without having to read the second mbuf cacheline.
> >
> > Whether it's worth having is something I'm happy enough to discuss,
> > though.
> >
> > One other point I'll mention is that we need to have a discussion on
> > how/where to add in a timestamp value into the mbuf. Personally, I
> think
> > it can be in a union with the sequence number value, but I also
> suspect
> > that 32-bits of a timestamp is not going to be enough for many.
> >
> > Thoughts?
> >
> > /Bruce
> >
> 
> 
> --
> -
> Shreyansh



More information about the dev mailing list