[dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection

Ananyev, Konstantin konstantin.ananyev at intel.com
Wed Apr 11 02:25:31 CEST 2018



> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Yongseok Koh
> Sent: Tuesday, April 10, 2018 2:59 AM
> To: Olivier Matz <olivier.matz at 6wind.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu at intel.com>; Wu, Jingjing <jingjing.wu at intel.com>; adrien.mazarguil at 6wind.com;
> nelio.laranjeiro at 6wind.com; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
> 
> On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > Hi Yongseok,
> >
> > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > Hi,
> > > >
> > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > When attaching a mbuf, an indirect mbuf has to point to the start of the
> > > > > direct mbuf's buffer. By adding a buf_off field to rte_mbuf, this becomes more
> > > > > flexible. An indirect mbuf can point to any part of the direct mbuf by calling
> > > > > rte_pktmbuf_attach_at().
> > > > >
> > > > > Possible use-cases could be:
> > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > >   each packet can be referenced by multiple mbuf indirections.
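> > > > >
> > > > > A minimal sketch of the intended usage for the second use-case (m is the large
> > > > > direct mbuf, pool an mbuf mempool; the exact rte_pktmbuf_attach_at() signature
> > > > > is assumed here to be (indirect, direct, byte offset), and the sizes/offsets
> > > > > are only illustrative):
> > > > >
> > > > >   struct rte_mbuf *p0 = rte_pktmbuf_alloc(pool);
> > > > >   struct rte_mbuf *p1 = rte_pktmbuf_alloc(pool);
> > > > >
> > > > >   rte_pktmbuf_attach_at(p0, m, 0);     /* first packet at offset 0 in m */
> > > > >   rte_pktmbuf_attach_at(p1, m, 2048);  /* second packet at offset 2048 */
> > > > >   p0->data_len = p0->pkt_len = 64;     /* per-packet lengths */
> > > > >   p1->data_len = p1->pkt_len = 128;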
> > > > >
> > > > > Signed-off-by: Yongseok Koh <yskoh at mellanox.com>
> > > >
> > > > I think the current API is already able to do what you want.
> > > >
> > > > 1/ Here is a mbuf m with its data
> > > >
> > > >                off
> > > >                <-->
> > > >                       len
> > > >           +----+   <---------->
> > > >           |    |
> > > >         +-|----v----------------------+
> > > >         | |    -----------------------|
> > > > m       | buf  |    XXXXXXXXXXX      ||
> > > >         |      -----------------------|
> > > >         +-----------------------------+
> > > >
> > > >
> > > > 2/ clone m:
> > > >
> > > >   c = rte_pktmbuf_alloc(pool);
> > > >   rte_pktmbuf_attach(c, m);
> > > >
> > > >   Note that c has its own offset and length fields.
> > > >
> > > >
> > > >                off
> > > >                <-->
> > > >                       len
> > > >           +----+   <---------->
> > > >           |    |
> > > >         +-|----v----------------------+
> > > >         | |    -----------------------|
> > > > m       | buf  |    XXXXXXXXXXX      ||
> > > >         |      -----------------------|
> > > >         +------^----------------------+
> > > >                |
> > > >           +----+
> > > > indirect  |
> > > >         +-|---------------------------+
> > > >         | |    -----------------------|
> > > > c       | buf  |                     ||
> > > >         |      -----------------------|
> > > >         +-----------------------------+
> > > >
> > > >                 off    len
> > > >                 <--><---------->
> > > >
> > > >
> > > > 3/ remove some data from c without changing m
> > > >
> > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > >    rte_pktmbuf_trim(c, 10)  // at tail
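> > > >
> > > >    A compact sketch putting 2/ and 3/ together (pool and m assumed to exist;
> > > >    the offsets are only illustrative):
> > > >
> > > >      struct rte_mbuf *c = rte_pktmbuf_alloc(pool);
> > > >
> > > >      rte_pktmbuf_attach(c, m);   /* c shares m's buffer, same off/len as m */
> > > >      rte_pktmbuf_adj(c, 10);     /* c->data_off += 10, c->data_len -= 10 */
> > > >      rte_pktmbuf_trim(c, 10);    /* c->data_len -= 10 */
> > > >      /* m's own data_off/data_len are left untouched */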
> > > >
> > > >
> > > > Please let me know if it fits your needs.
> > >
> > > No, it doesn't.
> > >
> > > Trimming head and tail with the current APIs removes data and makes the space
> > > available. Adjusting the packet head means giving more headroom, not shifting the
> > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > different offsets in m,
> > >
> > > rte_pktmbuf_adj(c1, 10);
> > > rte_pktmbuf_adj(c2, 20);
> > >
> > > then the owner of c2 regards the first (off+20)B as available headroom. If it
> > > wants to attach an outer header, it will overwrite the headroom even though the
> > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > header should be linked by h1->next = c2.
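> > >
> > > A sketch of that chaining with the existing API (pool, c2 and hdr_len are
> > > assumed to exist; error handling omitted):
> > >
> > >   struct rte_mbuf *h1 = rte_pktmbuf_alloc(pool);
> > >   char *hdr = rte_pktmbuf_append(h1, hdr_len); /* writable room for the outer header */
> > >
> > >   /* ... build the outer header in hdr ... */
> > >   rte_pktmbuf_chain(h1, c2);   /* h1->next = c2; pkt_len/nb_segs updated on h1 */
> > >
> > > This way the shared data behind c2 (and c1, and m) is never written.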
> >
> > Yes, after these operations c1, c2 and m should become read-only. So, to
> > prepend headers, another mbuf has to be inserted before as you suggest. It
> > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > length) that will:
> >   - alloc and attach indirect mbuf for each segment of m that is
> >     in the range [offset : length+offset].
> >   - prepend an empty and writable mbuf for the headers
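> >
> > A rough single-segment sketch of such a wrapper (multi-segment handling
> > omitted, the name is hypothetical, and offset+length are assumed to fit in
> > m's first segment):
> >
> >   struct rte_mbuf *
> >   rte_pktmbuf_clone_area(struct rte_mbuf *m, struct rte_mempool *mp,
> >                          uint16_t offset, uint16_t length)
> >   {
> >       struct rte_mbuf *hdr = rte_pktmbuf_alloc(mp); /* writable mbuf for headers */
> >       struct rte_mbuf *c = rte_pktmbuf_alloc(mp);   /* indirect clone of the area */
> >
> >       if (hdr == NULL || c == NULL)
> >           goto fail;
> >       rte_pktmbuf_attach(c, m);                     /* share m's buffer */
> >       rte_pktmbuf_adj(c, offset);                   /* skip to 'offset' */
> >       rte_pktmbuf_trim(c, c->data_len - length);    /* keep only 'length' bytes */
> >       rte_pktmbuf_chain(hdr, c);                    /* empty, writable head + clone */
> >       return hdr;
> >   fail:
> >       rte_pktmbuf_free(hdr);                        /* free(NULL) is a no-op */
> >       rte_pktmbuf_free(c);
> >       return NULL;
> >   }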
> >
> > > If c1 and c2 are attached by shifting the buffer address via buf_off, which
> > > actually shrinks the headroom, this case can be handled properly.
> >
> > What do you mean by properly handled?
> >
> > Yes, prepending data or adding data in the indirect mbuf won't overwrite
> > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > won't be protected.
> >
> > From an application point of view, indirect mbufs, or direct mbufs that
> > have refcnt != 1, should be both considered as read-only because they
> > may share their data. How can an application know whether the data is shared
> > or not?
> >
> > Maybe we need a flag to differentiate mbufs that are read-only
> > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > understanding is correct, you want to have indirect mbufs with RW data.
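> >
> > As a sketch, the rule above could be expressed with the existing fields (no
> > new flag yet), e.g.:
> >
> >   /* data may be shared with another mbuf: treat it as read-only */
> >   static inline int pkt_data_is_shared(const struct rte_mbuf *m)
> >   {
> >       return RTE_MBUF_INDIRECT(m) || rte_mbuf_refcnt_read(m) > 1;
> >   }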
> 
> Agreed that an indirect mbuf must be treated as read-only; then the current code is
> enough to handle that use-case.
> 
> > > And another use-case (this is my actual use-case) is to make a large mbuf hold
> > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > because they transfer multiple packets into a single large buffer to reduce PCIe
> > > overhead for small-packet traffic, like the Multi-Packet Rx of mlx5 does.
> > > Otherwise, packets would have to be memcpy'd into regular mbufs one by one instead
> > > of being referenced indirectly.

But just to make the HW RX multiple packets into one mbuf,
data_off inside the indirect mbuf should be enough, correct?
As I understand it, what you'd like to achieve with this new field is the
ability to manipulate packet boundaries after RX, probably at an upper layer.
As Olivier pointed out above, that doesn't sound like a safe approach - you have multiple
indirect mbufs trying to modify the same direct buffer.
Though if you really need to do that, couldn't it be achieved by updating the buf_len and priv_size
fields of the indirect mbufs, straight after attach()?
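
A rough sketch of that alternative (no new mbuf field; one way to express it via
data_off, with the 2048-byte offset and lengths only illustrative and assumed to
fit inside m's buffer):

  struct rte_mbuf *c = rte_pktmbuf_alloc(pool);

  rte_pktmbuf_attach(c, m);             /* c now shares m's buffer */
  c->data_off = m->data_off + 2048;     /* point c at the second packet in m */
  c->data_len = 128;                    /* its length */
  c->pkt_len = 128;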
Konstantin

> > >
> > > Does this make sense?
> >
> > I understand the need.
> >
> > Another option would be to make the mbuf->buffer point to an external
> > buffer (not inside the direct mbuf). This would require adding a
> > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > a quick overview.
> >
> > [1]
> > https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf
> >
> > The advantage is that it does not require the large data to be inside a
> > mbuf (requiring a mbuf structure before the buffer, and requiring it to be
> > allocated from a mempool). On the other hand, it is maybe more complex
> > to implement compared to your solution.
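> >
> > For illustration only (this is not an existing API here, just a possible shape
> > of the metadata that would have to live next to the external data, similar to
> > the kernel's skb_shared_info):
> >
> >   /* hypothetical callback invoked once the last reference is dropped */
> >   typedef void (*ext_free_cb_t)(void *buf_addr, void *opaque);
> >
> >   struct ext_buf_shinfo {          /* hypothetical, stored beside the data */
> >       ext_free_cb_t free_cb;
> >       void *fcb_opaque;
> >       rte_atomic16_t refcnt;       /* shared by all mbufs pointing to the buffer */
> >   };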
> 
> I knew that you presented the slides and frankly, I had considered that option
> at first. But even with that option, metadata to store the refcnt would also have to be
> allocated and managed anyway. The kernel likewise maintains skb_shared_info at the
> end of the data segment. Even though it could have a smaller metadata structure,
> I just wanted to make full use of the existing framework because it is less
> complex, as you mentioned. Given that you presented the idea of an external data
> buffer in 2016 and there haven't been many follow-up discussions/activities so
> far, I thought the demand isn't so big yet, so I wanted to keep this patch
> simpler. I personally think that we can take up the idea of an external data segment
> when more demand comes from users in the future, as it would be a huge change and may
> break the current ABI/API. When that day comes, I'll gladly participate in the
> discussions and write code for it if I can be helpful.
> 
> Do you think this patch is okay for now?
> 
> 
> Thanks for your comments,
> Yongseok

