[dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type

Richardson, Bruce bruce.richardson at intel.com
Thu May 21 14:12:18 CEST 2015



> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Neil Horman
> Sent: Wednesday, May 20, 2015 7:47 PM
> To: Marc Sune
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [RFC PATCHv2 0/2] pktdev as wrapper type
> 
> On Wed, May 20, 2015 at 07:01:02PM +0200, Marc Sune wrote:
> >
> >
> > On 20/05/15 12:28, Neil Horman wrote:
> > >On Wed, May 20, 2015 at 12:05:00PM +0200, Marc Sune wrote:
> > >>
> > >>On 20/05/15 10:31, Thomas Monjalon wrote:
> > >>>2015-05-19 12:31, Bruce Richardson:
> > >>>>On Mon, May 11, 2015 at 05:29:39PM +0100, Bruce Richardson wrote:
> > >>>>>Hi all,
> > >>>>>
> > >>>>>after a small amount of offline discussion with Marc Sune, here
> > >>>>>is an alternative proposal for a higher-level interface - aka
> > >>>>>pktdev - to allow a common Rx/Tx API across device types handling
> > >>>>>mbufs [for now, ethdev, ring and KNI]. The key code is in the
> > >>>>>first patch of the set - the second is an example of a trivial
> usecase.
> > >>>>>
> > >>>>>What is different about this to previously:
> > >>>>>* wrapper class, so no changes to any existing ring, ethdev
> > >>>>>implementations
> > >>>>>* use of function pointers for RX/TX with an API that maps to
> ethdev
> > >>>>>   - this means there is little/no additional overhead for ethdev
> calls
> > >>>>>   - inline special case for rings, to accelerate that. Since we
> are at a
> > >>>>>     higher level, we can special case process some things if
> appropriate. This
> > >>>>>     means the impact to ring ops is one (predictable) branch per
> > >>>>>burst
> > >>>>>* elimination of the queue abstraction. For the ring and KNI, there
> is no
> > >>>>>   concept of queues, so we just wrap the functions directly (no
> need even for
> > >>>>>   wrapper functions, the api's match so we can call directly).
> This also
> > >>>>>   means:
> > >>>>>   - adding in features per-queue, is far easier as we don't need
> to worry about
> > >>>>>     having arrays of multiple queues. For example:
> > >>>>>   - adding in buffering on TX (or RX) is easier since again we
> only have a
> > >>>>>     single queue.
> > >>>>>* thread safety is made easier using a wrapper. For a MP ring, we
> can create
> > >>>>>   multiple pktdevs around it, and each thread will then be able to
> use their
> > >>>>>   own copy, with their own buffering etc.
> > >>>>>
> > >>>>>However, at this point, I'm just looking for general feedback on
> > >>>>>this as an approach. I think it's quite flexible - even more so
> > >>>>>than the earlier proposal we had. It's less prescriptive and
> doesn't make any demands on any other libs.
> > >>>>>
> > >>>>>Comments/thoughts welcome.
> > >>>>Any comments on this RFC before I see about investing further time
> > >>>>in it to clean it up a bit and submit as a non-RFC patchset for
> merge in 2.1?
> > >>>I would say there are 2 possible approaches for KNI and ring
> handling:
> > >>>1/ You Bruce, Marc and Keith are advocating for a layer on top of
> > >>>ethdev, ring, KNI and possibly other devices, which uses mbuf. The
> > >>>set of functions is simpler than ethdev but the data structure is
> > >>>mbuf which is related to ethdev layer.
> > >>>2/ Konstantin and Neil talked about keeping mbuf for ethdev layer
> > >>>and related libs only. Ring and KNI could have an ethdev API with a
> > >>>reduced set of implemented functions. Crypto devices could adopt a
> > >>>specific crypto API and an ethdev API at the same time.
> > >>I don't fully understand which APIs you meant by non-ethdev. This
> > >>pktdev wrapper proposal abstracts RX and TX functions only, and all
> > >>of these are using mbufs as the packet buffer abstraction right now
> anyway (ethdev).
> > >>
> > >He's referring to future device classes (like crypto devices), which
> > >ostensibly would make use of the pktdev API.  My argument (and I
> > >think Thomas') is that if a bit of hardware can be made to operate as
> > >a packet sending/receiving device, then its just as reasonable to use
> > >the existing ethdev api rather than some other restricted version of
> > >it (pktdev)
> > >
> > >>This approach does not preclude that different libraries expose
> > >>other API calls. In fact they will have to; setup the port/device
> > >>... It is just a higher level API, so that you don't have to check
> > >>the type of port in your DPDK application I/O loop, minimizing user's
> code.
> > >>
> > >No argument there.  But if thats the case (and I agree that it is),
> > >an application will implicitly have to know what type of device
> > >it is, because it (the application) will need to understand the
> specific API it is writing to.
> > >
> > >>Or were you in 2) thinking about creating a different "packet buffer"
> > >>abstraction, independent from the ethdev, and then map the different
> > >>port specifics (e.g. mbuf) to this new abstraction?
> > >>
> > >My argument was to just leave the ethdev api alone.  If a device
> > >class can be made to look like a packet forwarding device, then use
> > >the existing ethdev api to implement it.
> > >
> > >>>I feel it's cleaner, more generic and more maintainable to have
> > >>>drivers implementing one or several stable APIs instead of having
> > >>>some restricted wrappers to update.
> > >>This would be a separate library _on top_ of the existing APIs, and
> > >>it has the advantage to simplify the DPDK user's application code
> > >>when an application needs to deal with several types of port, as
> > >>shown in the example that Bruce provided in PATCH #2.
> > >>
> > >But thats already the purpose of the ethdev api.  Different types of
> > >hardware/software can be made to look like the same thing (an ethdev)
> > >from an application standpoint.  Adding this pktdev layer does
> > >nothing but that, add a layer.  If you want restricted functionality
> > >of an interface, thats ok, ethdev offers that ability.  unimplemented
> > >methods in a pmd cause the ethdev api to return EOPNOTSUP to the
> > >calling application, so the application knows when a given ethdev can't
> do some aspect of what an ethdev is.
> >
> > Hi Neil,
> >
> > Thanks for the clarifications. Now I understand the concern Thomas
> > expressed. Using ethdev API (port-ids) was actually my first
> > suggestion
> > here:
> >
> > http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/13545
> >
> > And to be honest, what I was expecting when I was reading for the
> > first time DPDK's APIs. It is indeed an option. However, if we take a
> look at the API:
> >
> > http://www.dpdk.org/doc/api/rte__ethdev_8h.html
> >
> > none of the API calls, except the burst RX/TX and, perhaps, the
> > callbacks, would be used by devices other than NICs. It seems going a
> > bit too far using it, but ofc possible.
> >
> So, I'll make 3 counter-arguments here:
> 
> 1) To your point about the ethdev api being much larger than what a non-
> ethernet device could use, I'll tacitly agree, but indicate that its not
> relevant.  If you want a bit of hardware that isn't a network interface to
> behave like a network interface, then there are going to be a lot of
> aspects of a network interface that it just can't do.  Thats true
> regardless of how you implement that.  In the pktdev model, you prevent
> those operations from being an option at all, while in the current ethdev
> model, you simply get a return code of EOPNOTSUP, and the application does
> the right thing (which is to say, it understands that this hardware
> doesn't need that aspect of network card management and goes on with its
> day).  I assert that, because we already have the ethdev api, its a lower
> time investment to simply reuse it
> 
> 2) To the implication that we aren't working with NICs here, you're
> correct.  As you note in your previous message, the pktdev interface is in
> no way the end all and be all of device model design.  You will need to
> add other api calls to manage the device.  If thats the case, then don't
> shoehorn any one particular aspect of the API to fit a device model that
> the device doesn't conform to.
> Design the API so that it best reflects the hardware behavior.
> 
> 
> 3) An addendum to the point about hardware not being a NIC (and you didn't
> make this point directly above, but I think you may have mentioned it
> previously), sometimes you want a device to behave like another device for
> the purposes of using generic code to talk to several device types.  While
> this is true, this is a case for device translation and use, not for
> carving out parts of an api to make something more generic.  The use case
> I cited previously was an ipsec tunnel.  An ipsec tunnel uses
> cryptography, and crypto device apis to encrypt decrypt packet data.  The
> common way to implement this is to design a crypto api that accepts a
> block of data in a way most conducive to the hardware, and then implement
> a network driver (that uses whatever ethernet api, in this case the ethdev
> api), to integrate with the network datapath.  With this model, the ipsec
> tunnel uses the full range of the ethdev api (or a good deal more of it),
> and the crypto api is optimized to work with crypto acceleration hardware.
> 
> > In essence, rte_ether(rte_ethdev.h) right now has: i) NIC setup;
> > general configuration, queue config, fdir, offloads, hw stuff like
> > leds... ii) RX/TX routines and callbacks iii) Stats and queue stats
> > iv) other utils for ethernet stuff (rte_ether.h)
> >
> The key that I'm taking away here is 'right now'.  Its already written, so
> theres no work involved in implementing it for new devices.
> 
> > i) is clearly HW specific, and does only apply to NICs/ASICs (e.g.
> > FM10k)
> Ok, so it only applies to NIC's, thats fine.  If you want to write a
> driver that leaves those methods for the pmd set to NULL, the ethdev
> library will correctly return EOPNOTSUPP to the calling applications.
> 
> > while ii) and iii) are things that could be abstracted beyond NICs,
> > like KNI, rte_ring, crypto... (iv could be moved into some
> > utils/protocol parsing libraries).
> >
> Right again, so let those device types implement the appropriate portions
> of the pmd driver structure that match to what they support.  Everything
> else is handled by the ethdev library automatically.
> 
> > Perhaps these two groups could be split into two different libraries
> > and then ii) and iii) together would be something like ~ rte_pktdev
> > (stats are missing on the proposed patch), while i) would be
> > rte_ether, or rte_nic if we think it is a better name.
> >
> The point I'm trying to get to is, why split at all?  Theres just no need
> that I can see. The example I would set here is the dummy driver in linux.
> Its a net device that only serves to act as a sink for network packets.
> It still uses the network driver interface, but of the 65-ish methods that
> the netdevice model in linux offers, it implements 8 (or approximately
> 12%).  The other unused method are just that, unused, and thats ok.
> Applications that try to do things like set flow director options, or
> speed/duplex options gets a return code that effectively says "This device
> can't do that", and thats ok.  Thats what we need to be doing here.
> Instead of finding a way to codify the subset of functionality that other
> devices might be able to implement, for those cases where we want other
> hardware to act like a netdevice, lets just let those devices pick and
> choose what to implement, and the interface we already have will
> communicate with applications appropriately.
> 
> Regards
> Neil
> 

Hi Neil,

First off, a note on the naming and the basic concept: this proposal is not trying to make everything look like a NIC; rather, we are trying to make a bunch of different components appear as generic sources/sinks for pkts or mbufs. From my point of view, it's an important difference.

Be that as it may, I'd like to first deal with the whole idea of the application needing to know about the type of the underlying device. For me, this is a critical point. Applications - such as all our sample apps - have essentially two parts:
* an initialization and control part
* a data-path part.
These two parts are very, very different in what they do. The initialization part - which e.g. in testpmd continues on in the form of the cmdline interface as a control part - does the initial setup of devices/rings/etc. and potentially makes use of the full APIs provided by the ethdev interface. It's also not performance critical, as evidenced by the fact that the APIs used there have additional checks for valid input etc. The second, data-path part is entirely the part that this proposal is targeting. This data-path is completely separate in the application, is highly performance sensitive, and rarely, if ever, needs to know or care about the actual source of its data. So the idea behind this library is that you can write the initialization and control parts of your app as now, fully aware of the underlying types involved, and without ever using rx/tx burst. Then, when you have the various devices and DPDK objects set up, you spawn your data-path threads and pass each one the set of inputs and outputs it needs, in the form of your generic packet source objects.
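To illustrate the split, here is a minimal, self-contained sketch. It does not use the structures from the patch: struct pkt stands in for rte_mbuf, and struct pktdev, struct fifo, pktdev_from_fifo() and forward_burst() are all hypothetical names invented for this illustration. The control path knows the concrete type and wraps it once; the data path only ever sees the generic object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rte_mbuf; a real app would use the DPDK type. */
struct pkt { uint32_t len; };

/* Hypothetical generic packet source/sink (not the struct from the patch). */
struct pktdev {
    uint16_t (*rx_burst)(void *ctx, struct pkt **pkts, uint16_t n);
    uint16_t (*tx_burst)(void *ctx, struct pkt **pkts, uint16_t n);
    void *ctx;   /* backend-specific state, opaque to the data path */
};

/* Toy backend: a small FIFO standing in for a ring, NIC queue or KNI. */
#define FIFO_SLOTS 64
struct fifo { struct pkt *slot[FIFO_SLOTS]; uint16_t head, count; };

static uint16_t fifo_rx(void *ctx, struct pkt **pkts, uint16_t n)
{
    struct fifo *f = ctx;
    uint16_t i;

    for (i = 0; i < n && f->count > 0; i++) {
        pkts[i] = f->slot[f->head];
        f->head = (uint16_t)((f->head + 1) % FIFO_SLOTS);
        f->count--;
    }
    return i;
}

static uint16_t fifo_tx(void *ctx, struct pkt **pkts, uint16_t n)
{
    struct fifo *f = ctx;
    uint16_t i;

    for (i = 0; i < n && f->count < FIFO_SLOTS; i++) {
        f->slot[(f->head + f->count) % FIFO_SLOTS] = pkts[i];
        f->count++;
    }
    return i;
}

/* Control path: aware of the concrete type, converts it with one call. */
static struct pktdev pktdev_from_fifo(struct fifo *f)
{
    struct pktdev d = { fifo_rx, fifo_tx, f };
    return d;
}

/* Data path: handles generic sources and sinks only. */
static uint16_t forward_burst(struct pktdev *in, struct pktdev *out)
{
    struct pkt *burst[32];
    uint16_t nb_rx, nb_tx;

    assert(in != NULL && out != NULL);
    nb_rx = in->rx_burst(in->ctx, burst, 32);
    nb_tx = out->tx_burst(out->ctx, burst, nb_rx);
    return nb_tx;   /* a real app would free or retry any unsent packets */
}
```

A data-path thread would simply call forward_burst() in a loop over the inputs and outputs it was handed at spawn time, without ever checking what sits behind them.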

This distinction is also why I'm not particularly interested in the ability to pass in different objects via cmdline, as is done now with pcap/ring PMDs. That's ok when you want the initialization part of your app to be oblivious to the underlying types and see everything under a common abstraction, but when it's only the data path that you want to work with generic packet objects, that's unnecessary, and the initialization path should be able to convert any of the required input/output sources to a generic type using a single API call. [This doesn't rule out specifying different inputs/outputs on the commandline, it's just that you can specify them as their native types, rather than hiding them under a common API at the control-path level].

As for what that abstraction should be: there are a number of issues I see with ethdev, as it is right now, as that common abstraction.

1. The use of port-ids. I think port ids are fine for numbering physical ports, but I think pointers are better for passing around objects to be worked on by the data path. What is more concerning [than my opinion on numbers vs pointers :-)] is the fact that we are limited to 256 port ids. Yes, that can be changed, but the impacts are massive. To change the type, we would break the ABI for every single ethdev API, as well as likely other functions too. Furthermore, increasing the size of the port id would require a change to the internals of the mbuf structure, which would lead to the ABI being broken for any function that uses mbufs. By adopting an API, such as proposed, which uses pointers, we avoid the problem, as port ids would only apply to ethdevs.
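To make the width concern concrete, here is a small sketch assuming an 8-bit port id type. The names port_table, dev_by_id() and dev_name() are hypothetical and only illustrate the two styles of handle; they are not part of ethdev or of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 256   /* everything an 8-bit id can address */

struct dev { const char *name; };

/* Id-based lookup: the id width is baked into every function signature. */
static struct dev *port_table[MAX_PORTS];

static struct dev *dev_by_id(uint8_t port_id)
{
    return port_table[port_id];   /* a uint8_t can never exceed 255 */
}

/* Pointer-based handle: no global numbering, so no width to outgrow. */
static const char *dev_name(const struct dev *d)
{
    assert(d != NULL);
    return d->name;
}
```

With the id-based style, a 257th object has no valid id (256 wraps to 0 in a uint8_t), and widening the type changes the signature of every function that takes it. The pointer-based style has no such table to outgrow.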

2. Simplicity. While you say that it's fine for an ethdev not to implement all the functions in the ethdev API, creating a proper PMD like you are proposing involves a good deal more work than using the proposed pktdev abstraction. If it's to appear like a proper NIC to the control paths, as well as the init paths - which seems to be what you imply - you really do need to implement additional functions like queue setup, and start and stop. While it's true that the library can return -ENOTSUP on an unsupported function, I don't believe any of our sample apps are set up to check for this on NIC setup, and therefore I would hazard a guess that real-world customer apps aren't set up to handle it either.
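For reference, this is roughly what an application would have to do to cope with partially implemented devices. It is a sketch only: struct dev_ops, struct device, dev_set_link_speed(), dev_start() and app_setup() are hypothetical names and not the ethdev API; the only thing taken from the discussion is the convention that a missing driver method surfaces as -ENOTSUP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver ops table; unsupported entries are left NULL. */
struct dev_ops {
    int (*start)(void *priv);
    int (*set_link_speed)(void *priv, unsigned int mbps);
};

struct device { const struct dev_ops *ops; void *priv; };

/* Library-level wrappers: a missing method becomes -ENOTSUP. */
static int dev_set_link_speed(struct device *dev, unsigned int mbps)
{
    assert(dev != NULL && dev->ops != NULL);
    if (dev->ops->set_link_speed == NULL)
        return -ENOTSUP;
    return dev->ops->set_link_speed(dev->priv, mbps);
}

static int dev_start(struct device *dev)
{
    assert(dev != NULL && dev->ops != NULL);
    if (dev->ops->start == NULL)
        return -ENOTSUP;
    return dev->ops->start(dev->priv);
}

/* Toy ring-like driver that implements start and nothing else. */
static int toy_start(void *priv) { (void)priv; return 0; }
static const struct dev_ops toy_ops = { .start = toy_start };

/* Application setup that tolerates optional features. */
static int app_setup(struct device *dev)
{
    int ret = dev_set_link_speed(dev, 10000);

    if (ret < 0 && ret != -ENOTSUP)
        return ret;   /* a real failure */
    /* -ENOTSUP: this device has no link speed to configure; carry on. */
    return dev_start(dev);
}
```

Setup code that treats every negative return as fatal would refuse to bring up the toy device here, which is the situation I'd expect with existing apps.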

3. Performance for rings. While not applicable for all cases, the performance of the rings under an ethdev abstraction would not be the same as here. For example, when polling on an empty ring for packets, the current time taken by our ring rx/tx functions is literally a few cycles (as tested by the rings autotest). If these functions cannot be inlined, that cycle count goes up to 3x what it is now. [I observed this previously when reworking the rings code, and the code-size led to icc no longer doing inlining. In that case, the gcc code for empty polling was indeed 3 times faster than the icc version. Adding forced inlining made things equal again]. This metric of empty polling may seem trivial, i.e. "if there are no packets, why does it matter how long it takes?", but it is important in real-world cases where you are pulling packets from multiple sources, and your application is only currently dealing with input on one of them. [This is often tested to see how an application behaves in a single-flow situation - a metric our customers do look at]. Even in the non-empty situation, for smaller packet bursts, the overhead of the function call may slow things down. [For larger bursts, e.g. 32, the effect should not be noticeable, I suspect].
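The inline special case for rings can be sketched as follows. This is an illustration under assumptions, not the code from the patch: struct ring is a toy single-threaded ring rather than rte_ring, and pktdev_rx_burst(), enum pktdev_type and the other names are hypothetical. The point is only that the ring path costs one predictable branch per burst and stays inlinable, while other device types go through a function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pkt { uint32_t len; };   /* stand-in for struct rte_mbuf */

/* Toy single-threaded ring; not the real rte_ring layout. */
#define RING_SLOTS 64   /* power of two so the index mask works */
struct ring { struct pkt *slot[RING_SLOTS]; uint32_t prod, cons; };

static inline int ring_enqueue(struct ring *r, struct pkt *p)
{
    if (r->prod - r->cons == RING_SLOTS)
        return -1;   /* full */
    r->slot[r->prod++ & (RING_SLOTS - 1)] = p;
    return 0;
}

static inline uint16_t
ring_dequeue_burst(struct ring *r, struct pkt **pkts, uint16_t n)
{
    uint32_t avail = r->prod - r->cons;
    uint16_t i;

    if (avail == 0)
        return 0;   /* empty poll: two loads, a subtract and a compare */
    if (n > avail)
        n = (uint16_t)avail;
    for (i = 0; i < n; i++)
        pkts[i] = r->slot[(r->cons + i) & (RING_SLOTS - 1)];
    r->cons += n;
    return n;
}

enum pktdev_type { PKTDEV_RING, PKTDEV_OTHER };

struct pktdev {
    enum pktdev_type type;
    void *ctx;   /* struct ring * when type == PKTDEV_RING */
    uint16_t (*rx_burst)(void *ctx, struct pkt **pkts, uint16_t n);
};

/* Toy non-ring backend that never has packets. */
static uint16_t null_rx(void *ctx, struct pkt **pkts, uint16_t n)
{
    (void)ctx; (void)pkts; (void)n;
    return 0;
}

static inline uint16_t
pktdev_rx_burst(struct pktdev *d, struct pkt **pkts, uint16_t n)
{
    assert(d != NULL);
    if (d->type == PKTDEV_RING)   /* one predictable branch per burst */
        return ring_dequeue_burst(d->ctx, pkts, n);   /* inlinable path */
    return d->rx_burst(d->ctx, pkts, n);   /* indirect call */
}
```

Whether the compiler actually inlines ring_dequeue_burst() depends on the compiler and code size, as noted above; in a real implementation a forced-inline attribute may be needed.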

The only other point I'd make here is that what is proposed is not prescriptive - whatever a future API for handling other device types, such as crypto devices, may look like can be decided separately from this pktdev implementation. Whether one chooses pktdev or ethdev as a common abstraction layer type, the decision of whether or not a particular object type is allowed to be made to look like that common type can be made entirely independently, based upon whether or not such a type-conversion makes sense.

Regards,
/Bruce

