[dpdk-dev] Minutes of Technical Board Meeting, 2021-03-10

Bruce Richardson bruce.richardson at intel.com
Wed Apr 7 11:58:53 CEST 2021


On Wed, Apr 07, 2021 at 09:11:23AM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Honnappa
> > Nagarahalli
> > Sent: Wednesday, April 7, 2021 2:48 AM
> > 
> > <snip>
> > >
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Tom Barbette
> > > > Sent: Wednesday, March 31, 2021 10:53 AM
> > > >
> > > > Le 31-03-21 à 02:44, Honnappa Nagarahalli a écrit :
> > > > > 	- Ability to tune the values of #defines
> > > > >     * Few prominent points discussed
> > > > > 	- This will result in #ifdefs in the code (for ex: in
> > testpmd)
> > > > > 	- One option is for all the PMDs to document their
> > configurable
> > > > #defines in PMD specific header files. Having these distributed is
> > > > much easier to search.
> > > > > 	- Can some of the existing #defines be converted to runtime
> > > > configurations? For ex: RTE_MAX_LCORE? This might impact ABI.
> > > > >     * Bruce to think about converting the doc to a blog or an
> > email
> > > > on the mailing list. But soliciting feedback is most important.
> > > >
> > > > One alternative path worth looking at is to encourage the use of
> > LTO,
> > > > and modify APIs so the configuration can be provided at linking
> > time,
> > > > and propagated by the compiler.
> > > >
> > > > E.g. one can define rte_max_lcore as a weak constant symbol, equal
> > to
> > > > 128. At linking time the user may provide a rte_max_lcore that is
> > more
> > > > tailored, and still, dynamic arrays[rte_max_lcore] will be
> > allocatable
> > > > on the .bss section, avoiding an indirection. The compiler will be
> > > > able to optimize loops etc which is impossible with pure runtime
> > > > configuration.
> > > >
> > > > In packetmill.io we actually pushed this to the next level where
> > the
> > > > driver can completely change its behavior without recompiling DPDK
> > > > itself and spawning ifdefs everywhere.
> > > >
> > > > However the price is the slowiness of LTO...
> > > >
> > > > My 2 cents.
> > > >
> > > > Tom
> > > >
> > >
> > > If we are moving away from Compile Time parameters, I certainly
> > prefer Tom's
> > > suggestion of Link Time parameters, rather than Run Time parameters.
> > I think compile time constants are fine if they are not used in #ifdef.
> > For ex: if they are used in 'if (...)', it will help eliminate code and
> > branches.
> 
> Yes!
> 
> And "if (...)" is more flexible than #ifdef/#if because it allows the expression to be mixed with non-constants.
> 
> Then perhaps Bruce's script to automatically make C constants out of #defines was not so silly anyway. :-)
> 
> > 
> > >
> > > This might also provide a middle ground for optimizations where
> > Compile
> > > Time parameters are considered unacceptable by the DPDK community.
> > I'm
> > > thinking about something along the lines of the "constant size"
> > rte_event
> > > array presented at the 2020 Userspace Summit by Harry
> > >
> > (https://static.sched.com/hosted_files/dpdkuserspace2020/d3/dpdk_usersp
> > ac
> > > e_20_api_performance_hvh.pdf). Taking this thinking even further out,
> > a Link
> > > Time parameter could perhaps replace the nb_pkts parameter in on
> > > optimized rte_eth_rx_burst() function.
> > >
> 
> Optimally, I would like to see e.g. the RX burst size being so constant that the PMD's RX function knows it and can use vector functions and possibly loop unrolling, without having to implement a pre-check on nb_pkts and a trailing non-vector loop for receiving any remaining odd nb_pkts. All the DPDK examples use #define MAX_PKT_BURST 32 or similar, and I assume most DPDK applications do too.
> 
> I do not trust the compiler to be clever enough to realize that the PMD's RX function is always called with a specific nb_pkts and optimize all this cruft away at compile time (or at link time), unless it is a #define or a compile time constant.
> 

It certainly is not possible to do at compile time, because the calls are
in a different compilation unit from the functions themselves, not to
mention that a link-time the RX functions are called via a function
pointer. Therefore the only way to do this that I am aware of, is to have a
wrapper function use for the common values inside the drivers themselves.

For example, inside the i40e driver (which I'm using because it's the one
I'm most familiar with), the main receive function is already a wrapper
around a raw receive function, using constant-expansion by the compiler of
the final parameter (NULL) to automatically remove the code for tracking
scattered packets.

uint16_t
i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
                   uint16_t nb_pkts)
{
	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
}

We can produce a version of this optimized for 32-element dequeues by
special-casing where nb_pkts == 32:

uint16_t
i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
                   uint16_t nb_pkts)
{
	if (nb_pkts == 32)
		return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
	return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
}

Now the compiler when inlining the _raw_ function, can see that it needs
two copies, and for the first, that nb_pkts is compile-time constant of 32.

However, I'm not sure how useful an optimization like this is, and I'd be
interested to see what benefits testing shows. Beyond the loop iteration
count of 32, there is also the check after each burst of 8 dequeues in the
driver to check that we have a full set of 8 - and abort the loop if not.
It's also the case that unless an app is already at maximum load (or
overloaded), one would probably not expect to always get a full set of 32
packets each time, as you have no additional headroom for more.

/Bruce


More information about the dev mailing list