[PATCH 1/4] latencystats: use alloca instead of vla trivial
Stephen Hemminger
stephen at networkplumber.org
Sun Apr 7 19:00:08 CEST 2024
On Sun, 7 Apr 2024 11:36:59 +0200
Mattias Rönnblom <hofors at lysator.liu.se> wrote:
> On 2024-04-06 17:28, Morten Brørup wrote:
> >> From: Tyler Retzlaff [mailto:roretzla at linux.microsoft.com]
> >> Sent: Thursday, 4 April 2024 19.15
> >>
> >> RFC sample illustrating simple conversion of VLA to alloca().
> >>
> >> Signed-off-by: Tyler Retzlaff <roretzla at linux.microsoft.com>
> >> ---
> >
> > [...]
> >
> >> --- a/lib/latencystats/rte_latencystats.c
> >> +++ b/lib/latencystats/rte_latencystats.c
> >> @@ -159,7 +159,7 @@ struct latency_stats_nameoff {
> >> {
> >> unsigned int i, cnt = 0;
> >> uint64_t now;
> >> - float latency[nb_pkts];
> >> + float *latency = alloca(sizeof(float) * nb_pkts);
> >
> > In cases where we are processing packet bursts, I would prefer introducing a global #define RTE_MAX_PKT_BURST_SIZE, indicating the max packet burst size supported by libraries and drivers.
>
> First question: what is meant by a "packet" here? An mbuf? A
> network-layer PDU? Something that in some way relates to zero or more
> packets, like an rte_event? Or just any object that is sent or received
> by some DPDK API in batches or bursts?
>
> Second question: is RTE_MAX_PKT_BURST_SIZE meant as an upper bound, so
> that no API can consume or produce a burst larger than this, or do all
> APIs literally have to support that burst size?
>
> Third question: why not just keep VLAs?
>
> > For reference, rte_config.h already has #define RTE_GRAPH_BURST_SIZE 256.
> >
> > Such a common define should also be used by functions such as rte_pktmbuf_free_bulk(), although that function also supports segmented packets, so it must still be able to handle more mbufs.
> > https://elixir.bootlin.com/dpdk/v24.03/source/lib/mbuf/rte_mbuf.c#L486
> >
Looking at the maths here, calc_latency() can be seriously improved:
- calc_latency() is in the fast path, for transmit.
- it is doing floating-point math; floating point is much slower than
fixed point.
- the latency[] array is a temporary; it should be possible to compute
the total latency without it.
- it acquires a lock; to achieve DPDK-level performance of 40 Mpps, it is
necessary to do the absolute minimum of locking.