[PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts
Mattias Rönnblom
hofors at lysator.liu.se
Wed May 17 09:16:24 CEST 2023
On 2023-05-16 15:08, Jerin Jacob wrote:
> On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hofors at lysator.liu.se> wrote:
>>
>> On 2023-05-15 14:38, Jerin Jacob wrote:
>>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
>>> <mattias.ronnblom at ericsson.com> wrote:
>>>>
>>>> On 2023-05-12 13:59, Jerin Jacob wrote:
>>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
>>>>> <mattias.ronnblom at ericsson.com> wrote:
>>>>>>
>>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
>>>>>> dequeue only when the burst size is compile-time constant (and equal
>>>>>> to one).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> v3: Actually include the change v2 claimed to contain.
>>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>>>>>> application is compiled with -pedantic. (Morten Brørup)
>>>>>> ---
>>>>>> lib/eventdev/rte_eventdev.h | 4 ++--
>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
>>>>>> index a90e23ac8b..a471caeb6d 100644
>>>>>> --- a/lib/eventdev/rte_eventdev.h
>>>>>> +++ b/lib/eventdev/rte_eventdev.h
>>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
>>>>>> * Allow zero cost non burst mode routine invocation if application
>>>>>> * requests nb_events as const one
>>>>>> */
>>>>>> - if (nb_events == 1)
>>>>>> + if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>>
>>>>> "Why" part is not clear from the commit message. Is this to avoid
>>>>> nb_events read if it is built-in const.
>>>>
>>>> The __builtin_constant_p() is introduced to avoid having the compiler
>>>> generate a conditional branch and two different code paths in case
>>>> nb_elem is a run-time variable.
>>>>
>>>> In particular, this matters if nb_elems is run-time variable and varies
>>>> between 1 and some larger value.
>>>>
>>>> I should have mention this in the commit message.
>>>>
>>>> A very slight performance improvement. It also makes the code better
>>>> match the comment, imo. Zero cost for const one enqueues, but no impact
>>>> non-compile-time-constant-length enqueues.
>>>>
>>>> Feel free to ignore.
>>>
>>>
>>> I did some performance comparison of the patch.
>>> A low-end ARM machines shows 0.7% drop with single event case. No
>>> regression see with high-end ARM cores with single event case.
>>>
>>> IMO, optimizing the check for burst mode(the new patch) may not show
>>> any real improvement as the cost is divided by number of event.
>>> Whereas optimizing the check for single event case(The current code)
>>> shows better performance with single event case and no regression
>>> with burst mode as cost is divided by number of events.
>>
>> I ran some tests on an AMD Zen 3 with DSW.
>> In the below tests the enqueue burst size is not compile-time constant.
>>
>> Enqueue burst size Performance improvement
>> Run-time constant 1 ~5%
>> Run-time constant 2 ~0%
>> Run-time variable 1-2 ~9%
>> Run-time variable 1-16 ~0%
>>
>> The run-time variable enqueue sizes randomly (uniformly) distributed in
>> the specified range.
>>
>> The first result may come as a surprise. The benchmark is using
>> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
>> in most apps). The single-event enqueue function only exists in a
>> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
>> I suspect that is the reason for the performance improvement.
>>
>> This effect is large-enough to make it somewhat beneficial (+~1%) to use
>> run-time variable single-event enqueue compared to keeping the burst
>> size compile-time constant.
>
> # Interesting, Could you share your testeventdev command to test it.
I'm using a proprietary benchmark to evaluate the effect of these
changes. There's certainly nothing secret about that program, and also
nothing very DSW-specific either. I hope to at some point both extend
DPDK eventdev tests to include DSW, and also to contribute
benchmarks/characteristics tests (perf unit tests or as a separate
program), if there seems to be a value in this.
> # By having quick glance on DSW code, following change can be added(or
> similar cases).
> Not sure such change in DSW driver is making a difference or nor?
>
>
> diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
> index e84b65d99f..455470997b 100644
> --- a/drivers/event/dsw/dsw_event.c
> +++ b/drivers/event/dsw/dsw_event.c
> @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
> *dsw, struct dsw_port *source_port)
> uint16_t
> dsw_event_enqueue(void *port, const struct rte_event *ev)
> {
> - return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
> + return dsw_event_enqueue_burst(port, ev, 1);
Good point.
Historical note: I think that comparison is old cruft borne out of a
misconception, that the single-event enqueue could be called directly
from application code, combined with the fact that producer-only ports
needed some way to "maintain" a port, prior to the introduction of
rte_event_maintain().
> }
>
> static __rte_always_inline uint16_t
> @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
> *source_port,
> return (num_non_release + num_release);
> }
>
> -uint16_t
> +inline uint16_t
From what it seems, this does not have the desired effect, at least not
on GCC 11.3 (w/ the default DPDK compiler configuration).
I reached this conclusion when I noticed that if I reshuffle the code so
to force (not hint) the inlining of the burst (and generic burst)
enqueue function into dsw_event_enqueue(), your change performs better.
> dsw_event_enqueue_burst(void *port, const struct rte_event events[],
> uint16_t events_len)
> {
>
> # I am testing with command like this "/build/app/dpdk-test-eventdev
> -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
> --stlist p --nb_pkts=10000000000"
>
I re-ran the compile-time variable, run-time constant enqueue size of 1,
and I got the following:
Jerin's change: +4%
Jerin's change + ensure inlining: +6%
RFC v3: +7%
(Here I use a more different setup that produces more deterministic
results, hence the different numbers compared to the previous runs. They
were using a pipeline spread over two chiplets, and these runs are using
only a single chiplet.)
It seems like with your suggested changes you eliminate most of the
single-enqueue-special case performance degradation (for DSW), but not
all of it. The remaining degradation is very small (for the above case,
larger for small by run-time variable enqueue sizes), but it's a little
sad that a supposedly performance-enhancing special case (that drives
complexity in the code, although not much) actually degrades performance.
>>
>> The performance gain is counted toward both enqueue and dequeue costs
>> (+benchmark app overhead), so an under-estimation if see this as an
>> enqueue performance improvement.
>>
>>> If you agree, then we can skip this patch.
>>>
>>
>> I have no strong opinion if this should be included or not.
>>
>> It was up to me, I would drop the single-enqueue special case handling
>> altogether in the next ABI update.
>
> That's a reasonable path. If we are willing to push a patch, we can
> test it and give feedback.
> Or in our spare time, We can do that as well.
>
Sure, I'll give it a try.
The next release is an ABI-breaking one?
>>
>>>
>>>>
>>>>> If so, check should be following. Right?
>>>>>
>>>>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>> || nb_events == 1)
>>>>>
>>>>> At least, It was my original intention in the code.
>>>>>
>>>>>
>>>>>
>>>>>> return (fp_ops->enqueue)(port, ev);
>>>>>> else
>>>>>> return fn(port, ev, nb_events);
>>>>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>>>>>> * Allow zero cost non burst mode routine invocation if application
>>>>>> * requests nb_events as const one
>>>>>> */
>>>>>> - if (nb_events == 1)
>>>>>> + if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>>> return (fp_ops->dequeue)(port, ev, timeout_ticks);
>>>>>> else
>>>>>> return (fp_ops->dequeue_burst)(port, ev, nb_events,
>>>>>> --
>>>>>> 2.34.1
>>>>>>
>>>>
More information about the dev
mailing list