[PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts

Mattias Rönnblom hofors at lysator.liu.se
Wed May 17 09:16:24 CEST 2023
Previous message (by thread): [PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts
Next message (by thread): [PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 2023-05-16 15:08, Jerin Jacob wrote:
> On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hofors at lysator.liu.se> wrote:
>>
>> On 2023-05-15 14:38, Jerin Jacob wrote:
>>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
>>> <mattias.ronnblom at ericsson.com> wrote:
>>>>
>>>> On 2023-05-12 13:59, Jerin Jacob wrote:
>>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
>>>>> <mattias.ronnblom at ericsson.com> wrote:
>>>>>>
>>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
>>>>>> dequeue only when the burst size is compile-time constant (and equal
>>>>>> to one).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> v3: Actually include the change v2 claimed to contain.
>>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>>>>>>        application is compiled with -pedantic. (Morten Brørup)
>>>>>> ---
>>>>>>     lib/eventdev/rte_eventdev.h | 4 ++--
>>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
>>>>>> index a90e23ac8b..a471caeb6d 100644
>>>>>> --- a/lib/eventdev/rte_eventdev.h
>>>>>> +++ b/lib/eventdev/rte_eventdev.h
>>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
>>>>>>             * Allow zero cost non burst mode routine invocation if application
>>>>>>             * requests nb_events as const one
>>>>>>             */
>>>>>> -       if (nb_events == 1)
>>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>>
>>>>> "Why" part is not clear from the commit message. Is this to avoid
>>>>> nb_events read if it is built-in const.
>>>>
>>>> The __builtin_constant_p() is introduced to avoid having the compiler
>>>> generate a conditional branch and two different code paths in case
>>>> nb_elem is a run-time variable.
>>>>
>>>> In particular, this matters if nb_elems is run-time variable and varies
>>>> between 1 and some larger value.
>>>>
>>>> I should have mention this in the commit message.
>>>>
>>>> A very slight performance improvement. It also makes the code better
>>>> match the comment, imo. Zero cost for const one enqueues, but no impact
>>>> non-compile-time-constant-length enqueues.
>>>>
>>>> Feel free to ignore.
>>>
>>>
>>> I did some performance comparison of the patch.
>>> A low-end ARM machines shows 0.7%  drop with single event case. No
>>> regression see with high-end ARM cores with single event case.
>>>
>>> IMO, optimizing the check for burst mode(the new patch) may not show
>>> any real improvement as the cost is divided by number of event.
>>> Whereas optimizing the check for single event case(The current code)
>>> shows better performance with single event case and no regression
>>> with burst mode as cost is divided by number of events.
>>
>> I ran some tests on an AMD Zen 3 with DSW.
>> In the below tests the enqueue burst size is not compile-time constant.
>>
>> Enqueue burst size      Performance improvement
>> Run-time constant 1     ~5%
>> Run-time constant 2     ~0%
>> Run-time variable 1-2   ~9%
>> Run-time variable 1-16  ~0%
>>
>> The run-time variable enqueue sizes randomly (uniformly) distributed in
>> the specified range.
>>
>> The first result may come as a surprise. The benchmark is using
>> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
>> in most apps). The single-event enqueue function only exists in a
>> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
>> I suspect that is the reason for the performance improvement.
>>
>> This effect is large-enough to make it somewhat beneficial (+~1%) to use
>> run-time variable single-event enqueue compared to keeping the burst
>> size compile-time constant.
> 
> # Interesting, Could you share your testeventdev command to test it.

I'm using a proprietary benchmark to evaluate the effect of these 
changes. There's certainly nothing secret about that program, and also 
nothing very DSW-specific either. I hope to at some point both extend 
DPDK eventdev tests to include DSW, and also to contribute 
benchmarks/characteristics tests (perf unit tests or as a separate 
program), if there seems to be a value in this.

> # By having quick glance on DSW code, following change can be added(or
>   similar cases).
> Not sure such change in DSW driver is making a difference or nor?
> 
> 
> diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
> index e84b65d99f..455470997b 100644
> --- a/drivers/event/dsw/dsw_event.c
> +++ b/drivers/event/dsw/dsw_event.c
> @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
> *dsw, struct dsw_port *source_port)
>   uint16_t
>   dsw_event_enqueue(void *port, const struct rte_event *ev)
>   {
> -       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
> +       return dsw_event_enqueue_burst(port, ev, 1);

Good point.

Historical note: I think that comparison is old cruft borne out of a 
misconception, that the single-event enqueue could be called directly 
from application code, combined with the fact that producer-only ports 
needed some way to "maintain" a port, prior to the introduction of 
rte_event_maintain().

>   }
> 
>   static __rte_always_inline uint16_t
> @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
> *source_port,
>          return (num_non_release + num_release);
>   }
> 
> -uint16_t
> +inline uint16_t

 From what it seems, this does not have the desired effect, at least not 
on GCC 11.3 (w/ the default DPDK compiler configuration).

I reached this conclusion when I noticed that if I reshuffle the code so 
to force (not hint) the inlining of the burst (and generic burst) 
enqueue function into dsw_event_enqueue(), your change performs better.

>   dsw_event_enqueue_burst(void *port, const struct rte_event events[],
>                          uint16_t events_len)
>   {
> 
> # I am testing with command like this "/build/app/dpdk-test-eventdev
> -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
> --stlist p --nb_pkts=10000000000"
> 


I re-ran the compile-time variable, run-time constant enqueue size of 1, 
and I got the following:

Jerin's change: +4%
Jerin's change + ensure inlining: +6%
RFC v3: +7%

(Here I use a more different setup that produces more deterministic 
results, hence the different numbers compared to the previous runs. They 
were using a pipeline spread over two chiplets, and these runs are using 
only a single chiplet.)

It seems like with your suggested changes you eliminate most of the 
single-enqueue-special case performance degradation (for DSW), but not 
all of it. The remaining degradation is very small (for the above case, 
larger for small by run-time variable enqueue sizes), but it's a little 
sad that a supposedly performance-enhancing special case (that drives 
complexity in the code, although not much) actually degrades performance.

>>
>> The performance gain is counted toward both enqueue and dequeue costs
>> (+benchmark app overhead), so an under-estimation if see this as an
>> enqueue performance improvement.
>>
>>> If you agree, then we can skip this patch.
>>>
>>
>> I have no strong opinion if this should be included or not.
>>
>> It was up to me, I would drop the single-enqueue special case handling
>> altogether in the next ABI update.
> 
> That's a reasonable path. If we are willing to push a patch, we can
> test it and give feedback.
> Or in our spare time, We can do that as well.
> 

Sure, I'll give it a try.

The next release is an ABI-breaking one?

>>
>>>
>>>>
>>>>> If so, check should be following. Right?
>>>>>
>>>>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>> || nb_events  == 1)
>>>>>
>>>>> At least, It was my original intention in the code.
>>>>>
>>>>>
>>>>>
>>>>>>                    return (fp_ops->enqueue)(port, ev);
>>>>>>            else
>>>>>>                    return fn(port, ev, nb_events);
>>>>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>>>>>>             * Allow zero cost non burst mode routine invocation if application
>>>>>>             * requests nb_events as const one
>>>>>>             */
>>>>>> -       if (nb_events == 1)
>>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>>>                    return (fp_ops->dequeue)(port, ev, timeout_ticks);
>>>>>>            else
>>>>>>                    return (fp_ops->dequeue_burst)(port, ev, nb_events,
>>>>>> --
>>>>>> 2.34.1
>>>>>>
>>>>
Previous message (by thread): [PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts
Next message (by thread): [PATCH v3] eventdev: avoid non-burst shortcut for variable-size bursts
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list