[dpdk-dev] [PATCH v1] event/sw: performance improvements

Nicolau, Radu radu.nicolau at intel.com
Wed Oct 7 12:44:39 CEST 2020


On 10/6/2020 11:13 AM, Ananyev, Konstantin wrote:
>>> -----Original Message-----
>>> From: Jerin Jacob <jerinjacobk at gmail.com>
>>> Sent: Monday, October 5, 2020 5:35 PM
>>> To: Nicolau, Radu <radu.nicolau at intel.com>
>>> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>; Richardson, Bruce
>>> <bruce.richardson at intel.com>; Ananyev, Konstantin
>>> <konstantin.ananyev at intel.com>; Van Haaren, Harry
>>> <harry.van.haaren at intel.com>; dev at dpdk.org; jerinj at marvell.com; nd
>>> <nd at arm.com>
>>> Subject: Re: [dpdk-dev] [PATCH v1] event/sw: performance improvements
>>>
>>> On Tue, Sep 29, 2020 at 2:32 PM Nicolau, Radu <radu.nicolau at intel.com> wrote:
>>>>
>>>> On 9/28/2020 5:02 PM, Honnappa Nagarahalli wrote:
>>>>> <snip>
>>>>>>> Add minimum burst throughout the scheduler pipeline and a flush counter.
>>>>>>> Replace ring API calls with local single threaded implementation where
>>>>>>> possible.
>>>>>>>
>>>>>>> Signed-off-by: Radu Nicolau mailto:radu.nicolau at intel.com
>>>>>>>
>>>>>>> Thanks for the patch, a few comments inline.
>>>>>>>
>>>>>>> Why not make these APIs part of the rte_ring library? You could further
>>>>>> optimize them by keeping the indices on the same cacheline.
>>>>>>> I'm not sure there is any need for non thread-safe rings outside this
>>>>>> particular case.
>>>>>>> [Honnappa] I think if we add the APIs, we will find the use cases.
>>>>>>> But, more than that, I understand that rte_ring structure is exposed to the
>>>>>> application. The reason for doing that is the inline functions that rte_ring
>>>>>> provides. IMO, we should still maintain modularity and should not use the
>>>>>> internals of the rte_ring structure outside of the library.
>>>>>>> +1 to that.
>>>>>>>
>>>>>>> BTW, is there any real perf benefit from such micor-optimisation?
>>>>>> I'd tend to view these as use-case specific, and I'm not sure we should clutter
>>>>>> up the ring library with yet more functions, especially since they can't be
>>>>>> mixed with the existing enqueue/dequeue functions, since they don't use
>>>>>> the head pointers.
>>>>> IMO, the ring library is pretty organized with the recent addition of HTS/RTS
>>> modes. This can be one of the modes and should allow us to use the existing
>>> functions (though additional functions are required as well).
>>>>> The other concern I have is, this implementation can be further optimized by
>>> using a single cache line for the pointers. It uses 2 cache lines just because of the
>>> layout of the rte_ring structure.
>>>>> There was a question earlier about the performance improvements of this
>>> patch? Are there any % performance improvements that can be shared?
>>>>> It is also possible to change the above functions to use the head/tail pointers
>>> from producer or the consumer cache line alone to check for perf differences.
>>>> I don't have a % for the final improvement for this change alone, but
>>>> there was some improvement in the memory overhead measurable during
>>>> development, which very likely resulted in the whole optimization having
>>>> more headroom.
>>>>
>>>> I agree that this may be further optimized, maybe by having a local
>>>> implementation of a ring-like container instead.
>>> Have we decided on the next steps for this patch? Is the plan to
>>> supersede this patch and have different
>>> one in rte_ring subsystem,
>> My preference is to merge this version of the patch;
>> 1) The ring helper functions are stripped to the SW PMD usage, and not valid to use in the general.
>> 2) Adding static inline APIs in an LTS without extensive doesn't seem a good idea.
>>
>> If Honnappa is OK with the above solution for 20.11, we can see about moving the rings part of the
>> code to rte_ring library location in 21.02, and give ourselves some time to settle the usage/API before
>> the next LTS.
>>
> As ring library maintainer I share Honnappa concern that another library not uses public ring API,
> but instead accesses ring internals directly. Obviously such coding practice is not welcomed
> as it makes harder to maintain/extend ring library in future.
> About 2) - these new API can(/shoud) be marked an experimental anyway.
> As another thing - it is still unclear what a performance gain we are talking about here.
> Is it really worth it comparing to just using SP/SC?

The change itself came after I analyzed the memory bound sections of the 
code, and I just did a quick test, I got about 3.5% improvement in 
throughput,  maybe not so much but significant for such a small change, 
and depending on the usecase it may be more.

As for the implementation itself, I would favour having a custom ring 
like container in the PMD code, this will solve the issue of using 
rte_ring internals while still allow for full optimisation. If this is 
acceptable, I will follow up by tomorrow.



More information about the dev mailing list