[RFC 0/2] introduce LLC aware functions

Mattias Rönnblom hofors at lysator.liu.se
Wed Sep 4 11:30:59 CEST 2024


On 2024-09-02 02:39, Varghese, Vipin wrote:
> <snipped>
> 
> Thank you, Mattias, for the comments and questions. Let me try to 
> explain below.
> 
>> Shouldn't we have a separate CPU/cache hierarchy API instead?
> 
> The intention is to group lcores which share the same L3 (for better 
> cache hits and fewer noisy neighbors), so the current API focuses on 
> the Last Level Cache. But if the suggestion is `there are SoCs where 
> the L2 cache is also shared, and the new API should provide for that`, 
> I am also comfortable with the thought.
> 

Rather than some AMD special case API hacked into <rte_lcore.h>, I think 
we are better off with no DPDK API at all for this kind of functionality.

A DPDK CPU/memory hierarchy topology API very much makes sense, but it 
should be reasonably generic and complete from the start.

>>
>> Could potentially be built on the 'hwloc' library.
> 
> There are 3 reasons we did not explore this path on AMD SoCs:
> 
> 1. depending on the hwloc version and kernel version, certain SoC 
> hierarchies are not available
> 
> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD EPYC SoCs.
> 
> 3. it adds an extra library dependency that must be made available for 
> this to work.
> 
> 
> Hence we have tried to use the generic, documented Linux layer of 
> `sysfs CPU cache`.
> 
> I will try to explore hwloc further and check whether other libraries 
> within DPDK leverage it.
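
For readers unfamiliar with the `sysfs CPU cache` layer mentioned above, 
here is a minimal sketch of reading it directly. It relies only on the 
documented Linux file 
/sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_list. The helper 
names parse_cpu_list() and cache_shared_cpus() are made up for this 
illustration and are not part of the RFC or of DPDK. Note that the sysfs 
indexM directory number is not the cache level; the sibling `level` file 
reports the level (on x86, index3 is typically the L3).

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_CPU_ID (1u << 20)	/* arbitrary sanity bound for this sketch */

/* Parse a sysfs CPU list such as "0-7,64-71" into cpus[].
 * Returns the number of CPUs listed, or -1 on malformed input.
 * At most max entries are stored; the return value may exceed max. */
static int
parse_cpu_list(const char *s, unsigned int *cpus, size_t max)
{
	int n = 0;

	while (*s != '\0' && *s != '\n') {
		char *end;
		unsigned long lo = strtoul(s, &end, 10);
		unsigned long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {
			s = end + 1;
			hi = strtoul(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		if (hi >= MAX_CPU_ID)
			return -1;
		for (unsigned long c = lo; c <= hi; c++) {
			if ((size_t)n < max)
				cpus[n] = (unsigned int)c;
			n++;
		}
		if (*end == ',')
			end++;
		else if (*end != '\0' && *end != '\n')
			return -1;
		s = end;
	}
	return n;
}

/* Hypothetical helper: list the CPUs sharing cache indexM with 'cpu'.
 * Returns the CPU count, or -1 if the sysfs file cannot be read. */
static int
cache_shared_cpus(unsigned int cpu, unsigned int index,
		  unsigned int *cpus, size_t max)
{
	char path[128], line[4096];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/cache/index%u/shared_cpu_list",
		 cpu, index);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(line, sizeof(line), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_cpu_list(line, cpus, max);
}
```

Calling cache_shared_cpus(0, 3, cpus, 64) would fill cpus[] with the CPUs 
that share index3 with CPU 0, assuming that directory exists on the 
machine. These are kernel CPU numbers; mapping them to DPDK lcore ids is 
a separate step not shown here.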
> 
>>
>> I very much agree cache/core topology may be of interest to the
>> application (or a work scheduler, like a DPDK event device), but it's
>> not limited to LLC. It may well be worthwhile to care about which cores
>> share an L2 cache, for example. Not sure the RTE_LCORE_FOREACH_*
>> approach scales.
> 
> Yes, totally understood; on some SoCs, multiple lcores share the same 
> L2 cache.
> 
> 
> Can we rework the API to be rte_get_cache_<function>, where the user 
> argument is the desired lcore index?
> 
> 1. index-1: SMT threads
> 
> 2. index-2: threads sharing same L2 cache
> 
> 3. index-3: threads sharing same L3 cache
> 
> 4. index-MAX: identify the threads sharing last level cache.
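
To make the index-based proposal above concrete, here is a rough sketch 
of what a single level-parameterized query could look like. Everything 
in it is hypothetical: the enum cache_level values, the struct cache_topo 
table and the cache_peers() function are invented for illustration and 
do not exist in DPDK or in the RFC patches. The topology table is assumed 
to have been filled in beforehand (for example from sysfs or hwloc).

```c
#include <stddef.h>

/* Hypothetical sharing levels, mirroring index-1 .. index-MAX above. */
enum cache_level {
	CACHE_LEVEL_SMT = 1,	/* hardware threads of the same core */
	CACHE_LEVEL_L2  = 2,
	CACHE_LEVEL_L3  = 3,
	CACHE_LEVEL_MAX = 255	/* resolved to the deepest level present */
};

#define TOPO_MAX_LEVELS 3

/* Hypothetical topology table: domain[l][c] is the id of the sharing
 * domain that lcore c belongs to at level l + 1. */
struct cache_topo {
	unsigned int n_lcores;
	unsigned int n_levels;	/* rows in use, <= TOPO_MAX_LEVELS */
	const unsigned int *domain[TOPO_MAX_LEVELS];
};

/* Collect the lcores (including 'lcore' itself) that share the given
 * level with 'lcore'. Stores at most max ids in out[] and returns the
 * total number of peers, or -1 on an invalid lcore or level. */
static int
cache_peers(const struct cache_topo *t, unsigned int lcore,
	    unsigned int level, unsigned int *out, size_t max)
{
	const unsigned int *dom;
	unsigned int i;
	int n = 0;

	if (level == CACHE_LEVEL_MAX)
		level = t->n_levels;
	if (level == 0 || level > t->n_levels || lcore >= t->n_lcores)
		return -1;

	dom = t->domain[level - 1];
	for (i = 0; i < t->n_lcores; i++) {
		if (dom[i] != dom[lcore])
			continue;
		if ((size_t)n < max)
			out[n] = i;
		n++;
	}
	return n;
}
```

With one function taking the level as an argument, the per-level 
iteration macros would reduce to loops over the returned array, and 
adding a level would not require a new macro family.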
> 
>>
>>> < Function: Purpose >
>>> ---------------------
>>>   - rte_get_llc_first_lcores: Retrieves all the first lcores in the 
>>> shared LLC.
>>>   - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>   - rte_get_llc_n_lcore: Retrieves the first n or skips the first n 
>>> lcores in the shared LLC.
>>>
>>> < MACRO: Purpose >
>>> ------------------
>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from 
>>> each LLC.
>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker 
>>> lcore from each LLC.
>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint 
>>> (lcore id).
>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC 
>>> while skipping first worker.
>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores 
>>> from each LLC.
>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then 
>>> iterates through the remaining lcores in each LLC.
>>>
> While the macros are simple wrappers invoking the appropriate API, can 
> this be worked out in this fashion?
> 
> <snipped>

