[RFC 0/2] introduce LLC aware functions
    Varghese, Vipin 
    Vipin.Varghese at amd.com
       
    Thu Sep 12 13:17:19 CEST 2024
    
    
  
<snipped>
> >>>> Thank you Mattias for the information, as shared by in the reply
> >>>> with
> >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which
> >> intakes a extra argument `u32 flags`.
> >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> >> RTE_GET_LCORE_BOOST_DISABLED.
> >>>
> >>> Wouldn't using that API be pretty awkward to use?
> > Current API available under DPDK is ` rte_get_next_lcore`, which is used
> within DPDK example and in customer solution.
> > Based on the comments from others we responded to the idea of changing
> the new API from `rte_get_next_lcore_llc` to `rte_get_next_lcore_extnd`.
> >
> > Can you please help us understand what is `awkward`.
> >
>
> The awkwardness starts when you are trying to fit provide hwloc type
> information over an API that was designed for iterating over lcores.
I disagree with this point. The current implementation of the lcore library is focused only on iterating through the list of enabled cores, the core mask, and the lcore map.
With ever-increasing core counts, memory, IO and accelerators on SoCs, sub-NUMA partitioning is common across various vendors' SoCs. Enhancing or augmenting the lcore API to extract or provision NUMA and cache topology is not awkward.
If memory, IO and accelerators can have sub-NUMA domains, why is it awkward to have lcores in domains? Hence I do not agree with the awkwardness argument.
>
> It seems to me that you should either have:
> A) An API in similar to that of hwloc (or any DOM-like API), which would give a
> low-level description of the hardware in implementation terms.
> The topology would consist of nodes, with attributes, etc, where nodes are
> things like cores or instances of caches of some level and attributes are things
> like CPU actual and nominal, and maybe max frequency, cache size, or memory
> size.
Here is the catch: `rte_eal_init` internally invokes `get_cpu|lcores` and maps each thread (lcore) to a physical CPU. But there is more to it than just CPU mapping, as we have been seeing in SoC architectures. The argument shared by many is `DPDK is not the place for such topology discovery`.
As per my current understanding, I have to disagree with the above because it
1. forces the user to use external libraries such as hwloc
2. forces the user to create an internal mapping for lcore, core-mask, and lcore-map with topology awareness code.
My intention is to `enable the end user to leverage the same or a similar API format (rte_get_next_lcore)` to get the best results on any SoC (vendor agnostic).
I fail to grasp why we are asking for CPU topology to be exported via external libraries like hwloc, while NIC, PCIe and accelerators are not.
Hence let us set up a tech call on Slack or Teams to understand this better.
> or
> B) An API to be directly useful for a work scheduler, in which case you should
> abstract away things like "boost"
Please note, as shared in an earlier reply to Bruce, I made the mistake of calling it boost (AMD SoC terminology). Instead it should be DPDK_TURBO.
There are use cases and DPDK examples where crypto and compression are run on cores where TURBO is enabled. This allows end users to enable boost when there is more work and disable it when there is less or no work.
>  (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and
> have an abstract notion of what core is "close" to some other core. This would
> something like Linux'
> scheduling domains.
We had a similar discussion with Jerrin on the last day of the Bangkok DPDK summit. This RFC was intended to help capture this relevant point. As per my current understanding, on selected SoCs the little cores on an ARM SoC share the L2 cache, although this analogy does not cover all cases. But this would be a good start.
>
> If you want B you probably need A as a part of its implementation, so you may
> just as well start with A, I suppose.
>
> What you could do to explore the API design is to add support for, for
> example, boost core awareness or SMT affinity in the SW scheduler. You could
> also do an "lstopo" equivalent, since that's needed for debugging and
> exploration, if nothing else.
I am not following this analogy; we will discuss it in detail in the tech call.
>
> One question that will have to be answered in a work scheduling scenario is
> "are these two lcores SMT siblings," or "are these two cores on the same LLC",
> or "give me all lcores on a particular L2 cache".
>
Is that not what we have been trying to address, based on Anatoly's request to generalize beyond LLC? Hence we agreed to share version 2 of the RFC with `rte_get_next_lcore_extnd` taking `flags`.
May I ask where the disconnect is?
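
To make the `flags` idea concrete, here is a minimal sketch in plain C. It is not the RFC code: the `SKETCH_*` names, the flag values, and the hard-coded 8-lcore topology table are all assumptions made for illustration, and a real implementation would take its topology from the EAL rather than from a static array. It only shows how a single flags argument could answer the questions quoted above (SMT siblings, same L2, same LLC) through an iterator shaped like `rte_get_next_lcore`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the thread names the flags but not their encoding. */
#define SKETCH_LCORE_L1 (1u << 0) /* SMT siblings share an L1 */
#define SKETCH_LCORE_L2 (1u << 1)
#define SKETCH_LCORE_L3 (1u << 2)

#define SKETCH_NB_LCORES 8u

/* Mock topology: the cache instance each lcore sits behind, per level. */
struct sketch_lcore {
    uint8_t l1_id;
    uint8_t l2_id;
    uint8_t l3_id;
};

/* Assumed layout: 4 physical cores with 2 SMT threads each, 2 L3 domains. */
static const struct sketch_lcore sketch_topo[SKETCH_NB_LCORES] = {
    {0, 0, 0}, {0, 0, 0}, {1, 1, 0}, {1, 1, 0},
    {2, 2, 1}, {2, 2, 1}, {3, 3, 1}, {3, 3, 1},
};

/* True when lcores a and b share every cache level selected in flags.
 * With flags == 0 nothing is filtered, so every pair matches. */
static bool sketch_same_domain(unsigned int a, unsigned int b, uint32_t flags)
{
    assert(a < SKETCH_NB_LCORES && b < SKETCH_NB_LCORES);
    if ((flags & SKETCH_LCORE_L1) && sketch_topo[a].l1_id != sketch_topo[b].l1_id)
        return false;
    if ((flags & SKETCH_LCORE_L2) && sketch_topo[a].l2_id != sketch_topo[b].l2_id)
        return false;
    if ((flags & SKETCH_LCORE_L3) && sketch_topo[a].l3_id != sketch_topo[b].l3_id)
        return false;
    return true;
}

/* Next lcore after index i that shares the selected domain with ref.
 * Pass i = -1 to start; returns SKETCH_NB_LCORES when exhausted.
 * The walk includes ref itself. */
static unsigned int sketch_get_next_lcore_extnd(int i, unsigned int ref, uint32_t flags)
{
    assert(i >= -1);
    for (unsigned int n = (unsigned int)(i + 1); n < SKETCH_NB_LCORES; n++)
        if (sketch_same_domain(n, ref, flags))
            return n;
    return SKETCH_NB_LCORES;
}
```

In this sketch, starting from `-1` with `ref = 4` and `SKETCH_LCORE_L3` visits only the lcores behind the second L3, while `SKETCH_LCORE_L1` with `ref = 0` yields lcore 0 and its SMT sibling.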
> >>>
> >>> I mean, what you have is a topology, with nodes of different types
> >>> and with
> >> different properties, and you want to present it to the user.
> > Let me be clear, what we want via DPDK to help customer to use an Unified
> API which works across multiple platforms.
> > Example - let a vendor have 2 products namely A and B. CPU-A has all cores
> within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA
> domain based on split LLC.
> > When `rte_get_next_lcore_extnd` is invoked for `LLC` on 1. CPU-A: it
> > returns all cores as there is no split 2. CPU-B: it returns cores from
> > specific sub-NUMA which is partitioned by L3
> >
>
> I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :)
I humbly disagree with this statement, as explained above.
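
The CPU-A / CPU-B example quoted above can be sketched as follows. The two tables and the helper name are hypothetical and only stand in for whatever the platform reports: CPU-A places all 8 lcores behind one LLC, CPU-B splits them across two. The same call returns every lcore on CPU-A and only the local sub-NUMA group on CPU-B, which is the behaviour intended for the `LLC` flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical LLC id per lcore for the two example products. */
static const int sketch_cpu_a[8] = {0, 0, 0, 0, 0, 0, 0, 0}; /* one LLC, no split */
static const int sketch_cpu_b[8] = {0, 0, 0, 0, 1, 1, 1, 1}; /* LLC split in two */

/* Bitmask of all lcores that share an LLC with lcore ref.
 * Bit i is set when lcore i sits behind the same LLC as ref. */
static uint64_t sketch_llc_mask(const int *llc_of, unsigned int nb, unsigned int ref)
{
    assert(nb <= 64 && ref < nb);
    uint64_t mask = 0;
    for (unsigned int i = 0; i < nb; i++)
        if (llc_of[i] == llc_of[ref])
            mask |= UINT64_C(1) << i;
    return mask;
}
```

Application code stays identical on both products; only the table behind the call differs.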
>
> My gut feeling is to make it more explicit and forget about <rte_lcore.h>.
> <rte_hwtopo.h>? Could and should still be EAL.
For me this is like adding a new level of library and more code, while the easiest way is to add an API similar to the existing `get_next_lcore` style for easy adoption.
>
> >>>
> >>> In a sense, it's similar to XCM and DOM versus SAX. The above is
> >>> SAX-style,
> >> and what I have in mind is something DOM-like.
> >>>
> >>> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them.
> > Exactly.
> >
> >   It's not going to "skip" (ignore, leave unused)
> >> SMT siblings, or skip non-boosted cores, it would just try to be
> >> clever in regards to which cores to use for what purpose.
> > Let me try to share my idea on SMT siblings. When
> `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and an `lcore`, the API
> first identifies whether the given lcore is part of the enabled core list.
> > If yes, it programmatically either using `sysfs` or `hwloc library (shared the
> version concern on distros. Will recheck again)` identify the sibling thread and
> return.
> > If there is no sibling thread available under DPDK it will fetch next lcore
> (probably lcore +1 ).
> >
>
> Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If only that was the issue, then it would be better to help the hwloc and/or distributions, rather than the DPDK project.
I do not agree with the statement `Distributions having old hwloc versions isn't an argument for a new DPDK library or new API`, because this is not what my intention is. Let me be clear: on Ampere & AMD there are 2 BIOS settings
1. SLC or L3 as NUMA enable
2. NUMA for IO|memory
When `NUMA for IO|memory` is set, the hwloc library works as expected. But when `L3 as NUMA` is set, it gives incorrect details. We have been fixing this and pushing the fixes upstream. But as I clearly shared, almost no distro ships the latest hwloc version.
Hence, to keep things simple, the DPDK documentation points to the AMD SoC tuning guide, and we have been recommending not to enable `L3 as NUMA`.
Now the end goal for me is to allow a vendor agnostic API which is easy to understand and use, and works irrespective of BIOS settings. I have enabled parsing of OS `sysfs` as an RFC. But if the comment is to use `hwloc`, as shared in my response to Stephen, I am open to trying this again.
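
For reference, the `sysfs` route boils down to reading CPU list strings such as `0-3,8-11`. On Linux these come from files like `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` and `/sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_list`. The sketch below is not the RFC code; the function names are hypothetical, it handles at most 64 CPUs, and it parses an in-memory string so the file I/O is left out. The sibling helper follows the fallback described earlier in the thread (return the sibling if there is one, otherwise `lcore + 1`) and, as a simplification, does not check the enabled core list.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a sysfs-style CPU list ("0-3,8-11\n") into a 64-bit mask.
 * Returns 0 on success, -1 on malformed input or a CPU number >= 64. */
static int sketch_parse_cpulist(const char *s, uint64_t *mask)
{
    assert(s != NULL && mask != NULL);
    *mask = 0;
    while (*s != '\0' && *s != '\n') {
        char *end;
        if (!isdigit((unsigned char)*s))
            return -1;
        unsigned long lo = strtoul(s, &end, 10);
        unsigned long hi = lo;
        if (*end == '-') {              /* range "lo-hi" */
            s = end + 1;
            if (!isdigit((unsigned char)*s))
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (hi < lo || hi >= 64)
            return -1;
        for (unsigned long c = lo; c <= hi; c++)
            *mask |= UINT64_C(1) << c;
        if (*end == ',')
            end++;                      /* move on to the next entry */
        else if (*end != '\0' && *end != '\n')
            return -1;
        s = end;
    }
    return 0;
}

/* Convenience wrapper: the parsed mask, or 0 when the list is malformed. */
static uint64_t sketch_cpulist_mask(const char *s)
{
    uint64_t mask;
    return sketch_parse_cpulist(s, &mask) == 0 ? mask : 0;
}

/* First CPU in siblings other than lcore; falls back to lcore + 1
 * when lcore has no sibling in the mask. */
static unsigned int sketch_sibling_or_next(unsigned int lcore, uint64_t siblings)
{
    for (unsigned int c = 0; c < 64; c++)
        if (c != lcore && (siblings & (UINT64_C(1) << c)))
            return c;
    return lcore + 1;
}
```

Because only the list format is parsed, the same helper serves SMT siblings and any cache level; which `indexM` directory corresponds to L3 should be checked on the target system rather than assumed.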
<snipped>