[RFC 0/2] introduce LLC aware functions

Burakov, Anatoly anatoly.burakov at intel.com
Thu Sep 5 16:45:10 CEST 2024


On 9/5/2024 3:05 PM, Ferruh Yigit wrote:
> On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
>> On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
>>> <snipped>
>>>>>

Hi Ferruh,

>>
>> I feel like there's a disconnect between my understanding of the problem
>> space and yours, so I'm going to ask a very basic question:
>>
>> Assuming the user has configured their AMD system correctly (i.e.
>> enabled L3 as NUMA), are there any problems to be solved by adding a new
>> API? Does the system not report each L3 as a separate NUMA node?
>>
> 
> Hi Anatoly,
> 
> Let me try to answer.
> 
> To start with, Intel "Sub-NUMA Clustering" and AMD NUMA are different;
> as far as I understand, SNC is more similar to classic physical-socket
> based NUMA.
> 
> The following is the AMD CPU layout:
>        ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │TILE1││TILE2││          ││TILE5││TILE6│
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        └─────┘└─────┘│    IO    │└─────┘└─────┘
>        ┌─────┐┌─────┐│   TILE   │┌─────┐┌─────┐
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │TILE3││TILE4││          ││TILE7││TILE8│
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        └─────┘└─────┘└──────────┘└─────┘└─────┘
> 
> Each 'Tile' has multiple cores, and the 'IO Tile' has the memory
> controller, bus controllers, etc.
> 
> When NPS=x is configured in the BIOS, the IO tile resources are split
> and each part is seen as a separate NUMA node.
> 
> The following is NPS=4:
>        ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        │TILE1││TILE2││     .    ││TILE5││TILE6│
>        │     ││     ││NUMA .NUMA││     ││     │
>        │     ││     ││ 0   . 1  ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        └─────┘└─────┘│     .    │└─────┘└─────┘
>        ┌─────┐┌─────┐│..........│┌─────┐┌─────┐
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││NUMA .NUMA││     ││     │
>        │TILE3││TILE4││ 2   . 3  ││TILE7││TILE8│
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        └─────┘└─────┘└─────.────┘└─────┘└─────┘
> 
> The benefit of this approach is that all cores can access all NUMA
> nodes without any penalty. For example, a DPDK application can use
> cores from 'TILE1', 'TILE4' & 'TILE7' to access NUMA0 (or any NUMA
> node) resources at high performance.
> This is different from SNC, where cores accessing cross-NUMA resources
> are hit by a performance penalty.
> 
> Now, although which tile the cores come from doesn't matter from a NUMA
> perspective, it may matter (depending on the workload) to have them
> under the same LLC.
> 
> One way to make sure all cores are under the same LLC is to enable the
> "L3 as NUMA" BIOS option, which makes each TILE show up as a different
> NUMA node, so the user can select cores from one NUMA node.
> This is sufficient up to a point, but not when the application needs a
> number of cores that spans multiple tiles.
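> 
> For illustration, with "L3 as NUMA" enabled the existing EAL API is
> already enough to pick cores from a single tile; a minimal sketch
> using the rte_lcore.h helpers:
> 
> #include <rte_lcore.h>
> 
> /* Count the worker lcores that belong to a given NUMA node; with
>  * "L3 as NUMA" enabled, each node corresponds to one tile. */
> static unsigned int
> count_lcores_on_node(unsigned int socket_id)
> {
> 	unsigned int lcore_id, n = 0;
> 
> 	RTE_LCORE_FOREACH_WORKER(lcore_id) {
> 		if (rte_lcore_to_socket_id(lcore_id) == socket_id)
> 			n++;
> 	}
> 	return n;
> }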
> 
> Assume each tile has 8 cores and the application needs 24 cores. When
> the user provides all cores from TILE1, TILE2 & TILE3, there is
> currently no way in DPDK for the application to figure out how to
> group/select these cores to use them efficiently.
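> 
> Today that information is only there in raw sysfs; a hedged sketch of
> what an application would have to do by hand (assumes Linux, and that
> cache/index3 is the LLC -- the helper name is made up):
> 
> #include <stdio.h>
> 
> /* Read the LLC id of a CPU from sysfs; returns -1 on failure. */
> static int
> cpu_llc_id(unsigned int cpu)
> {
> 	char path[128];
> 	FILE *f;
> 	int id;
> 
> 	snprintf(path, sizeof(path),
> 		"/sys/devices/system/cpu/cpu%u/cache/index3/id", cpu);
> 	f = fopen(path, "r");
> 	if (f == NULL)
> 		return -1;
> 	if (fscanf(f, "%d", &id) != 1)
> 		id = -1;
> 	fclose(f);
> 	return id;
> }
> 
> Cores that report the same id share an LLC, so the 24 cores above
> would fall into three buckets, one per tile.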
> 
> Indeed, this is what Vipin is enabling: given a core, he is finding the
> list of cores that will work efficiently with it. In this perspective
> it is nothing really related to NUMA configuration, and nothing really
> specific to AMD, as the defined Linux sysfs interface is used for this.
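> 
> Concretely, the lookup is nothing more than reading the shared-CPU
> list that the kernel exports; a minimal sketch (Linux only; reading
> index2 instead of index3 would give the L2 siblings instead):
> 
> #include <stdio.h>
> 
> /* Fill 'buf' with the list of CPUs sharing the LLC with 'cpu',
>  * e.g. "0-7"; returns 0 on success, -1 on failure. */
> static int
> llc_sibling_list(unsigned int cpu, char *buf, size_t len)
> {
> 	char path[160];
> 	FILE *f;
> 	int ret = -1;
> 
> 	snprintf(path, sizeof(path),
> 		"/sys/devices/system/cpu/cpu%u/cache/index3/shared_cpu_list",
> 		cpu);
> 	f = fopen(path, "r");
> 	if (f == NULL)
> 		return -1;
> 	if (fgets(buf, len, f) != NULL)
> 		ret = 0;
> 	fclose(f);
> 	return ret;
> }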
> 
> There are other architectures around that have a similar NUMA
> configuration, and they can also use the same logic; at worst we can
> introduce architecture-specific code so that all architectures have a
> way to find the other cores that work most efficiently with a given
> core. This is a useful feature for DPDK.
> 
> Let's look into another example: an application uses 24 cores in a
> graph-library-like usage, where we want to group each set of three
> cores to process a graph node. The application needs a way to select
> which three cores work most efficiently with each other, and that is
> what this patch enables. In this case enabling "L3 as NUMA" does not
> help at all. With this patch both BIOS configurations work, but of
> course the user should select the cores to provide to the application
> based on the configuration.
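> 
> To sketch that grouping (illustrative only, reusing the hypothetical
> cpu_llc_id() helper from above, and assuming 'cores' is sorted so
> that LLC siblings are adjacent):
> 
> #include <stdio.h>
> 
> #define GROUP_SIZE 3
> 
> /* Hand out consecutive triples of cores, one triple per graph
>  * node, so each triple stays within one LLC. */
> static void
> assign_graph_nodes(const unsigned int *cores, unsigned int n_cores)
> {
> 	unsigned int i;
> 
> 	for (i = 0; i + GROUP_SIZE <= n_cores; i += GROUP_SIZE)
> 		printf("node %u -> cores %u,%u,%u (llc %d)\n",
> 			i / GROUP_SIZE, cores[i], cores[i + 1],
> 			cores[i + 2], cpu_llc_id(cores[i]));
> }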
> 
> 
> And we can even improve this effective core selection; as Mattias
> suggested, we can select cores that share L2 caches with an expansion
> of this patch. This is unrelated to NUMA, and again it does not
> introduce architecture details to DPDK, as the implementation already
> relies on the Linux sysfs interface.
> 
> I hope this clarifies things a little more.
> 
> 
> Thanks,
> ferruh
> 

Yes, this does help clarify a lot why the current NUMA support would be 
insufficient to express what you are describing.

However, in that case I would echo the sentiment others have expressed 
already: this kind of deep sysfs parsing doesn't seem like it would be 
in scope for EAL; it sounds more like something a sysadmin/orchestration 
layer (or the application itself) would do.

I mean, in principle I'm not opposed to having such an API, it just 
seems like the abstraction would perhaps need to be a bit more robust 
than directly referencing the cache structure. Maybe something that 
degenerates into NUMA nodes would be better, so that applications 
wouldn't have to *specifically* worry about cache locality but instead 
would have a more generic API they can use to group cores together?
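
For instance (purely hypothetical naming, just to illustrate the shape 
I have in mind), something along these lines:

/* Group lcores by a topology domain rather than by cache structure
 * directly; on systems that expose no cache topology, the LLC and L2
 * domains could simply degenerate into NUMA nodes. */
enum rte_lcore_domain {
	RTE_LCORE_DOMAIN_NUMA,	/* group by NUMA node */
	RTE_LCORE_DOMAIN_LLC,	/* group by last-level cache */
	RTE_LCORE_DOMAIN_L2,	/* group by L2 cache */
};

/* Return the id of the domain containing 'lcore_id', or -1. */
int rte_lcore_domain_id(unsigned int lcore_id,
		enum rte_lcore_domain domain);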

-- 
Thanks,
Anatoly


