[RFC 0/2] introduce LLC aware functions
Burakov, Anatoly
anatoly.burakov at intel.com
Thu Sep 5 16:45:10 CEST 2024
On 9/5/2024 3:05 PM, Ferruh Yigit wrote:
> On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
>> On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
>>> <snipped>
>>>>>
Hi Ferruh,
>>
>> I feel like there's a disconnect between my understanding of the problem
>> space, and yours, so I'm going to ask a very basic question:
>>
>> Assuming the user has configured their AMD system correctly (i.e.
>> enabled L3 as NUMA), are there any problems to be solved by adding a new
>> API? Does the system not report each L3 as a separate NUMA node?
>>
>
> Hi Anatoly,
>
> Let me try to answer.
>
> To start with, Intel "Sub-NUMA Clustering" and AMD NUMA are different; as
> far as I understand, SNC is more similar to the classic physical-socket-based
> NUMA.
>
> Following is the AMD CPU:
> ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
> │ ││ ││ ││ ││ │
> │ ││ ││ ││ ││ │
> │TILE1││TILE2││ ││TILE5││TILE6│
> │ ││ ││ ││ ││ │
> │ ││ ││ ││ ││ │
> │ ││ ││ ││ ││ │
> └─────┘└─────┘│ IO │└─────┘└─────┘
> ┌─────┐┌─────┐│ TILE │┌─────┐┌─────┐
> │ ││ ││ ││ ││ │
> │ ││ ││ ││ ││ │
> │TILE3││TILE4││ ││TILE7││TILE8│
> │ ││ ││ ││ ││ │
> │ ││ ││ ││ ││ │
> │ ││ ││ ││ ││ │
> └─────┘└─────┘└──────────┘└─────┘└─────┘
>
> Each 'Tile' has multiple cores, and the 'IO Tile' has the memory controller,
> bus controllers, etc.
>
> When NPS=x is configured in the BIOS, the IO tile resources are split and
> each part is seen as a NUMA node.
>
> Following is NPS=4
> ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
> │ ││ ││ . ││ ││ │
> │ ││ ││ . ││ ││ │
> │TILE1││TILE2││ . ││TILE5││TILE6│
> │ ││ ││NUMA .NUMA││ ││ │
> │ ││ ││ 0 . 1 ││ ││ │
> │ ││ ││ . ││ ││ │
> └─────┘└─────┘│ . │└─────┘└─────┘
> ┌─────┐┌─────┐│..........│┌─────┐┌─────┐
> │ ││ ││ . ││ ││ │
> │ ││ ││NUMA .NUMA││ ││ │
> │TILE3││TILE4││ 2 . 3 ││TILE7││TILE8│
> │ ││ ││ . ││ ││ │
> │ ││ ││ . ││ ││ │
> │ ││ ││ . ││ ││ │
> └─────┘└─────┘└─────.────┘└─────┘└─────┘
>
> The benefit of this approach is that all cores can access all NUMA nodes
> without any penalty. For example, a DPDK application can use cores from
> 'TILE1', 'TILE4' & 'TILE7' to access NUMA0 (or any NUMA node) resources at
> high performance.
> This is different from SNC, where cores accessing cross-NUMA resources take
> a performance penalty.
>
> Now, although which tile the cores come from doesn't matter from a NUMA
> perspective, it may matter (based on the workload) to have them under the
> same LLC.
>
> One way to make sure all cores are under the same LLC is to enable the "L3
> as NUMA" BIOS option, which makes each TILE show up as a different NUMA
> node, and the user selects cores from one NUMA node.
> This is sufficient up to a point, but not enough when the application needs
> a number of cores that spans multiple tiles.
>
> Assume each tile has 8 cores and the application needs 24 cores. When the
> user provides all cores from TILE1, TILE2 & TILE3, in DPDK right now there
> is no way for the application to figure out how to group/select these cores
> in order to use them efficiently.
>
> Indeed this is what Vipin is enabling: from a given core, he is finding the
> list of cores that will work efficiently with that core. In this perspective
> it is not really related to NUMA configuration, and not really specific to
> AMD, as the standard Linux sysfs interface is used for this.
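
The sysfs interface in question is the per-CPU cache topology. A minimal
sketch of reading which CPUs share the LLC with a given CPU (assuming Linux,
and assuming index3 corresponds to the L3/LLC, which is typical but not
guaranteed; a robust implementation would check the 'level' file, and index2
would give the L2 domain instead):

	#include <stdio.h>
	#include <string.h>

	/* Read the raw "shared_cpu_list" string for the L3 cache of 'cpu',
	 * e.g. "0-7,96-103". Returns 0 on success, -1 on failure. */
	static int
	read_llc_siblings(unsigned int cpu, char *buf, size_t len)
	{
		char path[128];
		FILE *f;

		snprintf(path, sizeof(path),
			"/sys/devices/system/cpu/cpu%u/cache/index3/shared_cpu_list",
			cpu);
		f = fopen(path, "r");
		if (f == NULL)
			return -1;
		if (fgets(buf, (int)len, f) == NULL) {
			fclose(f);
			return -1;
		}
		buf[strcspn(buf, "\n")] = '\0';
		fclose(f);
		return 0;
	}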
>
> There are other architectures around that have a similar NUMA configuration
> and they can also use the same logic; at worst we can introduce
> architecture-specific code so that all architectures have a way to find the
> other cores that work most efficiently with a given core. This is a useful
> feature for DPDK.
>
> Let's look at another example: an application uses 24 cores in a
> graph-library-like usage, where we want to group every three cores to
> process a graph node. The application needs a way to select which three
> cores work most efficiently with each other, and that is what this patch
> enables. In this case enabling "L3 as NUMA" does not help at all. With this
> patch both BIOS configurations work, but of course the user should select
> the cores to provide to the application based on the configuration.
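
To illustrate the kind of grouping described above, here is a rough sketch.
get_llc_id() is a hypothetical stand-in, not part of the RFC as posted, for
whatever lookup the patch ends up providing; the point is only that the
application groups lcores by a shared-LLC key and then carves out triples:

	#include <limits.h>
	#include <string.h>
	#include <rte_lcore.h>

	#define CORES_PER_NODE 3

	/* Hypothetical lookup: returns an id that is equal for lcores sharing
	 * the same LLC, e.g. derived from the sysfs shared_cpu_list above. */
	extern unsigned int get_llc_id(unsigned int lcore);

	/* Form triples of worker lcores that share an LLC. Assumes lcores on
	 * the same LLC are enumerated contiguously, which is usually, but not
	 * always, the case. Returns the number of complete triples formed. */
	static unsigned int
	form_llc_groups(unsigned int (*groups)[CORES_PER_NODE],
			unsigned int max_groups)
	{
		unsigned int pending[CORES_PER_NODE];
		unsigned int lcore, npending = 0, ngroups = 0;
		unsigned int cur_llc = UINT_MAX;

		RTE_LCORE_FOREACH_WORKER(lcore) {
			unsigned int llc = get_llc_id(lcore);

			if (llc != cur_llc) {
				cur_llc = llc;
				npending = 0; /* drop incomplete group on LLC change */
			}
			pending[npending++] = lcore;
			if (npending == CORES_PER_NODE) {
				if (ngroups < max_groups)
					memcpy(groups[ngroups++], pending,
						sizeof(pending));
				npending = 0;
			}
		}
		return ngroups;
	}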
>
>
> And we can even improve this efficient core selection; for example, as
> Mattias suggested, we can select cores that share L2 caches, with an
> expansion of this patch. This is unrelated to NUMA, and again it does not
> introduce architecture details into DPDK, as the implementation already
> relies on the Linux sysfs interface.
>
> I hope it clarifies a little more.
>
>
> Thanks,
> ferruh
>
Yes, this does help clarify things a lot as to why current NUMA support
would be insufficient to express what you are describing.
However, in that case I would echo the sentiment others have expressed
already: this kind of deep sysfs parsing doesn't seem like it would be
in scope for EAL; it sounds more like something a sysadmin/orchestration
layer (or the application itself) would do.
I mean, in principle I'm not opposed to having such an API, it just
seems like the abstraction would perhaps need to be a bit more robust
than directly referencing the cache structure. Maybe something that
degenerates into NUMA nodes would be better, so that applications
wouldn't have to *specifically* worry about cache locality but instead
would have a more generic API they can use to group cores together?
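
To make that concrete, the rough shape I have in mind would be something like
the following. The names are purely illustrative and not a proposal for the
actual API; the fallback body only shows the "degenerates to NUMA" behaviour
using the existing EAL API, while finer levels would consult cache topology:

	#include <rte_lcore.h>

	enum core_domain_level {
		CORE_DOMAIN_L2,   /* cores sharing an L2 cache   */
		CORE_DOMAIN_LLC,  /* cores sharing the LLC (L3)  */
		CORE_DOMAIN_NUMA, /* cores on the same NUMA node */
	};

	/* Return up to 'max' worker lcores in the same domain as 'lcore'.
	 * In this fallback sketch every level degenerates to "same NUMA
	 * node"; finer levels would consult cache topology where available. */
	static unsigned int
	core_domain_members(unsigned int lcore, enum core_domain_level level,
			unsigned int *members, unsigned int max)
	{
		unsigned int node = rte_lcore_to_socket_id(lcore);
		unsigned int id, n = 0;

		(void)level;

		RTE_LCORE_FOREACH_WORKER(id) {
			if (rte_lcore_to_socket_id(id) == node && n < max)
				members[n++] = id;
		}
		return n;
	}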
--
Thanks,
Anatoly