<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
&lt;snipped&gt;<br>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<blockquote type="cite">
<br>
<blockquote type="cite">I recently looked into how Intel's
Sub-NUMA Clustering would work within
<br>
DPDK, and found that I actually didn't have to do anything,
because the
<br>
SNC "clusters" present themselves as NUMA nodes, which DPDK
already
<br>
supports natively.
<br>
</blockquote>
<br>
yes, this is correct. In the Intel Xeon Platinum BIOS one can set
<br>
`Cluster per NUMA` to `1, 2 or 4`.
<br>
<br>
This divides the tiles into Sub-NUMA partitions, each having separate
<br>
lcores, memory controllers, PCIe
<br>
<br>
and accelerators.
<br>
<br>
<blockquote type="cite">
<br>
Does AMD's implementation of chiplets not report themselves as
separate
<br>
NUMA nodes?
<br>
</blockquote>
<br>
In the AMD EPYC SoC, this is different. There are 2 BIOS settings,
namely
<br>
<br>
1. NPS: `NUMA Per Socket`, which allows the IO tile (memory, PCIe and
<br>
accelerator) to be partitioned as NUMA 0, 1, 2 or 4.
<br>
<br>
2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
<br>
all CPU tiles to be independent NUMA nodes.
<br>
<br>
<br>
The above settings are possible because the CPU tiles are independent
from the IO tile,
<br>
thus allowing 4 combinations to be available for use.
<br>
</blockquote>
<br>
Sure, but presumably if the user wants to distinguish this, they
have to
<br>
configure their system appropriately. If the user wants to take
advantage of
<br>
L3 as NUMA (which is what your patch proposes), then they can
enable the
<br>
BIOS knob and get that functionality for free. DPDK already
supports this.
<br>
<br>
</blockquote>
<p>The intent of the RFC is to introduce the ability to select lcores
within the same</p>
<p>L3 cache whether or not the BIOS option `L3 as NUMA` is set. This
is achieved</p>
<p>and tested on platforms where the OS kernel advertises the topology
via sysfs, thus eliminating</p>
<p>the dependency on hwloc and libnuma, which can be at different
versions in different distros. <br>
</p>
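<p>As a rough sketch of the kind of sysfs-based L3 discovery described above (the paths and helper names here are illustrative, not the RFC's actual code): on Linux, each CPU exposes the set of CPUs sharing its L3 cache via a <code>shared_cpu_list</code> file.</p>

```python
def parse_cpu_list(cpulist):
    """Parse a kernel cpulist string such as "0-7,128-135" into CPU IDs."""
    cpus = []
    for part in cpulist.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def l3_siblings(cpu):
    """CPUs sharing an L3 cache with `cpu`, per Linux sysfs.

    Assumes cache/index3 is the L3 cache, which holds on common x86
    systems but is not guaranteed on every platform.
    """
    path = f"/sys/devices/system/cpu/cpu{cpu}/cache/index3/shared_cpu_list"
    with open(path) as f:
        return parse_cpu_list(f.read())
```

<p>Parsing this per lcore is enough to group lcores by L3 domain from kernel-provided data alone, without hwloc or libnuma.</p>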
<br>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<blockquote type="cite">
<br>
These are covered in the tuning guide for the SoC: "12. How to get best
<br>
performance on AMD platform" in the Data Plane Development Kit 24.07.0
<br>
documentation (dpdk.org)
<br>
<a class="moz-txt-link-rfc2396E" href="https://doc.dpdk.org/guides/linux_gsg/amd_platform.html"><https://doc.dpdk.org/guides/linux_gsg/amd_platform.html></a>.
<br>
<br>
<br>
<blockquote type="cite">Because if it does, I don't really think
any changes are
<br>
required because NUMA nodes would give you the same thing,
would it not?
<br>
</blockquote>
<br>
I have a different opinion on this outlook. An end user can
<br>
<br>
1. Identify the lcores and their NUMA node using
`usertools/cpu_layout.py`
<br>
</blockquote>
<br>
I recently submitted an enhancement for the CPU layout script to print
out
<br>
NUMA separately from physical socket [1].
<br>
<br>
[1]
<br>
<a class="moz-txt-link-freetext" href="https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/">https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/</a>
<br>
<br>
I believe when "L3 as NUMA" is enabled in BIOS, the script will
display
<br>
both physical package ID as well as NUMA nodes reported by the
system,
<br>
which will be different from physical package ID, and which will
display
<br>
information you were looking for.
<br>
</blockquote>
<p>At AMD, we had submitted earlier work on the same via <a href="https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/">usertools:
enhance logic to display NUMA - Patchwork (dpdk.org)</a>.</p>
<p>This clearly distinguished NUMA and physical socket.<br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<br>
<blockquote type="cite">
<br>
2. But it is the core mask in the EAL arguments which makes the threads
<br>
available to be used in a process.
<br>
</blockquote>
<br>
See above: if the OS already reports NUMA information, this is not
a
<br>
problem to be solved; the CPU layout script can give this information
to the
<br>
user.
<br>
</blockquote>
<p>Agreed, but as pointed out, in the case of the Intel Xeon Platinum
SPR the tile consists of CPU, memory, PCIe and accelerator.</p>
<p>Hence, with the BIOS option `Cluster per NUMA` set, the OS kernel
&amp; libnuma display the appropriate domain with memory, PCIe and
CPU.</p>
<p><br>
</p>
<p>In the case of the AMD SoC, the libnuma view of the CPU is different
from the memory NUMA per socket.<br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<br>
<blockquote type="cite">
<br>
3. There is no API which distinguishes the L3 NUMA domain. The function
<br>
`rte_socket_id
<br>
<a class="moz-txt-link-rfc2396E" href="https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df"><https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df></a>`
for CPU tiles, as in the AMD SoC, will return the physical socket.
<br>
</blockquote>
<br>
Sure, but I would think the answer to that would be to introduce
an API
<br>
to distinguish between NUMA (socket ID in DPDK parlance) and
package
<br>
(physical socket ID in the "traditional NUMA" sense). Once we can
<br>
distinguish between those, DPDK can just rely on NUMA information
<br>
provided by the OS, while still being capable of identifying
physical
<br>
sockets if the user so desires.
<br>
</blockquote>
Agreed, +1 for the idea of the physical socket and changes in the
library to exploit the same.<br>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<br>
I am actually going to introduce API to get *physical socket* (as
<br>
opposed to NUMA node) in the next few days.
<br>
<br>
</blockquote>
<p>But how does it solve the end customer issues?</p>
<p>1. If there are multiple NICs or accelerators on multiple sockets,
but the IO tile is partitioned into sub-domains.</p>
<p>2. If RTE_FLOW steering is applied on a NIC which needs to be
processed under the same L3 - reduces noisy neighbours and gives better
cache hits. <br>
</p>
<p>3. For the PKT-distribute library, which needs to run within the
same worker lcore set as RX-Distributor-TX.</p>
<p><br>
</p>
<p>The current RFC addresses the above by helping end
users identify the lcores within the same L3 domain under a
NUMA/physical socket irrespective of the BIOS setting. <br>
</p>
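<p>For illustration, a minimal Python sketch (outside DPDK; the lcore-to-L3 map is synthetic and the helper name is hypothetical, not the RFC's proposed API) of the selection this enables - picking worker lcores that share an L3 domain with the RX lcore:</p>

```python
def lcores_in_same_l3(lcore, l3_of):
    """All lcores sharing an L3 domain with `lcore`, itself excluded.

    `l3_of` maps lcore id -> L3 domain id; in practice this mapping
    would be derived from sysfs cache topology rather than hand-built.
    """
    dom = l3_of[lcore]
    return sorted(c for c, d in l3_of.items() if d == dom and c != lcore)


# Synthetic topology: 8 lcores, two L3 domains of 4 lcores each.
l3_of = {c: c // 4 for c in range(8)}
rx_lcore = 1
workers = lcores_in_same_l3(rx_lcore, l3_of)  # workers share L3 with RX
```

<p>Pinning the RX, worker and TX stages to lcores from one such group keeps the packet-processing pipeline within a single L3 domain.</p>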
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<blockquote type="cite">
<br>
<br>
Example: In AMD EPYC Genoa, there are a total of 13 tiles: 12 CPU
tiles
<br>
and 1 IO tile. Setting
<br>
<br>
1. NPS to 4 will divide the memory, PCIe and accelerator into 4
domains,
<br>
while all CPUs will appear as a single NUMA node, with each of the 12
tiles
<br>
having an independent L3 cache.
<br>
<br>
2. Setting `L3 as NUMA` allows each tile to appear as a separate
L3 cluster.
<br>
<br>
<br>
Hence, adding an API which allows selecting available lcores
based on
<br>
Split L3 is essential irrespective of the BIOS setting.
<br>
<br>
</blockquote>
<br>
I think the crucial issue here is the "irrespective of BIOS
setting"
<br>
bit.</blockquote>
<p>That is what the current RFC achieves.<br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com"> If EAL
is getting into the game of figuring out exact intricacies
<br>
of physical layout of the system, then there's a lot more work to
be
<br>
done as there are lots of different topologies, as other people
have
<br>
already commented, and such an API needs *a lot* of thought put
into it.
<br>
</blockquote>
<p>There are standard sysfs interfaces for CPU cache topology (OS
kernel), as mentioned earlier.</p>
<p>The problem with hwloc and libnuma is that different distros ship
different versions. There are solutions for</p>
<p>specific SoC architectures, as per the latest comment.</p>
<p><br>
</p>
<p>But we can always limit the API to selected SoCs, while for all
other SoCs it will, when invoked, fall back to rte_get_next_lcore.<br>
</p>
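<p>A hedged sketch of that fallback behaviour (in Python for brevity; the name and signature are illustrative, and the real API would live in the EAL): prefer an lcore in the same L3 domain when topology is known, otherwise degrade to plain next-lcore iteration as with rte_get_next_lcore:</p>

```python
def next_lcore_same_l3(prev, lcores, l3_of=None):
    """Next lcore after `prev`, preferring the same L3 domain.

    With no L3 topology (`l3_of` is None), this degrades to plain
    next-lcore iteration, approximating what rte_get_next_lcore()
    gives on SoCs without a split L3. Returns -1 when exhausted.
    """
    candidates = sorted(c for c in lcores if c > prev)
    if l3_of is not None:
        same = [c for c in candidates if l3_of[c] == l3_of[prev]]
        if same:
            return same[0]
    return candidates[0] if candidates else -1
```

<p>Applications that do not care about split L3 see unchanged iteration order; only L3-aware callers that supply topology get the restricted selection.</p>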
<p><br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<br>
If, on the other hand, we leave this issue to the kernel, and only
<br>
gather NUMA information provided by the kernel, then nothing has
to be
<br>
done - DPDK already supports all of this natively, provided the
user has
<br>
configured the system correctly.
<br>
</blockquote>
<p>As shared above, we tried to bring this in via <a href="https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/">usertools:
enhance logic to display NUMA - Patchwork (dpdk.org)</a>. <br>
</p>
<p>DPDK support for lcores is being enhanced, allowing the user to use
the more favourable lcores within the same tile.<br>
</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<br>
Moreover, arguably DPDK already works that way: technically you
can get
<br>
physical socket information even absent of NUMA support in BIOS,
but
<br>
DPDK does not do that. Instead, if OS reports NUMA node as 0,
that's
<br>
what we're going with (even if we could detect multiple sockets
from
<br>
sysfs), </blockquote>
<p>In the above argument, it is stated that the OS kernel detects the
NUMA domains, which is what DPDK uses, right?</p>
<p>The suggested RFC also adheres to the same: what the OS sees. Can
you please explain, for better understanding,</p>
<p>what the RFC is doing differently?</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">and IMO
it should stay that way unless there is a strong
<br>
argument otherwise.</blockquote>
<p>Totally agree; that is what the RFC is also doing: based on what the
OS sees as NUMA, we are using it.</p>
<p>The only addition is, within the NUMA node, if there are split LLCs,
to allow selection of those lcores rather than blindly choosing lcores
using</p>
<p>rte_get_next_lcore.<br>
</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com"> We
force the user to configure their system
<br>
correctly as it is, and I see no reason to second-guess user's
BIOS
<br>
configuration otherwise.
<br>
</blockquote>
<p>Again, to reiterate: the changes suggested in the RFC are agnostic
to which BIOS options are used.</p>
<p>In answer to the earlier question `is the AMD configuration the same
as the Intel tile`, I have explained that it does not depend on the
BIOS setting.<br>
</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com">
<br>
--
<br>
Thanks,
<br>
Anatoly
<br>
<br>
</blockquote>
</body>
</html>