<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta name="Generator" content="Microsoft Exchange Server">  <style></style> </head> <body> <font face="Calibri" size="2"><span style="font-size:10pt;"> <div style="padding-right:5pt;padding-left:5pt;"><font color="green">[Public]<br> </font></div> <div style="margin-top:5pt;"><font face="Times New Roman" size="3"><span style="font-size:12pt;"><br> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"><snipped></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > <snipped></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> <snipped></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> Thank you Mattias for the comments and question; please let me</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> try to explain below.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> Based on the intention to bring in CPU lcores which share the same</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> L3 (for better cache hits and fewer noisy neighbors), the current API</span></font></div> <div><font 
face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> focuses on using</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> Last Level Cache. But if the suggestion is `there are SoCs where</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> L2 caches are also shared, and the new API should be</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> provisioned`, I am also</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>> comfortable with the thought.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >>></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >> Rather than some AMD special case API hacked into <rte_lcore.h>,</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > >> I think we are better off with no DPDK API at all for this kind of</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> functionality.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > Hi Mattias, as shared in the earlier email thread, this is not an</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > AMD special</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > case at all. Let me try to explain this one more time. 
One of</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > the techniques used to increase core counts cost-effectively is to use tiles of</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> compute complexes.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > This results in groups of cores sharing the same Last Level Cache</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > (namely</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > L2, L3 or even L4), depending upon the cache topology architecture.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > The API suggested in the RFC is to help end users selectively use</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > > cores under</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > the same Last Level Cache hierarchy as advertised by the OS (irrespective of</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > the BIOS settings used). 
This is useful in both bare-metal and container</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> environments.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > I'm pretty familiar with AMD CPUs and the use of tiles (including</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > the challenges these kinds of non-uniformities pose for work scheduling).</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > To maximize performance, caring about core<->LLC relationship may</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > well not be enough, and more HT/core/cache/memory topology</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > information is required. That's what I meant by special case. A</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > proper API should allow access to information about which lcores are</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > SMT siblings, cores on the same L2, and cores on the same L3, to</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > name a few things. 
Probably you want to fit NUMA into the same API</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > > as well, although that is available already in <rte_lcore.h>.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > Thank you Mattias for the information. As shared in the reply to</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which takes an</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> extra argument `u32 flags`.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> RTE_GET_LCORE_BOOST_DISABLED.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> ></span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer) be a</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> clearer name than just adding "ex" on to the end of the existing function?</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">Thank you Bruce, please find my answers below.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">The functions shared in the RFC were:</span></font></div> <div><font 
face="Calibri" size="2"><span style="font-size:11pt;">```</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> - rte_get_llc_first_lcores: Retrieves the first lcore of each shared LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> - rte_get_llc_lcore: Retrieves all lcores that share the LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> - rte_get_llc_n_lcore: Retrieves the first n lcores, or skips the first n lcores, in the shared LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">```</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">The macros extending their usability were:</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">```</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of each LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first worker lcore of each LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">RTE_LCORE_FOREACH_LLC_WORKER: iterates through the lcores of the LLC selected by a hint (lcore id).</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates through the lcores of an LLC, skipping the first worker.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through the first `n` lcores of each LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then iterates through the remaining lcores in each 
LLC.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">```</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">Based on the discussions, we agreed to share a version-2 RFC that extends the API as `rte_get_next_lcore_extnd` with an extra `flags` argument.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">As I understand it, the proposed `rte_get_next_sibling_core` can easily be achieved with this API using the flag `RTE_GET_LCORE_L1` (SMT). Is this the right understanding?</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">We can also easily provide simple macros like `RTE_LCORE_FOREACH_L1`, which iterate over SMT sibling threads.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> Looking logically, I'm not sure about the BOOST_ENABLED and</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> BOOST_DISABLED flags you propose</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">The idea behind BOOST_ENABLED and BOOST_DISABLED comes from the DPDK power library, which allows boost to be enabled.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">The flags would let the user select lcores on which boost is enabled or disabled, using a macro or the API.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> - in a system with multiple possible</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> standard and boost frequencies what would those correspond 
to?</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">I now understand the confusion; apologies for mixing up AMD EPYC SoC boost with Intel Turbo.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">Thank you for pointing this out; we will use the terminology `RTE_GET_LCORE_TURBO`.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> What's also</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> missing is a define for getting actual NUMA siblings i.e. those sharing common</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> memory but not an L3 or anything else.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">This can be covered by `rte_get_next_lcore_extnd` with a flag `RTE_GET_LCORE_NUMA`. 
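</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">To make the flags idea concrete, here is a minimal, self-contained C sketch of flag-based sibling iteration. It is hypothetical throughout: the LCORE_SHARE_* flags, get_next_lcore_extnd() and LCORE_FOREACH_SIBLING() only mimic the shape of the proposed rte_get_next_lcore_extnd() and RTE_GET_LCORE_* flags and are not DPDK API. The topology is a hard-coded mock (2 SMT threads per core, 2 cores per L2, 2 L2 domains per L3, 2 L3 domains per NUMA node) standing in for what the OS would report, and the reference lcore is passed explicitly rather than derived from the lcore id hint.</span></font></div>
<pre>
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags modelled on the proposed RTE_GET_LCORE_* values;
 * the names and bit values are illustrative, not DPDK definitions. */
#define LCORE_SHARE_L1   (1u << 0) /* SMT siblings: same physical core */
#define LCORE_SHARE_L2   (1u << 1) /* same L2 cache */
#define LCORE_SHARE_L3   (1u << 2) /* same L3 cache */
#define LCORE_SHARE_NUMA (1u << 3) /* same NUMA memory node */

/* Number of lcores in the mock; also the "not found" value,
 * in the way RTE_MAX_LCORE is for rte_get_next_lcore(). */
#define N_LCORES 16u

/* Mock topology standing in for what the OS reports (e.g. sysfs on Linux):
 * 2 SMT threads per core, 2 cores per L2, 2 L2 per L3, 2 L3 per NUMA node. */
static unsigned int domain_id(unsigned int lcore, uint32_t level)
{
	switch (level) {
	case LCORE_SHARE_L1:
		return lcore / 2;
	case LCORE_SHARE_L2:
		return lcore / 4;
	case LCORE_SHARE_L3:
		return lcore / 8;
	case LCORE_SHARE_NUMA:
		return lcore / 16;
	default:
		return 0;
	}
}

/* Returns 1 if lcores a and b share every level requested in flags. */
static int shares_levels(unsigned int a, unsigned int b, uint32_t flags)
{
	for (uint32_t bit = LCORE_SHARE_L1; bit <= LCORE_SHARE_NUMA; bit <<= 1)
		if ((flags & bit) && domain_id(a, bit) != domain_id(b, bit))
			return 0;
	return 1;
}

/* Hypothetical analogue of rte_get_next_lcore() with an extra flags
 * argument: returns the next lcore after i (pass (unsigned int)-1 to
 * start) that shares the requested levels with ref, skipping ref itself,
 * or N_LCORES when there are no more. */
static unsigned int get_next_lcore_extnd(unsigned int i, unsigned int ref,
		uint32_t flags)
{
	assert(ref < N_LCORES);
	for (i++; i < N_LCORES; i++)
		if (i != ref && shares_levels(i, ref, flags))
			return i;
	return N_LCORES;
}

/* Iterates over the siblings of ref, in the style of RTE_LCORE_FOREACH. */
#define LCORE_FOREACH_SIBLING(i, ref, flags) \
	for ((i) = get_next_lcore_extnd((unsigned int)-1, (ref), (flags)); \
	     (i) < N_LCORES; \
	     (i) = get_next_lcore_extnd((i), (ref), (flags)))

/* Example use: counts the siblings of ref at the requested level(s). */
static unsigned int count_siblings(unsigned int ref, uint32_t flags)
{
	unsigned int i, n = 0;

	LCORE_FOREACH_SIBLING(i, ref, flags)
		n++;
	return n;
}
```
</pre>
<div><font face="Calibri" size="2"><span style="font-size:11pt;">With this mock, iterating the LCORE_SHARE_L2 siblings of lcore 5 visits lcores 4, 6 and 7. Combining flags returns only the lcores that share every requested level, and N_LCORES plays the role that RTE_MAX_LCORE plays as the end-of-iteration value of rte_get_next_lcore().</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">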
`RTE_GET_LCORE_NUMA` would return all lcores in the same (sub-)NUMA memory domain as the given lcore.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">If SMT is enabled and the DPDK lcore mask covers the sibling threads, then `RTE_GET_LCORE_NUMA` returns all lcores, including SMT sibling threads, in the same NUMA memory domain as the given lcore.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> My suggestion would be to have the function take just an integer-type e.g.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> uint16_t parameter which defines the memory/cache hierarchy level to use, 0</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> being lowest, 1 next, and so on. Different systems may have different numbers</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> of cache levels so lets just make it a zero-based index of levels, rather than</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> giving explicit defines (except for memory which should probably always be</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> last). The zero-level will be for "closest neighbour"</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">Good idea; we did prototype this internally. 
However, the issue is that this would keep adding to the number of APIs in the lcore library.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">To keep the API count low, we are using the lcore id as the hint for the sub-NUMA domain.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> whatever that happens to be, with as many levels as is necessary to express</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> the topology, e.g. without SMT, but with 3 cache levels, level 0 would be an L2</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> neighbour, level 1 an L3 neighbour. If the L3 was split within a memory NUMA</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> node, then level 2 would give the NUMA siblings. We'd just need an API to</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> return the max number of levels along with the iterator.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">We are using the lcore's NUMA node as the hint.</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> </span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> Regards,</span></font></div> <div><font face="Calibri" size="2"><span style="font-size:11pt;">> /Bruce</span></font></div> </span></font> </body> </html>