[PATCH v9 1/7] eal: add static per-lcore memory allocation facility
Thomas Monjalon
thomas at monjalon.net
Fri Oct 11 11:11:50 CEST 2024
11/10/2024 10:04, Mattias Rönnblom:
> On 2024-10-10 23:24, Thomas Monjalon wrote:
> > Hello,
> >
> > This new feature looks to bring something interesting to DPDK.
> > There was a good amount of discussion and review,
> > and there is a real effort of documentation.
> >
> > However, some choices done in this implementation
> > were not explained or advertised enough in the documentation,
> > in my opinion.
> >
>
> Are those of relevance to the API user?
I think it helps to understand when we should use this API.
Such design explanation may come in the prog guide RST file.
> > I think the first thing to add is an explanation of the memory layout.
> > Maybe that a SVG drawing would help to show how it is stored.
>
> That would be helpful to someone wanting to understand the internals.
> But where should that go? If it's put in the API, it will also obscure
> the *actual* API documentation.
Of course not in API doc.
I'm talking about moving a lot of explanations into the prog guide,
and adding a bit more about the layout.
> I have some drawings already, and I agree they are very helpful - both
> in explaining how things work, and in making obvious why the memory layout
> resulting from the use of lcore variables is superior to that of the
> lcore id-indexed static array approach.
Cool, please add some in the prog guide.
> > We also need to explain why it is not using rte_malloc.
> >
> > Also please could you re-read the doc and comments in detail?
> > I think some words are missing and there are typos.
> > While at it, please allow for easy update of the text
> > by starting each sentence on a new line.
> > Breaking lines logically is better for future patches.
> > One more advice: avoid very long sentences.
>
> I've gone through the documentation and will post a new patch set.
OK thanks.
> There's been a lot of comments and discussion on this patch set. Did you
> have anything in particular in mind?
Nothing more than what I raised in this review.
> > Do you have benchmarks results of the modules using such variables
> > (power, random, service)?
> > It would be interesting to compare time efficiency and memory usage
> > before/after, with different number of threads.
> >
>
> I have the dummy modules of test_lcore_var_perf.c, which show the
> performance benefits as the number of modules using lcore variables
> increases.
>
> That said, the gains are hard to quantify with micro benchmarks, and for
> real-world performance, one really has to start using the facility at
> scale before anything interesting may happen.
>
> Keep in mind, however, that while this is new to DPDK, similar facilities
> already exist in your favorite UN*X kernel. The implementation is
> different, but I think it's accurate to say the goal and the effects
> should be the same.
>
> One can also run the perf autotest for RTE random, but such tests only
> show that lcore variables don't make things significantly worse when the
> L1 cache is essentially unused. (In fact, the lcore variable-enabled
> rte_random.c somewhat counter-intuitively generates a 64-bit number 1
> TSC cycle faster than the old version on my system.)
>
> Just to be clear: it's the footprint in the core-private caches we are
> attempting to reduce.
OK
> > 10/10/2024 16:21, Mattias Rönnblom:
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >
> > I find it difficult to read "lcore id-equipped thread".
> > Can we just say "DPDK thread"?
>
> Sure, if you point me to a definition of what a DPDK thread is.
>
> I can think of at least four potential definitions
> * An EAL thread
> * An EAL thread or a registered non-EAL thread
> * Any thread calling into DPDK APIs
> * Any thread living in a DPDK process
OK I understand your point.
If we move the design explanations in the prog guide,
we can explain this point in the introduction of the chapter.
> > [...]
> >> +An application, a DPDK library or PMD may keep opt to keep per-thread
> >> +state.
> >
> > I don't understand this sentence.
>
> Which part is unclear?
"keep opt to keep per-thread"
What do you mean?
[...]
> >> +Variables with thread-local storage are allocated at the time of
> >> +thread creation, and exists until the thread terminates, for every
> >> +thread in the process. Only very small object should be allocated in
> >> +TLS, since large TLS objects significantly slows down thread creation
> >> +and may needlessly increase memory footprint for application that make
> >> +extensive use of unregistered threads.
> >
> > I don't understand the relation with non-DPDK threads.
>
> __thread isn't just for "DPDK threads". It will allocate memory on all
> threads in the process.
OK
It may be good to add this as a note.
> > [...]
> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> >
> > With #define RTE_MAX_LCORE_VAR 1048576,
> > LCORE_BUFFER_SIZE can be 100MB, right?
> >
>
> Sure. Unless you mlock the memory, it won't result in the DPDK process
> having 100MB worth of mostly-unused resident memory (RSS, in Linux
> speak). It would, were we to use huge pages and thus effectively disable
> demand paging.
>
> This is similar to how thread stacks generally work, where you often get
> a fairly sizable stack (e.g., 2MB) but as long as you don't use all of
> it, most of the pages won't be resident.
>
> If you want to guard against such mlocked scenarios, you could consider
> lowering the max variable size. You could argue it's strange to have a
> large RTE_MAX_LCORE_VAR and yet tell the API user to only use it for
> small, often-used blocks of memory.
>
> If RTE_MAX_LCORE_VAR should have a different value, what should it be?
That's fine
> >> +static void *lcore_buffer;
> >
> > It is the last buffer for all lcores.
> > The name suggests it is one single buffer per lcore.
> > What about "last_buffer" or "current_buffer"?
>
> Would "value_buffer" be better? Or "values_buffer", although that sounds
> awkward. "current_value_buffer".
>
> I agree lcore_buffer is very generic.
>
> The buffer holds values for all lcore ids, for one or (usually many)
> more lcore variables.
So you don't need to mention "lcore" in this variable's name.
The most important point is that it is the last buffer allocated, IMHO.
> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> >
> > A comment may be useful for this value: it triggers the first alloc?
>
> Yes. I will add a comment.
>
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> + void *handle;
> >> + unsigned int lcore_id;
> >> + void *value;
> >> +
> >> + offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> + if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> + lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> >> + RTE_CACHE_LINE_SIZE);
> >> +#else
> >> + lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> + LCORE_BUFFER_SIZE);
> >> +#endif
> >> + RTE_VERIFY(lcore_buffer != NULL);
> >
> > Please no panic in a lib.
> > You can return NULL.
>
> One could, but it would be a great cost to the API user.
>
> Something is seriously broken if these kind of allocations fail
> (considering when they occur and what size they are), just like
> something is seriously broken if the kernel fails (or is unwilling to)
> allocate pages used by static lcore id index arrays.
As said in another email,
we may return NULL in this function and keep RTE_VERIFY
in the declaration macros, for ease of use of the API.
That way the user can choose between an API which returns an error
and a simpler one based on macros.
> > [...]
> >> +#ifndef _RTE_LCORE_VAR_H_
> >> +#define _RTE_LCORE_VAR_H_
> >
> > Really we don't need the first and last underscores,
> > but it's a detail.
>
> I just follow the DPDK conventions here.
>
> I agree the conventions are wrong.
Such conventions are not consistent. Let's do the right thing.
> >> +
> >> +/**
> >> + * @file
> >> + *
> >> + * RTE Lcore variables
> >
> > Please don't say "RTE", it is just a prefix.
>
> OK.
>
> I just follow the DPDK conventions here as well, but sure, I'll change it.
Not really a convention.
> > You can replace it with "DPDK" if you really want to be specific.
> >
> >> + *
> >> + * This API provides a mechanism to create and access per-lcore id
> >> + * variables in a space- and cycle-efficient manner.
> >> + *
> >> + * A per-lcore id variable (or lcore variable for short) has one value
> >> + * for each EAL thread and registered non-EAL thread. There is one
> >> + * instance for each current and future lcore id-equipped thread, with
> >> + * a total of RTE_MAX_LCORE instances. The value of an lcore variable
> >> + * for a particular lcore id is independent from other values (for
> >> + * other lcore ids) within the same lcore variable.
> >> + *
> >> + * In order to access the values of an lcore variable, a handle is
> >> + * used. The type of the handle is a pointer to the value's type
> >> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> >> + * <code>uint32_t *</code>). The handle type is used to inform the
> >> + * access macros the type of the values. A handle may be passed
> >> + * between modules and threads just like any pointer, but its value
> >> + * must be treated as an opaque identifier. An allocated handle
> >> + * never has the value NULL.
> >
> > Most of the explanations here would be better hosted in the prog guide.
> > The Doxygen API is better suited for short and direct explanations.
>
> Yeah, maybe. Reworking this to the programming guide format and having
> that reviewed is a sizable undertaking though.
It is mostly a matter of moving text.
I'm on it, I can review quickly.
> >> + *
> >> + * @b Creation
> >> + *
> >> + * An lcore variable is created in two steps:
> >> + * 1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> >> + * 2. Allocate lcore variable storage and initialize the handle with
> >> + * a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> >> + * @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> >
> > *at* the time
> >
> >> + * module initialization, but may be done at any time.
> >
> > You mean it does not depend on EAL initialization?
>
> Lcore variables may be used before any other parts of the EAL have
> been initialized.
Please make it explicit.
> >> + *
> >> + * An lcore variable is not tied to the owning thread's lifetime. It's
> >> + * available for use by any thread immediately after having been
> >> + * allocated, and continues to be available throughout the lifetime of
> >> + * the EAL.
> >> + *
> >> + * Lcore variables cannot and need not be freed.
> >
> > I'm curious about that.
> > If EAL is closed, and the application continues its life,
> > then we want all this memory to be cleaned as well.
> > Do you know rte_eal_cleanup()?
>
> I think the primary reason you would like to free the buffers is to
> avoid false positives from tools like valgrind memcheck (if anyone
> managed to get that working with DPDK).
>
> rte_eal_cleanup() freeing the buffers and resetting the offset would
> make sense. That however would require the buffers to be tracked (e.g.,
> as a linked list).
>
> From a footprint point of view, TLS allocations and static arrays also
> aren't freed by rte_eal_cleanup().
They are not dynamic like this one.
I still think it is required.
Think about an application starting and stopping some DPDK modules;
it would be a serious leak.
> >> + * @b Access
> >> + *
> >> + * The value of any lcore variable for any lcore id may be accessed
> >> + * from any thread (including unregistered threads), but it should
> >> + * only be *frequently* read from or written to by the owner.
> >
> > Would be interesting to explain why.
>
> This is intended to be brief and false sharing is mentioned elsewhere.
>
> >> + *
> >> + * Values of the same lcore variable but owned by two different lcore
> >> + * ids may be frequently read or written by the owners without risking
> >> + * false sharing.
> >
> > Again you could explain why if you explained the storage layout.
> > What is the minimum object size to avoid false sharing?
>
> Your objects may be as small as you want, and you still do not risk
> false sharing. All objects for a particular lcore id are grouped
> together, spatially.
[...]
> >> + * are allocated from the libc heap. Heap allocation failures are
> >> + * treated as fatal.
> >
> > Why not handling as an error, so the app has a chance to cleanup before crash?
> >
>
> Because you don't want to put the burden on the user (app or
> DPDK-internal) to attempt to clean up such failures, which in practice
> will never occur, and in case they do, they are just among several such
> early-memory-allocation failures where the application code has no say
> in what should occur.
>
> What happens if the TLS allocations are so large, the main thread can't
> be created?
>
> What happens if the BSS section is so large (because of all our
> RTE_MAX_LCORE-sized arrays) so its pages can't be made resident in memory?
>
> Lcore variables aren't a dynamic allocation facility.
I understand that and I agree.
In case someone uses it as a dynamic facility via the function,
can we offer them a NULL return?
[...]
> >> +/**
> >> + * Allocate space for an lcore variable, and initialize its handle.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >
> > The lcore variables are initialized to zero, not the values.
> >
>
> "The lcore variables are initialized to zero" is the same as "The lcore
> variables' values are initialized to zero" in my world, since the only
> thing that can be initialized in a lcore variable is its values (or
> "value instances" or just "instances", not sure I'm consistent here).
OK
> > Don't you mention 0 in align?
>
> I don't understand the question. Are you asking why objects are
> worst-case aligned when RTE_LCORE_VAR_ALLOC_SIZE() is used? Rather than
> naturally aligned?
No, I just mean that the 0 align value is not documented here.
> Good question, in that case. I guess it would make more sense if they
> were naturally aligned. I just thought in terms of malloc() semantics,
> but maybe that's wrong.
[...]
> >> +#define RTE_LCORE_VAR_INIT(name) \
> >> + RTE_INIT(rte_lcore_var_init_ ## name) \
> >> + { \
> >> + RTE_LCORE_VAR_ALLOC(name); \
> >> + }
> >
> > I don't get the need for RTE_INIT macros.
>
> Check rte_power_intrinsics.c
I'll check later.
> I agree it's not obvious they are worth the API clutter.
>
> > It does not cover RTE_INIT_PRIO and anyway
> > another RTE_INIT is probably already there in the module.
> >> + */
> >> +static inline void *
> >> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> >
> > What a long name!
> > What about rte_lcore_var() ?
> >
>
> It's long but consistent with the rest of the API.
>
> This is not a function you will see called often in API user code.
> Most will use the access macros.
I let you discuss naming with Morten.
It seems he agrees with me about making it short.