[PATCH v4 1/4] eal: add generic support for reading PMU events
Morten Brørup
mb at smartsharesystems.com
Wed Dec 14 11:41:24 CET 2022
+CC: Mattias, see my comment below about per-thread constructor for this
> From: Tomasz Duszynski [mailto:tduszynski at marvell.com]
> Sent: Wednesday, 14 December 2022 10.39
>
> Hello Morten,
>
> Thanks for review. Answers inline.
>
> [...]
>
> > > +/**
> > > + * @file
> > > + *
> > > + * PMU event tracing operations
> > > + *
> > > + * This file defines generic API and types necessary to setup PMU and
> > > + * read selected counters in runtime.
> > > + */
> > > +
> > > +/**
> > > + * A structure describing a group of events.
> > > + */
> > > +struct rte_pmu_event_group {
> > > + int *fds; /**< array of event descriptors */
> > > +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> >
> > There seems to be a lot of indirection involved here. Why are these
> > arrays not statically sized, instead of dynamically allocated?
> >
>
> Different architectures/PMUs impose limits on the number of
> simultaneously enabled counters. So, to relieve the pain of thinking
> about it and adding macros for each and every arch, I decided to
> allocate the number the user wants dynamically. The assumption also
> holds that the user knows about the trade-offs of using too many
> counters and hence will not enable too many events at once.
The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the fast path, for performance reasons.
Please use fixed size arrays instead of dynamically allocated arrays.
>
> > Also, what is the reason for hiding the type struct perf_event_mmap_page
> > **mmap_pages behind an opaque void **mmap_pages instead?
>
> I think the part doing mmap/munmap was written first, hence void **
> was chosen in the first place.
Please update it, so the actual type is reflected here.
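Combined with the fixed size arrays requested above, the struct could look along these lines (just a sketch; RTE_MAX_NUM_GROUP_EVENTS is a hypothetical name and value of my own, pick whatever limit makes sense):

	/* Hypothetical compile time limit on simultaneously enabled events. */
	#define RTE_MAX_NUM_GROUP_EVENTS 8

	struct rte_pmu_event_group {
		/* Typed, fixed size arrays; no pointer chasing in the fast path. */
		struct perf_event_mmap_page *mmap_pages[RTE_MAX_NUM_GROUP_EVENTS];
		int fds[RTE_MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
		bool enabled; /**< true if group was enabled on particular lcore */
	};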
>
> >
> > > +	bool enabled; /**< true if group was enabled on particular lcore */
> > > +};
> > > +
> > > +/**
> > > + * A structure describing an event.
> > > + */
> > > +struct rte_pmu_event {
> > > +	char *name; /** name of an event */
> > > +	int index; /** event index into fds/mmap_pages */
> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> > > +};
> > > +
> > > +/**
> > > + * A PMU state container.
> > > + */
> > > +struct rte_pmu {
> > > +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> > > +	int num_group_events; /**< number of events in a group */
> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
The event_list is used in the slow path only, so it can remain a list - i.e. no change requested here. :-)
> > > +};
> > > +
> > > +/** Pointer to the PMU state container */
> > > +extern struct rte_pmu *rte_pmu;
> >
> > Again, why not just extern struct rte_pmu, instead of dynamic
> > allocation?
> >
>
> No strong opinions here since this is a matter of personal preference.
> Can be removed
> in the next version.
Yes, please.
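I.e. simply:

	extern struct rte_pmu rte_pmu;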
>
> > > +
> > > +/** Each architecture supporting PMU needs to provide its own version */
> > > +#ifndef rte_pmu_pmc_read
> > > +#define rte_pmu_pmc_read(index) ({ 0; })
> > > +#endif
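For reference, an x86-64 implementation could look roughly like this (my sketch, not from this patch; RDPMC reads the counter selected by ECX into EDX:EAX):

	static __rte_always_inline uint64_t
	rte_pmu_pmc_read(int index)
	{
		uint64_t low, high;

		/* Read the performance counter selected by 'index'. */
		asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (index));

		return low | (high << 32);
	}
	#define rte_pmu_pmc_read rte_pmu_pmc_read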
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Read PMU counter.
> > > + *
> > > + * @param pc
> > > + * Pointer to the mmapped user page.
> > > + * @return
> > > + * Counter value read from hardware.
> > > + */
> > > +__rte_internal
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> > > +{
> > > + uint64_t offset, width, pmc = 0;
> > > + uint32_t seq, index;
> > > + int tries = 100;
> > > +
> > > + for (;;) {
As a matter of personal preference, I would write this loop differently:
+ for (tries = 100; tries != 0; tries--) {
> > > + seq = pc->lock;
> > > + rte_compiler_barrier();
> > > + index = pc->index;
> > > + offset = pc->offset;
> > > + width = pc->pmc_width;
> > > +
> > > + if (likely(pc->cap_user_rdpmc && index)) {
Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.
[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
> > > + pmc = rte_pmu_pmc_read(index - 1);
> > > + pmc <<= 64 - width;
> > > + pmc >>= 64 - width;
> > > + }
> > > +
> > > + rte_compiler_barrier();
> > > +
> > > + if (likely(pc->lock == seq))
> > > + return pmc + offset;
> > > +
> > > + if (--tries == 0) {
> > > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > > + break;
> > > + }
- Remove the 4 above lines of code, and move the debug log message to the end of the function instead.
> > > + }
+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > > +
> > > + return 0;
> > > +}
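Putting my two suggestions together, the function body could look along these lines (just a sketch):

	uint64_t offset, width, pmc = 0;
	uint32_t seq, index;
	int tries;

	for (tries = 100; tries != 0; tries--) {
		seq = pc->lock;
		rte_compiler_barrier();
		index = pc->index;
		offset = pc->offset;
		width = pc->pmc_width;

		if (likely(pc->cap_user_rdpmc && index)) {
			/* index in the mmapped page is 1-based. */
			pmc = rte_pmu_pmc_read(index - 1);
			/* Truncate to the counter width. */
			pmc <<= 64 - width;
			pmc >>= 64 - width;
		}

		rte_compiler_barrier();

		if (likely(pc->lock == seq))
			return pmc + offset;
	}

	RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");

	return 0;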
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Enable group of events for a given lcore.
> > > + *
> > > + * @param lcore_id
> > > + * The identifier of the lcore.
> > > + * @return
> > > + * 0 in case of success, negative value otherwise.
> > > + */
> > > +__rte_internal
> > > +int
> > > +rte_pmu_enable_group(int lcore_id);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Add event to the group of enabled events.
> > > + *
> > > + * @param name
> > > + * Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> > > + * @return
> > > + * Event index in case of success, negative value otherwise.
> > > + */
> > > +__rte_experimental
> > > +int
> > > +rte_pmu_add_event(const char *name);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Read hardware counter configured to count occurrences of an event.
> > > + *
> > > + * @param index
> > > + * Index of an event to be read.
> > > + * @return
> > > + * Event value read from register. In case of errors or lack of support
> > > + * 0 is returned. In other words, stream of zeros in a trace file
> > > + * indicates problem with reading particular PMU event register.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read(int index)
The index type can be changed from int to uint32_t. This also eliminates the "(index < 0" part of the comparison further below in this function.
> > > +{
> > > + int lcore_id = rte_lcore_id();
> > > + struct rte_pmu_event_group *group;
> > > + int ret;
> > > +
> > > + if (!rte_pmu)
> > > + return 0;
> > > +
> > > + group = &rte_pmu->group[lcore_id];
> > > + if (!group->enabled) {
Optimized: if (unlikely(!group->enabled)) {
> > > + ret = rte_pmu_enable_group(lcore_id);
> > > + if (ret)
> > > + return 0;
> > > +
> > > + group->enabled = true;
> > > + }
> >
> > Why is the group not enabled in the setup function,
> > rte_pmu_add_event(), instead of here, in the hot path?
> >
>
> When this is executed for the very first time, the CPU will obviously
> have more work to do, but afterwards the setup path is not taken,
> hence far fewer CPU cycles are required.
>
> Setup is executed solely by the main lcore, before the other lcores
> are launched, hence some info passed to the SYS_perf_event_open
> syscall is missing, the pid (via rte_gettid()) being an example here.
OK. Thank you for the explanation. Since this is impossible at setup time, it has to be done at runtime.
@Mattias: Another good example of something that would belong in per-thread constructors, as my suggested feature creep in [1].
[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
>
> > > +
> > > + if (index < 0 || index >= rte_pmu->num_group_events)
Optimized: if (unlikely(index >= rte_pmu.num_group_events))
> > > + return 0;
> > > +
> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> >
> > Using fixed size arrays instead of multiple indirections via pointers
> > is faster. It could be:
> >
> > return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >
> > With or without suggested performance improvements...
> >
> > Series-acked-by: Morten Brørup <mb at smartsharesystems.com>
>
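To sum up my suggestions, rte_pmu_read() could end up looking something like this (sketch, assuming the static rte_pmu instance and the typed fixed size arrays from above):

	static __rte_always_inline uint64_t
	rte_pmu_read(uint32_t index)
	{
		int lcore_id = rte_lcore_id();
		struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
		int ret;

		if (unlikely(!group->enabled)) {
			/* Lazy per-lcore setup on first read, as explained above. */
			ret = rte_pmu_enable_group(lcore_id);
			if (ret)
				return 0;

			group->enabled = true;
		}

		if (unlikely(index >= (uint32_t)rte_pmu.num_group_events))
			return 0;

		/* No cast needed once mmap_pages is properly typed. */
		return rte_pmu_read_userpage(group->mmap_pages[index]);
	}

Typical usage would remain unchanged, e.g.:

	int event;
	uint64_t cycles;

	event = rte_pmu_add_event("cpu-cycles"); /* example event name */
	...
	if (event >= 0)
		cycles = rte_pmu_read(event);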