[dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
Ananyev, Konstantin
konstantin.ananyev at intel.com
Mon Oct 12 14:50:00 CEST 2020
> >>
> >>>>>>>>> Add two new power management intrinsics, and provide an
> >>>>>>>>> implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet
> >>>>>>>>> widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is
> >>>>>>>>> reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a
> >>>>>>>>> memory
> >>>>>>>>> location is written to. The monitor function also provides an
> >>>>>>>>> optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>
> >>>>>>>> I think what this API is missing - a function to wakeup sleeping
> >>>>>>>> core.
> >>>>>>>> If user can/should use some system call to achieve that, then at
> >>>>>>>> least
> >>>>>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>>>>
> >>>>>>> I don't think it's possible to do that without severely
> >>>>>>> overcomplicating
> >>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>>>>> sleeping core would be to send some kind of interrupt to the
> >>>>>>> core, or
> >>>>>>> trigger a write to the cache-line in question.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I think we either need a syscall that would do an IPI for us
> >>>>>> (on top of my head - membarrier() does that, might be there are
> >>>>>> some other syscalls too),
> >>>>>> or something hand-made. For hand-made, I wonder would something
> >>>>>> like that
> >>>>>> be safe and sufficient:
> >>>>>> uint64_t val = atomic_load(addr);
> >>>>>> CAS(addr, val, &val);
> >>>>>> ?
> >>>>>> Anyway, one way or another - I think ability to wakeup core we put
> >>>>>> to sleep
> >>>>>> have to be an essential part of this feature.
> >>>>>> As I understand linux kernel will limit max amount of sleep time
> >>>>>> for these instructions:
> >>>>>> https://lwn.net/Articles/790920/
> >>>>>> But relying just on that, seems too vague for me:
> >>>>>> - user can adjust that value
> >>>>>> - wouldn't apply to older kernels and non-linux cases
> >>>>>> Konstantin
> >>>>>>
> >>>>>
> >>>>> This implies knowing the value the core is sleeping on.
> >>>>
> >>>> You don't the value to wait for, you just need an address.
> >>>> And you can make wakeup function to accept address as a parameter,
> >>>> same as monitor() does.
> >>>
> >>> Sorry, i meant the address. We don't know the address we're sleeping on.
> >>>
> >>>>
> >>>>> That's not
> >>>>> always the case - with this particular PMD power management scheme, we
> >>>>> get the address from the PMD and it stays inside the callback.
> >>>>
> >>>> That's fine - you can store address inside you callback metadata
> >>>> and do wakeup as part of _disable_ function.
> >>>>
> >>>
> >>> The address may be different, and by the time we access the address it
> >>> may become stale, so i don't see how that would help unless you're
> >>> suggesting to have some kind of synchronization mechanism there.
> >>
> >> Yes, we'll need something to sync here for sure.
> >> Sorry, I should say it straightway, to avoid further misunderstanding.
> >> Let say, associate a spin_lock with monitor(), by analogy with
> >> pthread_cond_wait().
> >> Konstantin
> >>
> >
> > The idea was to provide an intrinsic-like function - as in, raw
> > instruction call, without anything extra. We even added the masks/values
> > etc. only because there's no race-less way to combine UMONITOR/UMWAIT
> > without those.
>>
> > Perhaps we can provide a synchronize-able wrapper around it to avoid
> > adding overhead to calls that function but doesn't need the sync mechanism?
Yes, might be two flavours, something like
rte_power_monitor() and rte_power_monitor_sync()
or whatever would be a better name.
> >
>
> Also, how would having a spinlock help to synchronize? Are you
> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>
I thought about something very similar to cond_wait() working model:
/*
* Caller has to obtain lock before calling that function.
*/
static inline int rte_power_monitor_sync(const volatile void *p,
const uint64_t expected_value, const uint64_t value_mask,
const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
{
/* do whatever preparations are needed */
....
umonitor(p);
if (value_mask != 0 && *((const uint64_t *)p) & value_mask == expected_value) {
return 0;
}
/* release lock and go to sleep */
rte_spinlock_unlock(lck);
rflags = umwait();
/* grab lock back after wakeup */
rte_spinlock_lock(lck);
/* do rest of processing */
....
}
/* similar go cond_signal */
static inline void rte_power_monitor_wakeup(volatile void *p)
{
uint64_t v;
v = __atomic_load_n(p, __ATOMIC_RELAXED);
__atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
Now in librte_power:
struct pmd_queue_cfg {
/* to protect state and wait_addr */
rte_spinlock_t lck;
enum pmd_mgmt_state pwr_mgmt_state;
void *wait_addr;
/* rest fields */
....
} __rte_cache_aligned;
static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
uint16_t max_pkts __rte_unused, void *_ __rte_unused)
{
struct pmd_queue_cfg *q_conf;
q_conf = &port_cfg[port_id].queue_cfg[qidx];
if (unlikely(nb_rx == 0)) {
q_conf->empty_poll_stats++;
if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
volatile void *target_addr;
uint64_t expected, mask;
uint16_t ret;
/* grab the lock and check the state */
rte_spinlock_lock(&q_conf->lck);
If (q-conf->state == ENABLED) {
ret = rte_eth_get_wake_addr(port_id, qidx,
&target_addr, &expected, &mask);
If (ret == 0) {
q_conf->wait_addr = target_addr;
rte_power_monitor(target_addr, ..., &q_conf->lck);
}
/* reset the wait_addr */
q_conf->wait_addr = NULL;
}
rte_spinlock_unlock(&q_conf->lck);
....
}
nt
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
uint16_t port_id,
uint16_t queue_id)
{
...
/* grab the lock and change the state */
rte_spinlock_lock(&q_conf->lck);
queue_cfg->state = DISABLED;
/* wakeup if necessary */
If (queue_cfg->wakeup_addr != NULL)
rte_power_monitor_wakeup(queue_cfg->wakeup_addr);
rte_spinlock_unlock(&q_conf->lck);
...
}
More information about the dev
mailing list