[dpdk-dev] Proposal of mbuf Race Condition Detection

jigsaw jigsaw at gmail.com
Thu Apr 13 15:38:19 CEST 2017


Hi,

I have a proposal for mbuf race condition detection and I would like to get
your opinions before
committing any patch.

A race condition is the worst bug I can think of: the crash comes long
after the first crime scene, and the consequences are unpredictable and
difficult to understand.

Besides, race conditions are very difficult to reproduce. Usually we get a
coredump from the live network, but the coredump itself delivers nearly zero
information. We just know that the mbuf is somehow broken, and that it was
perhaps overwritten by another core at some point.

There are tools such as Valgrind and ThreadSanitizer that can capture this
kind of fault, but the overhead they bring is too high to be deployed in
performance tests, not to mention in the live network. And race conditions
rarely happen under low pressure.

Even worse, these tools are, even in theory, not capable of capturing this
scenario, because they rely on explicit calls to locking primitives such as
pthread mutexes; without explicit lock/unlock, the semantics are simply not
understood by the tools.

Given these known restrictions, I have a proposal, roughly as below.

The idea is to ask the coder to do explicit lock/unlock on each mbuf that
matters. The pseudo code is as below:

    RTE_MBUF_LOCK(m);
    /* use mbuf as usual */
    ...
    RTE_MBUF_UNLOCK(m);
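
For example, in a hypothetical worker loop (the surrounding rx/tx calls are
just the usual DPDK API, not part of the proposal), the usage could look
like this:

    nb = rte_eth_rx_burst(port, queue, mbufs, BURST);
    for (i = 0; i < nb; i++) {
        struct rte_mbuf *m = mbufs[i];

        RTE_MBUF_LOCK(m);
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
        if (eth->ether_type == rte_cpu_to_be_16(ETHER_TYPE_IPv4)) {
            /* ... parse and modify the packet as usual ... */
        }
        RTE_MBUF_UNLOCK(m);
    }
    rte_eth_tx_burst(port, queue, mbufs, nb);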

During the protected critical section, only the core that holds the lock can
access the mbuf; any other core that dares to touch the mbuf is punished with
a segfault.

Since the proposal must be usable at least in performance tests, the
overhead of the locking and of the punishment must be small, and access to
the mbuf must remain as usual.

Hence, RTE_MBUF_LOCK actually does the following (pseudo code):

RTE_MBUF_LOCK(m)
{
    store_per_core_cache(m, m->buf_addr);
    m->buf_addr = NULL;
    mb(); // memory barrier
}

And RTE_MBUF_UNLOCK is simply the reverse:

RTE_MBUF_UNLOCK(m)
{
    m->buf_addr = load_per_core_cache(m);
    purge_per_core_cache(m);
    mb(); // memory barrier
}

As we can see, the idea is to overwrite m->buf_addr with NULL and to store
the original value in a per-core cache. The other cores, if they try to
access the mbuf during the critical section, are certain to be punished.

Then of course we need to keep the owner core working. This is done by
making a change to mtod, as below:

#define rte_pktmbuf_mtod_offset(m, t, o) \
    ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off + (o)))
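
This assumes that load_per_core_cache() returns the saved buf_addr (as an
integer offset) when the calling core holds the lock on m, and 0 otherwise.
The same macro then covers all three cases:

    owner core,  m locked:    NULL     + saved_buf_addr + data_off + o  -> valid pointer
    other core,  m locked:    NULL     + 0              + data_off + o  -> near-NULL, segfault
    any core,    m unlocked:  buf_addr + 0              + data_off + o  -> same as today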

The per-core cache of mbufs works like a key-value store, organized as a
direct-mapped cache. If an eviction happens (a newly arriving mbuf must kick
an old one out), we restore the buf_addr of the evicted mbuf right away;
RTE_MBUF_UNLOCK then takes care of this situation when the evicted mbuf is
eventually unlocked.
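
To make this concrete, a rough sketch of the per-core cache and of the
lock/unlock pair could look as below. Everything here is a placeholder: the
table size, the entry struct and the pointer hash are mine, the barrier is
written as rte_smp_wmb() just for illustration, and only
store/load/purge_per_core_cache and RTE_MBUF_LOCK/UNLOCK correspond to the
pseudo code above.

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>
#include <rte_atomic.h>
#include <rte_branch_prediction.h>

#define MBUF_LOCK_CACHE_SIZE 64  /* entries per core, power of two */

struct mbuf_lock_entry {
    struct rte_mbuf *m;   /* key: the locked mbuf, NULL if the slot is free */
    void *buf_addr;       /* saved original buf_addr */
};

static struct mbuf_lock_entry
mbuf_lock_cache[RTE_MAX_LCORE][MBUF_LOCK_CACHE_SIZE];

static inline struct mbuf_lock_entry *
mbuf_lock_slot(struct rte_mbuf *m)
{
    uintptr_t h = (uintptr_t)m >> 7;  /* crude direct-mapping hash */

    return &mbuf_lock_cache[rte_lcore_id()][h & (MBUF_LOCK_CACHE_SIZE - 1)];
}

static inline void
store_per_core_cache(struct rte_mbuf *m, void *buf_addr)
{
    struct mbuf_lock_entry *e = mbuf_lock_slot(m);

    /* Eviction: restore the previous victim before reusing the slot. */
    if (unlikely(e->m != NULL && e->m != m)) {
        e->m->buf_addr = e->buf_addr;
        rte_smp_wmb();
    }
    e->m = m;
    e->buf_addr = buf_addr;
}

static inline uintptr_t
load_per_core_cache(struct rte_mbuf *m)
{
    struct mbuf_lock_entry *e = mbuf_lock_slot(m);

    /* Returns 0 when this core does not hold the lock on m. */
    return (e->m == m) ? (uintptr_t)e->buf_addr : 0;
}

static inline void
purge_per_core_cache(struct rte_mbuf *m)
{
    struct mbuf_lock_entry *e = mbuf_lock_slot(m);

    if (e->m == m)
        e->m = NULL;
}

static inline void
RTE_MBUF_LOCK(struct rte_mbuf *m)
{
    store_per_core_cache(m, m->buf_addr);
    m->buf_addr = NULL;
    rte_smp_wmb();
}

static inline void
RTE_MBUF_UNLOCK(struct rte_mbuf *m)
{
    uintptr_t saved = load_per_core_cache(m);

    /* If m was evicted earlier, its buf_addr has already been restored. */
    if (saved != 0)
        m->buf_addr = (void *)saved;
    purge_per_core_cache(m);
    rte_smp_wmb();
}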

Note that we expect LOCK/UNLOCK to keep working even when the mbuf is freed
and allocated in between, since we don't trust the refcnt. A double free is
actually very rare, but a data race can occur without a double free.
Therefore, we need to survive the allocation of the mbuf, that is,
rte_pktmbuf_init, which resets the buf_addr.

Further, the other dereferences of buf_addr in rte_mbuf.h need to be
revised, but that is not a big deal (I hope).

The runtime overhead of this implementation is very light, and it can
probably be turned on even in the live network. And it is not intrusive at
all: no change is needed in struct rte_mbuf; we only need O(N) space (where
N is the number of cores) and O(1) runtime complexity.

Alright, that is basically what is in my mind. Before any patch is provided
or any perf analysis is made, I would like to know what the risks are from a
design point of view. Please leave your feedback.

Thanks &
Regards,

Qinglai

