[dpdk-dev] Proposal of mbuf Race Condition Detection

jigsaw jigsaw at gmail.com
Fri Apr 14 10:51:26 CEST 2017


Hi all,

Thanks for the comments. It seems the proposal will not see wide usage in
applications.

thx &
rgds,
-qinglai

On Fri, Apr 14, 2017 at 3:03 AM, Stephen Hemminger <
stephen at networkplumber.org> wrote:

> On Thu, 13 Apr 2017 23:19:45 +0000
> "Ananyev, Konstantin" <konstantin.ananyev at intel.com> wrote:
>
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of jigsaw
> > > Sent: Thursday, April 13, 2017 9:59 PM
> > > To: Adrien Mazarguil <adrien.mazarguil at 6wind.com>
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
> > >
> > > Hi Adrien,
> > >
> > > Thanks for your comment.
> > >
> > > The LOCK/UNLOCK may be called by the user application only, for several
> > > reasons.
> > >
> > > 1. If the lib calls LOCK, the user application may be punished
> > > unexpectedly. Consider what happens if the Rx burst function calls LOCK
> > > on core #1, and the mbuf is then handed over from core #1 to core #2
> > > through an enqueue/dequeue operation, as below:
> > >
> > > Rx(1) --> Enqueue(1) --> Dequeue(2)
> > > LOCKED                       Panic!
> > >
> > > Core #2 will then panic because the mbuf is owned by core #1 without
> > > being released (a small sketch of this handoff follows these points).
> > >
> > > 2. Rx and Tx both usually work in burst mode, combined with prefetch
> > > operations, while LOCK and UNLOCK cannot work well with prefetch,
> > > because they require access to the mbuf header data.
> > >
> > > 3. The critical section tends to be small. Here we (the user
> > > application) need to find a balance: with a longer critical section we
> > > get more code protected, but a longer duration also leads to more
> > > complicated business logic.
> > >
> > > Hence, we urge the user application to use LOCK/UNLOCK with the common
> > > sense of critical sections.
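> > >
> > > Coming back to point 1, a minimal sketch of that handoff, for
> > > illustration only (assuming an rte_ring carries mbufs between the two
> > > cores; the function names are made up):
> > >
> > > #include <rte_mbuf.h>
> > > #include <rte_ring.h>
> > >
> > > /* Core #1: the Rx path has already called RTE_MBUF_LOCK(m) internally. */
> > > static void
> > > core1_handoff(struct rte_ring *r, struct rte_mbuf *m)
> > > {
> > >         /* The mbuf leaves core #1 while still LOCKed by it. */
> > >         rte_ring_enqueue(r, m);
> > > }
> > >
> > > /* Core #2: dequeues the same mbuf and touches its data. */
> > > static void
> > > core2_consume(struct rte_ring *r)
> > > {
> > >         void *obj;
> > >
> > >         if (rte_ring_dequeue(r, &obj) == 0) {
> > >                 struct rte_mbuf *m = obj;
> > >                 /* buf_addr is still NULL-ed by core #1's LOCK: panic here. */
> > >                 rte_pktmbuf_mtod(m, char *)[0] = 0;
> > >         }
> > > }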
> > >
> > > Take examples/l3fwd as an example. A proper LOCK/UNLOCK pair is shown
> > > below:
> > >
> > > ==========================================
> > > diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > index 7714a20..9db0190 100644
> > > --- a/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > +++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > > @@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > >
> > >         for (j = 0; j < n; j += 8) {
> > >
> > > +            RTE_MBUF_LOCK(pkts_burst[j]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+1]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+2]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+3]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+4]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+5]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+6]);
> > > +            RTE_MBUF_LOCK(pkts_burst[j+7]);
> > > +
> > >                 uint32_t pkt_type =
> > >                         pkts_burst[j]->packet_type &
> > >                         pkts_burst[j+1]->packet_type &
> > > @@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > >                 }
> > >         }
> > >
> > > -       for (; j < nb_rx; j++)
> > > +      for (; j < nb_rx; j++) {
> > > +              RTE_MBUF_LOCK(pkts_burst[j]);
> > >                 dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> > > +      }
> > >
> > >         send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> > >
> > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > index 1afa1f0..2938558 100644
> > > --- a/examples/l3fwd/l3fwd_sse.h
> > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > @@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
> > >
> > >         len = qconf->tx_mbufs[port].len;
> > >
> > > +      for (j = 0; j < num; ++j)
> > > +            RTE_MBUF_UNLOCK(m[j]);
> > > +
> > >         /*
> > >          * If TX buffer for that queue is empty, and we have enough packets,
> > >          * then send them straightway.
> > > @@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
> > >                 if (likely(pn != BAD_PORT))
> > >                         send_packetsx4(qconf, pn, pkts_burst + j, k);
> > >                 else
> > > -                       for (m = j; m != j + k; m++)
> > > +                      for (m = j; m != j + k; m++) {
> > > +                              RTE_MBUF_UNLOCK(pkts_burst[m]);
> > >                                 rte_pktmbuf_free(pkts_burst[m]);
> > > +                      }
> > >
> > >         }
> > >  }
> > > ==========================================
> > >
> > > Note that a data race may or may not have a visible consequence. If two
> > > cores unknowingly process the same mbuf at different times, they may
> > > never notice it; but if two cores access the same mbuf at the same
> > > physical time, the consequence is usually visible (a crash). We are not
> > > looking for a solution that captures every potential data race; instead,
> > > we are looking for a solution that can capture data races that happen
> > > simultaneously on two or more cores. Therefore, we do not need to extend
> > > the scope of locking as wide as possible. We will apply locking only
> > > when we are focusing on the processing of a single mbuf.
> > >
> > > In a real-life application, a core will spend quite some time on each
> > > mbuf. The interval ranges from a few hundred cycles to a few thousand
> > > cycles, and usually no more than a handful of mbufs are involved. This
> > > is an ideal use case for locking mbufs.
> > >
> > > I agree that the race detection shall not be compiled in by default,
> > > since mtod is called often, and every mtod call implies a visit to the
> > > per-core cache. Further, recursive calls of LOCK/UNLOCK shall be
> > > supported as well. But I don't think the refcnt logic should be taken
> > > into account; these two are orthogonal designs, IMO. ***** Pls correct
> > > me if I am wrong here. *****
> > >
> > > Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason.
> > > That is why I said LOCK/UNLOCK needs to survive mbuf allocation and
> > > initialization.
> > >
> > > Of course we need to support locking multiple mbufs at the same time.
> > > For each core, we will then reserve, say, 8 slots. It works exactly like
> > > a direct-mapped cache: we use a few bits of the mbuf address to locate
> > > its slot. If the slot is already occupied, we do an eviction; that is,
> > > the new mbuf takes the place of the old one, and the old one is then
> > > UNLOCKed, unfortunately.
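> > >
> > > For illustration only, a rough sketch of such a per-core, direct-mapped
> > > lock cache could look as below. The helper names, the slot count and the
> > > way the index is derived from the address are all placeholders, not a
> > > finished design; only rte_lcore_id(), rte_smp_mb() and RTE_MAX_LCORE are
> > > existing DPDK facilities:
> > >
> > > #include <stdint.h>
> > > #include <rte_mbuf.h>
> > > #include <rte_lcore.h>
> > > #include <rte_atomic.h>
> > >
> > > #define MBUF_LOCK_SLOTS 8       /* per-core slots, direct mapped */
> > >
> > > struct mbuf_lock_slot {
> > >         struct rte_mbuf *m;     /* owning mbuf, NULL if the slot is free */
> > >         void *saved_buf_addr;   /* original buf_addr, restored on UNLOCK */
> > > };
> > >
> > > static struct mbuf_lock_slot lock_cache[RTE_MAX_LCORE][MBUF_LOCK_SLOTS];
> > >
> > > /* Pick a slot from a few mbuf address bits, like a direct-mapped cache. */
> > > static inline struct mbuf_lock_slot *
> > > mbuf_lock_slot_of(struct rte_mbuf *m)
> > > {
> > >         unsigned int idx = ((uintptr_t)m >> 8) & (MBUF_LOCK_SLOTS - 1);
> > >
> > >         return &lock_cache[rte_lcore_id()][idx];
> > > }
> > >
> > > static inline void
> > > rte_mbuf_lock(struct rte_mbuf *m)
> > > {
> > >         struct mbuf_lock_slot *s = mbuf_lock_slot_of(m);
> > >
> > >         if (s->m != NULL && s->m != m) {
> > >                 /* Eviction: the old mbuf is implicitly UNLOCKed. */
> > >                 s->m->buf_addr = s->saved_buf_addr;
> > >         }
> > >         s->m = m;
> > >         s->saved_buf_addr = m->buf_addr;
> > >         m->buf_addr = NULL;     /* other cores now fault on access */
> > >         rte_smp_mb();
> > > }
> > >
> > > static inline void
> > > rte_mbuf_unlock(struct rte_mbuf *m)
> > > {
> > >         struct mbuf_lock_slot *s = mbuf_lock_slot_of(m);
> > >
> > >         if (s->m == m) {
> > >                 m->buf_addr = s->saved_buf_addr;
> > >                 s->m = NULL;
> > >         }
> > >         rte_smp_mb();
> > > }
> > >
> > > The RTE_MBUF_LOCK/RTE_MBUF_UNLOCK macros would then simply map to these
> > > helpers when the feature is compiled in.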
> > >
> > > Honestly I have not yet tried this approach in a real-life application.
> > > But I have been thinking about the problem of data race detection for a
> > > long time, and I found that the restrictions and requirements make this
> > > solution the only viable one. There are hundreds of papers published in
> > > the field of data race detection, but the lightest of the options has at
> > > least a 5x performance penalty, not to mention the space complexity,
> > > which makes them inapplicable in practice.
> > >
> > > Again, please, anyone who has had the same painful experience with data
> > > race bugs, share your concerns with me. It would be nice to come up with
> > > some practical device to address this problem. I believe 6WIND and other
> > > IP stack vendors must share the same feeling and opinion.
> > >
> > > thx &
> > > rgds,
> > >
> > > Qinglai
> > >
> > >
> > >
> > > On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil <
> > > adrien.mazarguil at 6wind.com> wrote:
> > >
> > > > Hi Qinglai,
> > > >
> > > > On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > > > > Hi,
> > > > >
> > > > > I have a proposal for mbuf race condition detection and I would like
> > > > > to get your opinions before committing any patch.
> > > > >
> > > > > Race conditions are the worst bugs I can think of, as they cause a
> > > > > crash long after the first crime scene, and the consequences are
> > > > > unpredictable and difficult to understand.
> > > > >
> > > > > Besides, race conditions are very difficult to reproduce. Usually we
> > > > > get a coredump from a live network, but the coredump itself delivers
> > > > > nearly zero info. We just know that the mbuf is somehow broken, and
> > > > > that it was perhaps overwritten by another core at some moment.
> > > > >
> > > > > There are tools such as Valgrind and ThreadSanitizer to capture this
> > > > > fault, but the overhead brought by these tools is too high for them
> > > > > to be deployed in performance tests, not to mention in a live
> > > > > network. And race conditions rarely happen under low pressure.
> > > > >
> > > > > Even worse, even in theory, the tools mentioned are not capable of
> > > > > capturing the scenario, because they require explicit calls to
> > > > > locking primitives such as pthread mutexes. Without explicit
> > > > > lock/unlock calls, the semantics are not understood by the tools.
> > > > >
> > > > > With such known restrictions, I have a proposal roughly as below.
> > > > >
> > > > > The idea is to ask the coder to do an explicit lock/unlock on each
> > > > > mbuf that matters. The pseudo code is as below:
> > > > >
> > > > >     RTE_MBUF_LOCK(m);
> > > > >     /* use mbuf as usual */
> > > > >     ...
> > > > >     RTE_MBUF_UNLOCK(m);
> > > > >
> > > > > During the protected critical section, only the core which holds the
> > > > > lock can access the mbuf; other cores, if they dare to use the mbuf,
> > > > > will be punished by a segfault.
> > > > >
> > > > > Since the proposal shall be feasible at least in performance tests,
> > > > > the overhead of locking and punishment must be small, and access to
> > > > > the mbuf must remain as usual.
> > > > >
> > > > > Hence, the RTE_MBUF_LOCK is actually doing the following (pseudo code):
> > > > >
> > > > > RTE_MBUF_LOCK(m)
> > > > > {
> > > > >     store_per_core_cache(m, m->buf_addr);
> > > > >     m->buf_addr = NULL;
> > > > >     mb(); // memory barrier
> > > > > }
> > > > >
> > > > > And RTE_MBUF_UNLOCK is simply the reverse:
> > > > >
> > > > > RTE_MBUF_UNLOCK(m)
> > > > > {
> > > > >     purge_per_core_cache(m);
> > > > >     m->buf_addr = load_per_core_cache(m);
> > > > >     mb();
> > > > > }
> > > > >
> > > > > As we can see, the idea is to overwrite m->buf_addr with NULL and to
> > > > > store the original value in a per-core cache. Other cores, when they
> > > > > try to access the mbuf during the critical section, will certainly be
> > > > > punished.
> > > > >
> > > > > Then of course we need to keep the owner core working. This is done
> > > > > by making changes to mtod, as below:
> > > > >
> > > > > #define rte_pktmbuf_mtod_offset(m, t, o) \
> > > > >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off + (o)))
> > > > >
> > > > > The per-core cache of mbufs works like a key-value database, which
> > > > > in turn works like a direct-mapped cache. If an eviction happens (a
> > > > > newly arriving mbuf must kick out an old one), then we restore the
> > > > > buf_addr of the evicted mbuf. Of course, RTE_MBUF_UNLOCK will then
> > > > > take care of such a situation.
> > > > >
> > > > > Note that we expect LOCK/UNLOCK to work even when the mbuf is freed
> > > > > and allocated, since we don't trust the refcnt. A double-free is
> > > > > actually very rare, but a data race can occur without a double-free.
> > > > > Therefore, we need to survive the allocation of the mbuf, that is,
> > > > > rte_pktmbuf_init, which resets the buf_addr.
> > > > >
> > > > > Further, other dereferences of buf_addr in rte_mbuf.h need to be
> > > > > revised, but that is not a big deal (I hope).
> > > > >
> > > > > The runtime overhead of this implementation is very lightweight, and
> > > > > it can probably be turned on even in a live network. And it is not
> > > > > intrusive at all: no change is needed in struct rte_mbuf; we just
> > > > > need O(N) space (where N is the number of cores) and O(1) runtime
> > > > > complexity.
> > > > >
> > > > > Alright, that is basically what is on my mind. Before any patch is
> > > > > provided, or any perf analysis is made, I would like to know what the
> > > > > risks are from a design point of view. Please leave your feedback.
> > > >
> > > > Your proposal makes sense but I'm not sure where developers should
> > > > call RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user
> > > > applications only? Is it to be used internally by DPDK as well? Does
> > > > it include PMDs?
> > > >
> > > > I think any overhead outside of debugging mode, as minimal as it is,
> > > > is too much of a "punishment" for well-behaving applications. The cost
> > > > of a memory barrier can be extremely high, and this is one of the
> > > > reasons DPDK processes mbufs in bulk every time it gets the chance.
> > > > RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK() should thus exist under some
> > > > compilation option disabled by default, and be thought of as an
> > > > additional debugging tool.
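> > > >
> > > > As a sketch of what I mean (the option name below is only an example,
> > > > not an existing DPDK build flag, and rte_mbuf_lock()/rte_mbuf_unlock()
> > > > stand in for whatever real implementation you end up with):
> > > >
> > > > /* In rte_mbuf.h, gated by a build-time debug option. */
> > > > #ifdef RTE_LIBRTE_MBUF_DEBUG_LOCK
> > > > #define RTE_MBUF_LOCK(m)   rte_mbuf_lock(m)
> > > > #define RTE_MBUF_UNLOCK(m) rte_mbuf_unlock(m)
> > > > #else
> > > > /* Compiles away entirely: zero overhead for well-behaving apps. */
> > > > #define RTE_MBUF_LOCK(m)   do { (void)(m); } while (0)
> > > > #define RTE_MBUF_UNLOCK(m) do { (void)(m); } while (0)
> > > > #endif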
> > > >
> > > > Also, the implementation would probably be more expensive than your
> > > > suggestion; in my opinion the reference count must be taken into
> > > > account and/or RTE_MBUF_LOCK() must be recursive, because this is how
> > > > mbufs work. Freeing mbufs multiple times is perfectly valid in many
> > > > cases.
> > > >
> > > > How can one lock several mbufs at once by the way?
> > > >
> > > > Since it affects performance, this can only make sense as a debugging
> > > > tool developers can use when they encounter some corruption issue they
> > > > cannot identify, somewhat like poisoning buffers before they are
> > > > freed. Not sure it's worth the trouble.
> >
> > I agree with Adrien here - it doesn't look worth it:
> > On one side, it adds quite a lot of overhead and complexity to the mbuf
> > code.
> > On the other side, its usage seems pretty limited - there would be a lot
> > of cases where it wouldn't be able to detect the race condition anyway.
> > So my opinion is NACK.
> > BTW, if your app is intentionally trying to read/write the same mbufs
> > simultaneously from different threads without some explicit
> > synchronization, then most likely there is something wrong with it.
> > Konstantin
>
> This is a one-off debug thing. If you have these kinds of issues, you
> really need to rethink your threading design. The examples have a clear
> thread model. If you follow the documentation about the process model, and
> don't try to reinvent your own, then this kind of hand-holding is not
> necessary.
>
> NAK as well.
>

