[dpdk-dev] Proposal of mbuf Race Condition Detection

Stephen Hemminger stephen at networkplumber.org
Fri Apr 14 02:03:06 CEST 2017


On Thu, 13 Apr 2017 23:19:45 +0000
"Ananyev, Konstantin" <konstantin.ananyev at intel.com> wrote:

> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of jigsaw
> > Sent: Thursday, April 13, 2017 9:59 PM
> > To: Adrien Mazarguil <adrien.mazarguil at 6wind.com>
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
> > 
> > Hi Adrien,
> > 
> > Thanks for your comment.
> > 
> > LOCK/UNLOCK may be called by the user application only. There are several
> > reasons:
> > 
> > 1. If the lib called LOCK, the user application might be punished unexpectedly.
> > Consider what happens if the Rx burst function calls LOCK on core #1, and the
> > mbuf is then handed over from core #1 to core #2 through an enqueue/dequeue
> > operation, as below:
> > 
> > Rx(1) --> Enqueue(1) --> Dequeue(2)
> > LOCKED                       Panic!
> > 
> > Core #2 will then panic, because the mbuf is owned by core #1 and was never
> > released.
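> > 
> > For illustration, a minimal sketch of such a handover, assuming a plain
> > rte_ring is used to pass mbufs between the two cores (the ring and the
> > function names here are hypothetical, not part of the proposal):
> > 
> > #include <rte_ethdev.h>
> > #include <rte_ring.h>
> > #include <rte_mbuf.h>
> > 
> > /* Core #1: receive and hand over. Had rte_eth_rx_burst() taken the
> >  * LOCK itself, every mbuf would still be owned by core #1 here. */
> > static void
> > rx_core(uint8_t port, struct rte_ring *ring)
> > {
> >         struct rte_mbuf *pkts[32];
> >         uint16_t i, n = rte_eth_rx_burst(port, 0, pkts, 32);
> > 
> >         for (i = 0; i < n; i++)
> >                 (void)rte_ring_enqueue(ring, pkts[i]);
> > }
> > 
> > /* Core #2: dequeue and touch the packet data. With the LOCK still
> >  * held by core #1, the first access to the buffer would segfault. */
> > static void
> > tx_core(struct rte_ring *ring)
> > {
> >         void *obj;
> > 
> >         if (rte_ring_dequeue(ring, &obj) == 0) {
> >                 struct rte_mbuf *m = obj;
> >                 char *data = rte_pktmbuf_mtod(m, char *); /* Panic! */
> >                 (void)data;
> >         }
> > }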
> > 
> > 2. Rx and Tx both usually work in burst mode, combined with prefetch
> > operations. LOCK and UNLOCK cannot work well with prefetching, because they
> > require immediate access to the mbuf header data.
> > 
> > 3. The critical section tends to be small. Here we (the user application)
> > need to find a balance: with a longer critical section we get better-protected
> > code; however, a longer duration leads to more complicated business logic.
> > 
> > Hence, we urge user applications to use LOCK/UNLOCK with the usual common
> > sense applied to critical sections.
> > 
> > Take examples/l3fwd as an example. A proper LOCK/UNLOCK pairing is shown
> > below:
> > 
> > ==========================================
> > diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > index 7714a20..9db0190 100644
> > --- a/examples/l3fwd/l3fwd_em_hlm_sse.h
> > +++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
> > @@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > 
> >         for (j = 0; j < n; j += 8) {
> > 
> > +               RTE_MBUF_LOCK(pkts_burst[j]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+1]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+2]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+3]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+4]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+5]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+6]);
> > +               RTE_MBUF_LOCK(pkts_burst[j+7]);
> > +
> >                 uint32_t pkt_type =
> >                         pkts_burst[j]->packet_type &
> >                         pkts_burst[j+1]->packet_type &
> > @@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> >                 }
> >         }
> > 
> > -       for (; j < nb_rx; j++)
> > +       for (; j < nb_rx; j++) {
> > +               RTE_MBUF_LOCK(pkts_burst[j]);
> >                 dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> > +       }
> > 
> >         send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> > 
> > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > index 1afa1f0..2938558 100644
> > --- a/examples/l3fwd/l3fwd_sse.h
> > +++ b/examples/l3fwd/l3fwd_sse.h
> > @@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port, struct rte_mbuf *m[],
> > 
> >         len = qconf->tx_mbufs[port].len;
> > 
> > +       for (j = 0; j < num; ++j)
> > +               RTE_MBUF_UNLOCK(m[j]);
> > +
> >         /*
> >          * If TX buffer for that queue is empty, and we have enough packets,
> >          * then send them straightway.
> > @@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
> >                 if (likely(pn != BAD_PORT))
> >                         send_packetsx4(qconf, pn, pkts_burst + j, k);
> >                 else
> > -                       for (m = j; m != j + k; m++)
> > +                       for (m = j; m != j + k; m++) {
> > +                               RTE_MBUF_UNLOCK(pkts_burst[m]);
> >                                 rte_pktmbuf_free(pkts_burst[m]);
> > +                       }
> > 
> >         }
> >  }
> > ==========================================
> > 
> > Note that a data race may or may not have visible consequences. If two cores
> > unknowingly process the same mbuf at different times, they may never notice
> > it; but if two cores access the same mbuf at the same physical time, the
> > consequence is usually visible (a crash). We are not looking for a solution
> > that captures every potential data race; instead, we are looking for one that
> > captures data races happening simultaneously on two or more cores. Therefore,
> > we do not need to extend the borders of locking as widely as possible; we
> > will apply locking only while we are focused on processing a single mbuf.
> > 
> > In a real-life application, a core will spend quite some time on each mbuf.
> > The interval ranges from a few hundred cycles to a few thousand cycles, and
> > usually no more than a handful of mbufs are involved. This is an ideal use
> > case for locking mbufs.
> > 
> > I agree that the race detection shall not be compiled in by default, since
> > mtod is called often, and every mtod then implies a visit to the local cache.
> > Further, recursive calls of LOCK/UNLOCK shall be supported as well. But I
> > don't think the refcnt logic should be taken into account; the two are
> > orthogonal designs, IMO. ***** Pls correct me if I am wrong here. *****
> > 
> > Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason. That
> > is why I said LOCK/UNLOCK needs to survive mbuf allocation and
> > initialization.
> > 
> > Of course we need to support locking multiple mbufs at the same time. For
> > each core, we will then reserve, say, 8 slots. These work exactly like a
> > direct-mapped cache: we can use 3 bits from the mbuf address to locate its
> > slot. If the slot is already occupied, we perform an eviction; that is, the
> > new mbuf takes the place of the old one, and the old one is then UNLOCKed,
> > unfortunately.
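> > 
> > A rough sketch of what such a per-core, direct-mapped slot table could look
> > like; the slot count, the address bits used for indexing, and the helper
> > names are all my own illustration, not a worked-out design:
> > 
> > #include <stdint.h>
> > #include <rte_mbuf.h>
> > #include <rte_lcore.h>
> > #include <rte_atomic.h>
> > 
> > #define MBUF_LOCK_SLOTS 8 /* per core; 3 address bits select a slot */
> > 
> > struct mbuf_lock_slot {
> >         struct rte_mbuf *m;   /* locked mbuf, NULL if the slot is free */
> >         void *saved_buf_addr; /* original m->buf_addr */
> > };
> > 
> > static RTE_DEFINE_PER_LCORE(struct mbuf_lock_slot,
> >         lock_slots[MBUF_LOCK_SLOTS]);
> > 
> > /* Direct mapping: use a few mbuf address bits, skipping the low bits,
> >  * which barely vary because of object alignment inside the mempool. */
> > static inline unsigned int
> > mbuf_lock_slot_idx(const struct rte_mbuf *m)
> > {
> >         return ((uintptr_t)m >> 11) & (MBUF_LOCK_SLOTS - 1);
> > }
> > 
> > static inline void
> > rte_mbuf_lock(struct rte_mbuf *m)
> > {
> >         struct mbuf_lock_slot *slot =
> >                 &RTE_PER_LCORE(lock_slots)[mbuf_lock_slot_idx(m)];
> > 
> >         if (slot->m != NULL) /* eviction: implicitly UNLOCK the old one */
> >                 slot->m->buf_addr = slot->saved_buf_addr;
> >         slot->m = m;
> >         slot->saved_buf_addr = m->buf_addr;
> >         m->buf_addr = NULL;
> >         rte_smp_wmb(); /* make the NULL visible to the other cores */
> > }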
> > 
> > Honestly, I have not yet tried this approach in a real-life application. But
> > I have been thinking about the problem of data race detection for a long
> > time, and I found that the restrictions and requirements make this solution
> > the only viable one. There are hundreds of papers published in the field of
> > data race detection, but the lightest of the options carries at least a 5x
> > performance penalty, not to mention the space complexity, making them
> > inapplicable in practice.
> > 
> > Again, please: anyone who has had the same painful experience with data race
> > bugs, share your concerns with me. It would be nice to come up with some
> > practical device to address this problem. I believe 6Wind and other IP stack
> > vendors must share the same feelings and opinions.
> > 
> > thx &
> > rgds,
> > 
> > Qinglai
> > 
> > 
> > 
> > On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil <adrien.mazarguil at 6wind.com> wrote:
> > 
> > > Hi Qinglai,
> > >
> > > On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:  
> > > > Hi,
> > > >
> > > > I have a proposal for mbuf race condition detection, and I would like to
> > > > get your opinions before committing any patch.
> > > >
> > > > A race condition is the worst kind of bug I can think of, as it causes a
> > > > crash long after the first crime scene, and the consequences are
> > > > unpredictable and difficult to understand.
> > > >
> > > > Besides, race conditions are very difficult to reproduce. Usually we get
> > > > a coredump from a live network, but the coredump itself delivers nearly
> > > > zero info. We just know that the mbuf is somehow broken, and that it was
> > > > perhaps overwritten by another core at some moment.
> > > >
> > > > There are tools such as Valgrind and ThreadSanitizer to capture this
> > > > fault, but the overhead brought by these tools is too high for them to be
> > > > deployed in performance testing, not to mention in a live network. And
> > > > race conditions rarely happen under low pressure.
> > > >
> > > > Even worse, the tools mentioned are not capable of capturing this
> > > > scenario even in theory, because they require explicit calls to locking
> > > > primitives such as pthread mutexes; the semantics are not understood by
> > > > the tools without explicit lock/unlock.
> > > >
> > > > With such known restrictions, I have a proposal roughly as below.
> > > >
> > > > The idea is to ask the coder to do an explicit lock/unlock on each mbuf
> > > > that matters. The pseudo code is as below:
> > > >
> > > >     RTE_MBUF_LOCK(m);
> > > >     /* use mbuf as usual */
> > > >     ...
> > > >     RTE_MBUF_UNLOCK(m);
> > > >
> > > > During the protected critical section, only the core that holds the lock
> > > > can access the mbuf; other cores, if they dare to touch the mbuf, will be
> > > > punished with a segfault.
> > > >
> > > > Since the proposal must be feasible at least in performance testing, the
> > > > overhead of locking and punishment must be small, and access to the mbuf
> > > > must remain as usual.
> > > >
> > > > Hence, RTE_MBUF_LOCK actually does the following (pseudo code):
> > > >
> > > > RTE_MBUF_LOCK(m)
> > > > {
> > > >     store_per_core_cache(m, m->buf_addr);
> > > >     m->buf_addr = NULL;
> > > >     mb(); // memory barrier
> > > > }
> > > >
> > > > And RTE_MBUF_UNLOCK is simply the reverse:
> > > >
> > > > RTE_MBUF_UNLOCK(m)
> > > > {
> > > >     m->buf_addr = load_per_core_cache(m);
> > > >     purge_per_core_cache(m);
> > > >     mb();
> > > > }
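> > > >
> > > > To make the pseudo code concrete, here is one possible shape of those
> > > > helpers, simplified to a single slot per core; the per-lcore layout is
> > > > an illustrative assumption, not a final design:
> > > >
> > > > #include <stdint.h>
> > > > #include <rte_mbuf.h>
> > > > #include <rte_lcore.h>
> > > >
> > > > /* Single-slot simplification: at most one locked mbuf per core. */
> > > > static RTE_DEFINE_PER_LCORE(struct rte_mbuf *, locked_mbuf);
> > > > static RTE_DEFINE_PER_LCORE(void *, locked_buf_addr);
> > > >
> > > > static inline void
> > > > store_per_core_cache(struct rte_mbuf *m, void *buf_addr)
> > > > {
> > > >     RTE_PER_LCORE(locked_mbuf) = m;
> > > >     RTE_PER_LCORE(locked_buf_addr) = buf_addr;
> > > > }
> > > >
> > > > /* Returns the saved buf_addr if the calling core holds the lock on m,
> > > >  * and 0 otherwise, so that the modified mtod below yields the correct
> > > >  * address whether or not the mbuf is locked by the calling core. */
> > > > static inline uintptr_t
> > > > load_per_core_cache(const struct rte_mbuf *m)
> > > > {
> > > >     if (RTE_PER_LCORE(locked_mbuf) == m)
> > > >         return (uintptr_t)RTE_PER_LCORE(locked_buf_addr);
> > > >     return 0;
> > > > }
> > > >
> > > > static inline void
> > > > purge_per_core_cache(struct rte_mbuf *m)
> > > > {
> > > >     if (RTE_PER_LCORE(locked_mbuf) == m)
> > > >         RTE_PER_LCORE(locked_mbuf) = NULL;
> > > > }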
> > > >
> > > > As we can see, the idea is to stash m->buf_addr in a per-core cache and
> > > > overwrite it with NULL. Other cores trying to access the mbuf during the
> > > > critical section will certainly be punished.
> > > >
> > > > Then, of course, we need to keep the owner core working. This is done by
> > > > making changes to mtod, as below:
> > > >
> > > > #define rte_pktmbuf_mtod_offset(m, t, o) \
> > > >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + \
> > > >         (m)->data_off + (o)))
> > > >
> > > > The per-core mbuf cache works like a key-value database, organized as a
> > > > direct-mapped cache; load_per_core_cache() must return 0 when the mbuf is
> > > > not in the calling core's cache, so that mtod still yields the correct
> > > > address for unlocked mbufs. If an eviction happens (a newly arriving mbuf
> > > > must kick out an old one), then we restore the buf_addr of the evicted
> > > > mbuf; RTE_MBUF_UNLOCK will then take care of this situation.
> > > >
> > > > Note that we expect LOCK/UNLOCK to work even across mbuf free and
> > > > allocation, since we don't trust the refcnt. A double-free is actually
> > > > very rare, but a data race can occur without a double-free. Therefore, we
> > > > need to survive mbuf allocation, that is, rte_pktmbuf_init, which resets
> > > > the buf_addr.
> > > >
> > > > Further, the other dereferences of buf_addr in rte_mbuf.h need to be
> > > > revised. But that is not a big deal (I hope).
> > > >
> > > > The runtime overhead of this implementation is very lightweight, and it
> > > > can probably be turned on even in a live network. And it is not intrusive
> > > > at all: no change is needed in struct rte_mbuf; we just need O(N) space
> > > > complexity (where N is the number of cores) and O(1) runtime complexity.
> > > >
> > > > Alright, that is basically what is in my mind. Before any patch is
> > > > provided, or any perf analysis is made, I would like to know what the
> > > > risks are from a design point of view. Please leave your feedback.
> > >
> > > Your proposal makes sense, but I'm not sure where developers should call
> > > RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user applications
> > > only? Is it to be used internally by DPDK as well? Does it include PMDs?
> > >
> > > I think any overhead outside of debugging mode, as minimal as it is, is too
> > > much of a "punishment" for well-behaving applications. The cost of a memory
> > > barrier can be extremely high, and this is one of the reasons DPDK processes
> > > mbufs in bulk every time it gets the chance. RTE_MBUF_LOCK() and
> > > RTE_MBUF_UNLOCK() should thus exist under some compilation option disabled
> > > by default, and be thought of as an additional debugging tool.
> > >
> > > Also, the implementation would probably be more expensive than your
> > > suggestion; in my opinion, the reference count must be taken into account
> > > and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> > > work. Freeing mbufs multiple times is perfectly valid in many cases.
> > >
> > > How can one lock several mbufs at once by the way?
> > >
> > > Since it affects performance, this can only make sense as a debugging tool
> > > developers can use when they encounter some corruption issue they cannot
> > > identify, somewhat like poisoning buffers before they are freed. Not sure
> > > it's worth the trouble.  
> 
> I agree with Adrien here - it doesn't look worth it:
> From one side, it adds quite a lot of overhead and complexity to the mbuf code.
> From the other side, its usage seems pretty limited - there would be a lot of
> cases where it wouldn't be able to detect the race condition anyway.
> So my opinion is NACK.
> BTW, if your app is intentionally trying to read/write the same mbufs
> simultaneously from different threads without some explicit synchronization,
> then most likely there is something wrong with it.
> Konstantin

This is a one-off debug thing. If you have this kind of issue, you really need to rethink
your threading design. The examples have a clear thread model. If you follow the documentation
about the process model, and don't try to reinvent your own, then this kind of hand-holding
is not necessary.

NAK as well.

