[dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free

Neil Horman nhorman at tuxdriver.com
Fri Mar 27 13:44:51 CET 2015


On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Neil Horman
> > Sent: Friday, March 27, 2015 10:26 AM
> > To: Wiles, Keith
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> > 
> > On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> > >
> > >
> > > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss at linaro.org> wrote:
> > >
> > > >The current way is not the most efficient: if m->refcnt is 1, the second
> > > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> > > >condition fails again, although the code suggest otherwise to branch
> > > >prediction. Instead we should keep the second condition only, and remove
> > > >the
> > > >duplicate set to zero.
> > > >
> > > >Signed-off-by: Zoltan Kiss <zoltan.kiss at linaro.org>
> > > >---
> > > > lib/librte_mbuf/rte_mbuf.h | 5 +----
> > > > 1 file changed, 1 insertion(+), 4 deletions(-)
> > > >
> > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > > >index 17ba791..3ec4024 100644
> > > >--- a/lib/librte_mbuf/rte_mbuf.h
> > > >+++ b/lib/librte_mbuf/rte_mbuf.h
> > > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > > {
> > > > 	__rte_mbuf_sanity_check(m, 0);
> > > >
> > > >-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > > >-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > > >-
> > > >-		rte_mbuf_refcnt_set(m, 0);
> > > >+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > > >
> > > > 		/* if this is an indirect mbuf, then
> > > > 		 *  - detach mbuf
> > >
> > > I fell for this one too, but read Bruce¹s email
> > > http://dpdk.org/ml/archives/dev/2015-March/014481.html
> > 
> > This is still the right thing to do though, Bruce's reasoning is erroneous.
> 
> No, it is not. I believe Bruce comments is absolutely correct here.
> 
You and bruce are wrong, I proved that below.

> > Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
> 
> It does.
> 
assertions are meaningless without evidence.

> > are the last user of the mbuf,
> > you are only guaranteed that if the update
> > operation returns zero.
> > 
> > In other words:
> > rte_mbuf_refcnt_update(m, -1)
> > 
> > is an atomic operation
> > 
> > if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 
> > 
> > is not.
> > 
> > To illustrate, on two cpus, this might occur:
> > 
> > CPU0					CPU1
> > rte_mbuf_refcnt_read			...
> >    returns 1				rte_mbuf_refcnt_read
> > ...					   returns 1
> > execute if clause			execute if clause
> 
> 
> If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
> it is a bug in your code.
At what point in time did I indicate this was about multiple frees?  Please
re-read my post.

> Such code wouldn't work properly doesn't matter do we use:
> 
>  if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))
> 
> or just: 
> if (likely (rte_mbuf_refcnt_update(m, -1) == 0))
> 
> To illustrate it with your example:
> Suppose m.refcnt==1
> 
> CPU0 executes: 
> 
> rte_pktmbuf_free(m1)
>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> 
> m2 = rte_pktmbuf_alloc(pool);
>       /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */
> 
> /* m2 refcnt ==1 start using m2 */
> 
Really missing the point here.

> CPU1 executes:
> rte_pktmbuf_free(m1)
>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> 
> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content.
> 
Still missing the point. Please see below

> > 
> > In the above scenario both cpus fell into the if clause because they both held a
> > pointer to the same buffer and both got a return value of one, so they skipped
> > the update portion of the if clause and both executed the internal block of the
> > conditional expression.  you might be tempted to think thats ok, since that
> > block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> > entire purpose of that if conditional above was to ensure that only one
> > execution context ever executed the conditional for a given buffer.  Look at
> > what else happens in that conditional:
> > 
> > static inline struct rte_mbuf* __attribute__((always_inline))
> > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > {
> >         __rte_mbuf_sanity_check(m, 0);
> > 
> >         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 
> >                 rte_mbuf_refcnt_set(m, 0);
> > 
> >                 /* if this is an indirect mbuf, then
> >                  *  - detach mbuf
> >                  *  - free attached mbuf segment
> >                  */
> >                 if (RTE_MBUF_INDIRECT(m)) {
> >                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
> >                         rte_pktmbuf_detach(m);
> >                         if (rte_mbuf_refcnt_update(md, -1) == 0)
> >                                 __rte_mbuf_raw_free(md);
> >                 }
> >                 return(m);
> >         }
> >         return (NULL);
> > }
> > 
> > If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> > and in the scenario I outlined above, that refcnt will underflow, likely causing
> > a buffer leak.  Additionally, the return code of this function is designed to
> > indicate to the caller if they were the last user of the buffer.  In the above
> > scenario, two execution contexts will be told that they were, which is wrong.
> > 
> > Zoltans patch is a good fix
> 
> I don't think so.
> 
> 
> > 
> > Acked-by: Neil Horman <nhorman at tuxdriver.com>
> 
> 
> NACKed-by: Konstantin Ananyev <konstantin.ananyev at intel.com>
> 

Again, this has nothing to do with how many times you free an object and
everything to do with why you use atomics here in the first place.  The purpose
of the if conditional in the above code is to ensure that the contents of the
conditional block only get executed a single time, correct?  Ostensibly you
don't want to execution contexts getting in there at the same time right?

If you have a single buffer with refcnt=1, and two cpus are executing code that
points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around
the same time, they can race and both wind up in that conditional block, leading
to underflow of the md pointer refcnt, which is bad.

Lets look at another more practical example.  lets imagine that that the mbuf X
is linked into a set that multiple cpus can query. X->refcnt is held by CPU0,
and is about to be freed using the above refcnt test model (a read followed by
an update that gets squashed, anda refcnt set in the free block.  Basically this
pseudo code:

if (refcnt_read(X) == 1 || refcnt_update(X) == ) {
	refcnt_set(X,0)
	mbuf_free(X)
}

at the same time CPU1 is preforming a lookup of our needed mbuf from the
aforementioned set, finds it and takes a refcnt on it.


CPU0						CPU1
if(refcnt_read(X))				search for mbuf X
     returns 1					get pointer to X
...						refcnt_update(X,1)
refcnt_set(X, 0)				...
mbuf_free(X)


After the following sequence X is freed but CPU1 is left thinking that it has a
valid reference to the mbuf.  This is broken.

As an alternate thought experiment, why use atomics here at all?  X86 is cache
coherent right?  (ignore that new processor support, as this code predates it).
If all cpus are able to see a consistent state of a variable, and if every
context that has a pointer to a given mbuf also has a reference to an mbuf, then
it should be safe to simply use an integer here rather than an atomic, right?
If you know that you have a reference to a pointer, just decrement the refcnt
and check for 0 instead of one, that will tell you that you are the last user of
a buffer, right?  The answer is you can't because there are conditions in which
you either need to make a set of conditions atomic (finding a pointer and
increasing said refcnt under the protection of a lock), or you need some method
to predicate the execution of some initial or finilazation event (like in
__rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that
same init/finalization and so that you don't provide small windows of
inconsistency in your atomics, which is what you have above.

I wrote a demonstration program to illustrate (forgive me, its pretty quick and
dirty), but I think it illustrates the point:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_uint_fast64_t refcnt;

uint threads = 0;

static void * thread_exec(void *arg)
{
        int i;
        int cpu = (int)(arg);
        cpu_set_t cpuset;
        pthread_t thread;

        thread = pthread_self();
        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);
        pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

        for (i=0; i < 1000; i++) {
                if (((atomic_fetch_sub(&refcnt, 0) == 1) ||
                        atomic_fetch_sub(&refcnt, 1) == 0)) {
                        // There should only ever be one thread in here at a
                        atomic_init(&refcnt, 0);
                        threads |= cpu;
                        printf("threads = %d\n", threads);
                        threads &= ~cpu;

			// Need to reset the refcnt for future iterations
			// but that should be fine since no other thread
			// should be in here but us
                        atomic_init(&refcnt, 1);
                }
        }

        pthread_exit(NULL);
}

int main(int argc, char **argv)
{
        pthread_attr_t attr;
        pthread_t thread_id1, thread_id2;
        void *status;

        atomic_init(&refcnt, 1);

        pthread_attr_init(&attr);

        pthread_create(&thread_id1, &attr, thread_exec, (void *)1);
        pthread_create(&thread_id2, &attr, thread_exec, (void *)2);

        pthread_attr_destroy(&attr);

        pthread_join(thread_id1, &status);
        pthread_join(thread_id2, &status);


        exit(0);

}


If you run this on an smp system, you'll clearly see that, occasionally the
value of threads is 3.  That indicates that you have points where you have
multiple contexts executing in that conditional block that has clearly been
coded to only expect one.  You can't make the assumption that every pointer has
a held refcount here, you need to incur the update penalty.

Neil



More information about the dev mailing list