[dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

Slava Ovsiienko viacheslavo at nvidia.com
Thu May 6 13:21:57 CEST 2021


Hi, Feifei

Sorry, I do not follow why we should get rid of the last wmb (the one after the
dev_gen update). We've rebuilt the global cache, so we should notify the other
agents that this has happened, and they should flush their local caches. The
dev_gen change has to be made visible to the other agents to trigger this
activity, and the second wmb is here to ensure that.

One more point: since registering a new MR or destroying an existing one
involves FW calls (via the kernel), it takes so many CPU cycles that the wmb
overhead is negligible.

Also, regarding this:

 > > Another question suddenly occurred to me: in order to keep the
 > > order that the global cache is rebuilt before "dev_gen" is updated, the
 > > wmb should be placed before updating "dev_gen" rather than after it.
 > > Otherwise, on out-of-order platforms, the current order cannot be kept.

it is not clear why ordering is important - the global cache update and the dev_gen change
happen under spinlock protection, so only the last wmb is meaningful.

To summarize, in my opinion:
- if you see some issue with the ordering of the global cache update/dev_gen signalling,
  could you please elaborate? I'm not sure we need to maintain an order there (due to the spinlock protection)
- the last rte_smp_wmb() after incrementing dev_gen should be kept intact

With best regards,
Slava

> -----Original Message-----
> From: Feifei Wang <Feifei.Wang2 at arm.com>
> Sent: Thursday, May 6, 2021 5:52
> To: Slava Ovsiienko <viacheslavo at nvidia.com>; Matan Azrad
> <matan at nvidia.com>; Shahaf Shuler <shahafs at nvidia.com>
> Cc: dev at dpdk.org; nd <nd at arm.com>; stable at dpdk.org; Ruifeng Wang
> <Ruifeng.Wang at arm.com>; nd <nd at arm.com>
> Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> cache
> 
> Hi, Slava
> 
> Would you have more comments about this patch?
> In my view, a single wmb before the "dev_gen" update is enough to
> synchronize.
> 
> Thanks very much for your attention.
> 
> 
> Best Regards
> Feifei
> 
> > -----Original Message-----
> > From: Feifei Wang
> > Sent: Tuesday, April 20, 2021 16:42
> > To: Slava Ovsiienko <viacheslavo at nvidia.com>; Matan Azrad
> > <matan at nvidia.com>; Shahaf Shuler <shahafs at nvidia.com>
> > Cc: dev at dpdk.org; nd <nd at arm.com>; stable at dpdk.org; Ruifeng Wang
> > <Ruifeng.Wang at arm.com>; nd <nd at arm.com>
> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > cache
> >
> > Hi, Slava
> >
> > I think the second wmb can be removed.
> > As far as I know, wmb is just a barrier that keeps the order between writes;
> > it cannot tell the CPU when to commit the changes.
> >
> > It is usually placed before a guard variable, to ensure that the guard
> > variable is updated only after the changes you want to release have been
> > made.
> >
> > For example, for the wmb after the global cache update/before altering
> > dev_gen, it ensures that the global cache is updated before dev_gen is
> > altered:
> > 1) If another agent loads the changed "dev_gen", it knows the global
> > cache has been updated.
> > 2) If another agent loads the unchanged "dev_gen", the global cache has
> > not been updated, and the local cache will not be flushed.
> >
> > As a result, we use the wmb and the guard variable "dev_gen" to make the
> > global cache update "visible".
> > "Visible" means that when the update of the guard variable "dev_gen" is
> > observed by other agents, they can also be sure the global cache has been
> > updated. Thus, just one wmb before altering dev_gen can ensure this.
> >
> > Best Regards
> > Feifei
> >
> > > -----Original Message-----
> > > From: Slava Ovsiienko <viacheslavo at nvidia.com>
> > > Sent: Tuesday, April 20, 2021 15:54
> > > To: Feifei Wang <Feifei.Wang2 at arm.com>; Matan Azrad
> > > <matan at nvidia.com>; Shahaf Shuler <shahafs at nvidia.com>
> > > Cc: dev at dpdk.org; nd <nd at arm.com>; stable at dpdk.org; Ruifeng Wang
> > > <Ruifeng.Wang at arm.com>; nd <nd at arm.com>; nd <nd at arm.com>; nd
> > > <nd at arm.com>
> > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > > cache
> > >
> > > Hi, Feifei
> > >
> > > In my opinion, there should be 2 barriers:
> > >  - after global cache update/before altering dev_gen, to ensure the
> > > correct order
> > >  - after altering dev_gen to make this change visible for other
> > > agents and to trigger local cache update
> > >
> > > With best regards,
> > > Slava
> > >
> > > > -----Original Message-----
> > > > From: Feifei Wang <Feifei.Wang2 at arm.com>
> > > > Sent: Tuesday, April 20, 2021 10:30
> > > > To: Slava Ovsiienko <viacheslavo at nvidia.com>; Matan Azrad
> > > > <matan at nvidia.com>; Shahaf Shuler <shahafs at nvidia.com>
> > > > Cc: dev at dpdk.org; nd <nd at arm.com>; stable at dpdk.org; Ruifeng Wang
> > > > <Ruifeng.Wang at arm.com>; nd <nd at arm.com>; nd <nd at arm.com>; nd
> > > > <nd at arm.com>
> > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > > Region cache
> > > >
> > > > Hi, Slava
> > > >
> > > > Another question suddenly occurred to me: in order to keep the
> > > > order that the global cache is rebuilt before "dev_gen" is updated,
> > > > the wmb should be placed before updating "dev_gen" rather than after it.
> > > > Otherwise, on out-of-order platforms, the current order cannot be kept.
> > > >
> > > > Thus, we should change the code as:
> > > > a) rebuild global cache;
> > > > b) rte_smp_wmb();
> > > > c) update dev_gen
> > > >
> > > > Best Regards
> > > > Feifei
> > > > > -----Original Message-----
> > > > > From: Feifei Wang
> > > > > Sent: Tuesday, April 20, 2021 13:54
> > > > > To: Slava Ovsiienko <viacheslavo at nvidia.com>; Matan Azrad
> > > > > <matan at nvidia.com>; Shahaf Shuler <shahafs at nvidia.com>
> > > > > Cc: dev at dpdk.org; nd <nd at arm.com>; stable at dpdk.org; Ruifeng Wang
> > > > > <Ruifeng.Wang at arm.com>; nd <nd at arm.com>; nd <nd at arm.com>
> > > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > > > > cache
> > > > >
> > > > > Hi, Slava
> > > > >
> > > > > Thanks very much for your explanation.
> > > > >
> > > > > I understand that the app can wait until all mbufs are returned to the
> > > > > memory pool and then free them; I agree with this.
> > > > >
> > > > > As a result, I will remove the bug-fix patch from this series
> > > > > and just replace the smp barrier with a C11 thread fence. Thanks
> > > > > very much for your patient explanation again.
> > > > >
> > > > > Best Regards
> > > > > Feifei
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Slava Ovsiienko <viacheslavo at nvidia.com>
> > > > > > Sent: Tuesday, April 20, 2021 2:51
> > > > > > To: Feifei Wang <Feifei.Wang2 at arm.com>; Matan Azrad
> > > > > > <matan at nvidia.com>; Shahaf Shuler <shahafs at nvidia.com>
> > > > > > Cc: dev at dpdk.org; nd <nd at arm.com>; stable at dpdk.org; Ruifeng Wang
> > > > > > <Ruifeng.Wang at arm.com>; nd <nd at arm.com>
> > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > > > > Region cache
> > > > > >
> > > > > > Hi, Feifei
> > > > > >
> > > > > > Please, see below
> > > > > >
> > > > > > ....
> > > > > >
> > > > > > > > Hi, Feifei
> > > > > > > >
> > > > > > > > Sorry, I do not follow what this patch fixes. Do we have
> > > > > > > > some issue/bug with MR cache in practice?
> > > > > > >
> > > > > > > This patch fixes a bug found by logical deduction; it has not
> > > > > > > actually been observed in practice.
> > > > > > >
> > > > > > > >
> > > > > > > > Each Tx queue has its own dedicated "local" cache of MRs
> > > > > > > > used to convert the buffer addresses of mbufs being
> > > > > > > > transmitted into LKeys (an HW-related entity handle), and
> > > > > > > > a "global" cache of all MRs registered on the device.
> > > > > > > >
> > > > > > > > AFAIK, the conversion in the datapath happens as follows:
> > > > > > > > - check the local queue cache flush request
> > > > > > > > - lookup in the local cache
> > > > > > > > - if not found:
> > > > > > > >   - acquire the lock for global cache read access
> > > > > > > >   - lookup in the global cache
> > > > > > > >   - release the global cache lock
> > > > > > > >
> > > > > > > > How the cache update happens on memory freeing/unregistering:
> > > > > > > > - acquire the lock for global cache write access
> > > > > > > > - [a] remove the relevant MRs from the global cache
> > > > > > > > - [b] set the local caches' flush request
> > > > > > > > - release the global cache lock
> > > > > > > >
> > > > > > > > If I understand correctly, your patch swaps [a] and [b],
> > > > > > > > so the local cache flush is requested earlier. What problem
> > > > > > > > does it solve?
> > > > > > > > The datapath is not supposed to hold any mbufs
> > > > > > > > referencing the memory being freed. The application must
> > > > > > > > ensure this and must not allocate new mbufs from the
> > > > > > > > memory regions being freed.
> > > > > > > > Hence, lookups for these MRs in the caches should not occur.
> > > > > > >
> > > > > > > On your first point: the application can take charge of
> > > > > > > preventing MR-freed memory from being allocated to the datapath.
> > > > > > >
> > > > > > > Does it mean that if there is an emergency for an MR fragment,
> > > > > > > such as hotplug, the application must inform the datapath in
> > > > > > > advance, this memory will not be allocated, and then the
> > > > > > > control path will free it? If the application can do
> > > > > > > this, I agree that this bug cannot happen.
> > > > > >
> > > > > > Actually, this is the only correct way for an application to operate.
> > > > > > Let's suppose we have some memory area that the application wants
> > > > > > to free.
> > > > > > ALL references to this area must be removed. If we have some
> > > > > > mbufs allocated from this area, it means we have a memory
> > > > > > pool created there.
> > > > > >
> > > > > > What the application should do:
> > > > > > - notify all its components/agents that the memory area is going to
> > > > > > be freed
> > > > > > - all components/agents free the mbufs they might own
> > > > > > - the PMD might not support freeing some mbufs (for example, ones
> > > > > > being sent and awaiting completion), so the app should just wait
> > > > > > - wait until all mbufs are returned to the memory pool (by
> > > > > > monitoring available obj == pool size)
> > > > > >
> > > > > > Otherwise it is dangerous to free the memory. If some
> > > > > > mbufs are still allocated, it does not matter whether buf address
> > > > > > to MR translation is involved. We just can't free the memory - the
> > > > > > mapping would be destroyed and might cause a segmentation fault in
> > > > > > SW or HW issues on DMA access to the unmapped memory. It is a
> > > > > > very generic safety approach - do not free memory that is still in use.
> > > > > > Hence, at the moment of freeing and unregistering the MR,
> > > > > > there MUST BE NO mbufs in flight referencing the addresses being freed.
> > > > > > No translation to an MR being invalidated can happen.
> > > > > >
> > > > > > >
> > > > > > > > On the other side, the cache flush has a negative effect - the
> > > > > > > > local cache becomes empty and can't provide translation
> > > > > > > > for other valid (not being removed) MRs, so the
> > > > > > > > translation has to look up the global cache, which is
> > > > > > > > locked now for rebuilding; this causes delays in the
> > > > > > > > datapath on acquiring the global cache lock.
> > > > > > > > So, I see some potential performance impact.
> > > > > > >
> > > > > > > If the above assumption is true, we can go to your second point.
> > > > > > > I think this is a tradeoff between cache coherence and
> > > > > > > performance.
> > > > > > >
> > > > > > > I understand your point that, though the global cache has
> > > > > > > changed, we should keep the valid MRs in the local cache as
> > > > > > > long as possible to preserve the fast lookup speed.
> > > > > > > Meanwhile, the local cache can be rebuilt later to
> > > > > > > reduce its waiting time for acquiring the global cache lock.
> > > > > > >
> > > > > > > However, this mechanism only keeps the performance
> > > > > > > unchanged for the first few mbufs.
> > > > > > > During the lkey searches for subsequent mbufs after 'dev_gen'
> > > > > > > is updated, it is still necessary to update the local cache,
> > > > > > > so the performance first drops and then recovers.
> > > > > > > Thus, with or without this patch, the
> > > > > > > performance will jitter for a certain period of time.
> > > > > >
> > > > > > The local cache should be updated to remove MRs that are no longer
> > > > > > valid. But we just flush the entire cache.
> > > > > > Let's suppose we have valid MR0 and MR1, and an invalid MRX, in the
> > > > > > local cache.
> > > > > > There is traffic in the datapath for MR0 and MR1, and no
> > > > > > traffic for MRX anymore.
> > > > > >
> > > > > > 1) If we do as you propose:
> > > > > > a) take the lock
> > > > > > b) request the local cache flush first - all of MR0, MR1, MRX will
> > > > > > be removed on translation in the datapath
> > > > > > c) update the global cache
> > > > > > d) release the lock
> > > > > > All traffic for the valid MR0 and MR1 will ALWAYS be blocked on the
> > > > > > lock taken for the cache update, from point b) till point d).
> > > > > >
> > > > > > 2) If we do as it is implemented now:
> > > > > > a) take the lock
> > > > > > b) update the global cache
> > > > > > c) request the local cache flush
> > > > > > d) release the lock
> > > > > > Traffic MIGHT be blocked ONLY for MRs not present in the local
> > > > > > cache (does not happen for MR0 and MR1, must not happen for MRX),
> > > > > > and the probability should be minor. Blocking can only happen from
> > > > > > c) till d) - quite a short period of time.
> > > > > >
> > > > > > To summarize, the difference between 1) and 2):
> > > > > >
> > > > > > Lock probability:
> > > > > > - 1) a lock ALWAYS happens for ANY MR translation after b),
> > > > > >   2) a lock MIGHT happen, for a cache miss ONLY, after c)
> > > > > >
> > > > > > Lock duration:
> > > > > > - 1) locked from b) till d),
> > > > > >   2) locked from c) till d), which seems to be much shorter.
> > > > > >
> > > > > > >
> > > > > > > Finally, in conclusion, I tend to think that the bottom
> > > > > > > layer can do more to ensure the correct execution of
> > > > > > > the program, which may have a negative impact on
> > > > > > > performance for a short time, but in the long run
> > > > > > > performance will eventually come back.
> > > > > > > Furthermore, maybe we should pay attention to
> > > > > > > performance in the stable period, and do our best to ensure
> > > > > > > the correctness of the program in case of emergencies.
> > > > > >
> > > > > > If we have some mbufs still allocated in memory being freed,
> > > > > > there is nothing to say about correctness - it is totally
> > > > > > incorrect. In my opinion, we should not think about how to mitigate
> > > > > > this incorrect behavior; we should not encourage application
> > > > > > developers to follow the wrong approaches.
> > > > > >
> > > > > > With best regards,
> > > > > > Slava
> > > > > >
> > > > > > >
> > > > > > > Best Regards
> > > > > > > Feifei
> > > > > > >
> > > > > > > > With best regards,
> > > > > > > > Slava
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Feifei Wang <feifei.wang2 at arm.com>
> > > > > > > > > Sent: Thursday, March 18, 2021 9:19
> > > > > > > > > To: Matan Azrad <matan at nvidia.com>; Shahaf Shuler
> > > > > > > > > <shahafs at nvidia.com>; Slava Ovsiienko
> > > > > > > > > <viacheslavo at nvidia.com>; Yongseok Koh
> > > > > > > > > <yskoh at mellanox.com>
> > > > > > > > > Cc: dev at dpdk.org; nd at arm.com; Feifei Wang <feifei.wang2 at arm.com>;
> > > > > > > > > stable at dpdk.org; Ruifeng Wang <ruifeng.wang at arm.com>
> > > > > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for
> > > > > > > > > Memory Region cache
> > > > > > > > >
> > > > > > > > > 'dev_gen' is a variable to inform other cores to flush
> > > > > > > > > their local caches when the global cache is rebuilt.
> > > > > > > > >
> > > > > > > > > However, if 'dev_gen' is updated after the global cache is
> > > > > > > > > rebuilt, other cores may load a wrong memory region lkey
> > > > > > > > > value from the old local cache.
> > > > > > > > >
> > > > > > > > > Timeslot        main core               worker core
> > > > > > > > >   1         rebuild global cache
> > > > > > > > >   2                                  load unchanged dev_gen
> > > > > > > > >   3            update dev_gen
> > > > > > > > >   4                                  look up old local cache
> > > > > > > > >
> > > > > > > > > From the example above, we can see that though the global
> > > > > > > > > cache is rebuilt, because dev_gen is not yet updated,
> > > > > > > > > the worker core may look up the old cache table and receive
> > > > > > > > > a wrong memory region lkey value.
> > > > > > > > >
> > > > > > > > > To fix this, updating 'dev_gen' should be moved before
> > > > > > > > > rebuilding the global cache, to inform the worker cores to
> > > > > > > > > flush their local caches when the global cache starts
> > > > > > > > > rebuilding. And a wmb can ensure the ordering of this process.
> > > > > > > > >
> > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region
> > > > > > > > > support")
> > > > > > > > > Cc: stable at dpdk.org
> > > > > > > > >
> > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang at arm.com>
> > > > > > > > > Signed-off-by: Feifei Wang <feifei.wang2 at arm.com>
> > > > > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang at arm.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/net/mlx5/mlx5_mr.c | 37 +++++++++++++++++--------------------
> > > > > > > > >  1 file changed, 17 insertions(+), 20 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> > > > > > > > > index da4e91fc2..7ce1d3e64 100644
> > > > > > > > > --- a/drivers/net/mlx5/mlx5_mr.c
> > > > > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c
> > > > > > > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_dev_ctx_shared *sh,
> > > > > > > > >  		rebuild = 1;
> > > > > > > > >  	}
> > > > > > > > >  	if (rebuild) {
> > > > > > > > > -		mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > > > > > > +		++sh->share_cache.dev_gen;
> > > > > > > > > +		DEBUG("broadcasting local cache flush, gen=%d",
> > > > > > > > > +			sh->share_cache.dev_gen);
> > > > > > > > > +
> > > > > > > > >  		/*
> > > > > > > > >  		 * Flush local caches by propagating invalidation across cores.
> > > > > > > > > -		 * rte_smp_wmb() is enough to synchronize this event. If one of
> > > > > > > > > -		 * freed memsegs is seen by other core, that means the memseg
> > > > > > > > > -		 * has been allocated by allocator, which will come after this
> > > > > > > > > -		 * free call. Therefore, this store instruction (incrementing
> > > > > > > > > -		 * generation below) will be guaranteed to be seen by other core
> > > > > > > > > -		 * before the core sees the newly allocated memory.
> > > > > > > > > +		 * rte_smp_wmb() is to keep the order that dev_gen updated before
> > > > > > > > > +		 * rebuilding global cache. Therefore, other core can flush their
> > > > > > > > > +		 * local cache on time.
> > > > > > > > > +		 * local cache on time.
> > > > > > > > >  		 */
> > > > > > > > > -		++sh->share_cache.dev_gen;
> > > > > > > > > -		DEBUG("broadcasting local cache flush, gen=%d",
> > > > > > > > > -		      sh->share_cache.dev_gen);
> > > > > > > > >  		rte_smp_wmb();
> > > > > > > > > +		mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > > > > > >  	}
> > > > > > > > >  	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
> > > > > > > > >  }
> > > > > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
> > > > > > > > >  	mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb);
> > > > > > > > >  	DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
> > > > > > > > >  	      (void *)mr);
> > > > > > > > > -	mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > > > > > > +
> > > > > > > > > +	++sh->share_cache.dev_gen;
> > > > > > > > > +	DEBUG("broadcasting local cache flush, gen=%d",
> > > > > > > > > +		sh->share_cache.dev_gen);
> > > > > > > > > +
> > > > > > > > >  	/*
> > > > > > > > >  	 * Flush local caches by propagating invalidation across cores.
> > > > > > > > > -	 * rte_smp_wmb() is enough to synchronize this event. If one of
> > > > > > > > > -	 * freed memsegs is seen by other core, that means the memseg
> > > > > > > > > -	 * has been allocated by allocator, which will come after this
> > > > > > > > > -	 * free call. Therefore, this store instruction (incrementing
> > > > > > > > > -	 * generation below) will be guaranteed to be seen by other core
> > > > > > > > > -	 * before the core sees the newly allocated memory.
> > > > > > > > > +	 * rte_smp_wmb() is to keep the order that dev_gen updated before
> > > > > > > > > +	 * rebuilding global cache. Therefore, other core can flush their
> > > > > > > > > +	 * local cache on time.
> > > > > > > > >  	 */
> > > > > > > > > -	++sh->share_cache.dev_gen;
> > > > > > > > > -	DEBUG("broadcasting local cache flush, gen=%d",
> > > > > > > > > -	      sh->share_cache.dev_gen);
> > > > > > > > >  	rte_smp_wmb();
> > > > > > > > > +	mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > > > > > >  	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
> > > > > > > > >  	return 0;
> > > > > > > > >  }
> > > > > > > > > --
> > > > > > > > > 2.25.1


