[dpdk-stable] [dpdk-dev] [PATCH] net/mlx5: fix Tx doorbell memory barrier
yskoh at mellanox.com
Mon Oct 23 00:01:04 CEST 2017
On Sun, Oct 22, 2017 at 12:46:53PM +0300, Sagi Grimberg wrote:
> > Configuring UAR as IO-mapped makes maximum throughput decline by a noticeable
> > amount. If UAR is configured as a write-combining register, a write memory
> > barrier is needed when ringing a doorbell. rte_wmb() is mostly effective when
> > the size of a burst is comparatively small.
> Personally I don't think that the flag is really a good interface
> choice. But I'm also not convinced that it's dependent on the burst size.
> What guarantees that even for larger bursts the mmio write was flushed?
> It comes after a set of writes that were flushed prior to the db update,
> and it's not guaranteed that the application will immediately have more
> data to trigger these writes to flush.
Yes, I'm already aware of that concern. I'm not sure if you noticed, but it can
only happen when the burst size is an exact multiple of 32 in the vectorized Tx.
If you look at mlx5_tx_burst_raw_vec(), any Tx burst of more than 32 packets
results in txq_burst_v() being called more than once. For example, if pkts_n is
45, it first calls txq_burst_v(32), then txq_burst_v(13), which sets the barrier
at the end. The only pitfall is when pkts_n is an exact multiple of 32, e.g. 32,
64, 96 and so on. This is unlikely when an app is forwarding packets and the
packet rate is low (if the packet rate is high, we are fine either way).
So, the only problematic case is an app that generates traffic at a
comparatively low rate in a bursty way, with the burst size being a multiple of
32. If a user encounters such a rare case and latency is critical in their app,
we will recommend setting MLX5_SHUT_UP_BF=1, either by exporting it in a shell
or by setting it in the app's initialization. Alternatively, they can use one of
the non-vectorized tx_burst() functions, where the barrier is still enforced, as
you originally suggested.
It is always true that we can't satisfy everyone. Some apps prefer better
performance to better latency. As the vectorized Tx outperforms all the other
tx_burst() functions, I want to leave it as the only exception. Actually, we
have already received a complaint that single-core performance of the vPMD
declined by 10% (53 Mpps -> 48 Mpps) due to the patch (MLX5_SHUT_UP_BF=1). So I
wanted to give users/apps more versatile options/knobs.
Before sending out this patch, I ran RFC2544 latency tests with Ixia, and the
result was as good as before (actually the same). That's why we think it is a
good compromise.
Thanks for your comment,