[dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

Yongseok Koh yskoh at mellanox.com
Tue Jul 25 09:43:57 CEST 2017


On Sun, Jul 23, 2017 at 12:49:36PM +0300, Sagi Grimberg wrote:
> > > > mlx5_tx_complete() polls completion queue multiple times until it
> > > > encounters an invalid entry. As Tx completions are suppressed by
> > > > MLX5_TX_COMP_THRESH, it is waste of cycles to expect multiple completions
> > > > in a poll. And freeing too many buffers in a call can cause high jitter.
> > > > This patch improves throughput a little.
> > > 
> > > What if the device generates burst of completions?
> > mlx5 PMD suppresses completions anyway. It requests a completion per every
> > MLX5_TX_COMP_THRESH Tx mbufs, not every single mbuf. So, the size of completion
> > queue is even much small.
> 
> Yes I realize that, but can't the device still complete in a burst (of
> unsuppressed completions)? I mean its not guaranteed that for every
> txq_complete a signaled completion is pending right? What happens if
> the device has inconsistent completion pacing? Can't the sw grow a
> batch of completions if txq_complete will process a single completion
> unconditionally?
That's speculation. First of all, the device doesn't delay completion notifications for no
reason. An ASIC is not SW running on top of an OS. If a completion comes up late,
it means the device really can't keep up with the rate of posting descriptors. If so,
tx_burst() should generate back-pressure by returning a partial Tx count, and the app
can then decide between dropping and retrying. Retrying on Tx means back-pressuring
the Rx side if the app is forwarding packets.
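
To make the drop-or-retry point concrete, here is a minimal sketch of how a
forwarding app might handle a partial return from rte_eth_tx_burst(). Only the
DPDK calls are real; the retry policy and RETRY_MAX are invented for illustration.

/* Minimal sketch (not from any PMD or example app): retry a partial Tx a few
 * times, then drop. Only the DPDK calls are real; RETRY_MAX is made up. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RETRY_MAX 3	/* hypothetical bound on retries */

static void
send_burst(uint16_t port, uint16_t queue, struct rte_mbuf **pkts, uint16_t n)
{
	uint16_t sent = rte_eth_tx_burst(port, queue, pkts, n);
	unsigned int retries = 0;

	/* Retrying back-pressures the Rx side when the app is forwarding. */
	while (sent < n && retries++ < RETRY_MAX)
		sent += rte_eth_tx_burst(port, queue, pkts + sent, n - sent);

	/* Give up and drop whatever the device still couldn't take. */
	while (sent < n)
		rte_pktmbuf_free(pkts[sent++]);
}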

A more serious problem I expected was the case where the THRESH is smaller than the
burst size. In that case, txq->elts[] would be short of slots all the time. But
fortunately, the MLX PMD requests at most one completion per burst, not one per
THRESH packets.
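
For clarity, here is a toy model of that completion-suppression rule (not the
PMD code; everything except MLX5_TX_COMP_THRESH is invented):

/* Toy model (not PMD code): a completion is requested at most once per
 * tx_burst() call, and only after at least MLX5_TX_COMP_THRESH packets have
 * accumulated since the last request. All names here are invented. */
#include <stdbool.h>
#include <stdint.h>

#define MLX5_TX_COMP_THRESH 32	/* value taken from the mlx5 PMD defaults */

struct toy_txq {
	uint16_t elts_comp;	/* packets sent since the last requested CQE */
};

/* Returns true if the last descriptor of this burst should request a CQE. */
static bool
request_completion(struct toy_txq *txq, uint16_t nb_posted)
{
	txq->elts_comp += nb_posted;
	if (txq->elts_comp < MLX5_TX_COMP_THRESH)
		return false;
	txq->elts_comp = 0;
	return true;
}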

If there's some SW jitter in Tx processing, the Tx CQ can grow for sure. The
question I asked myself was, "when does it shrink?" It shrinks when the Tx burst is
light (burst size smaller than the THRESH), because mlx5_tx_complete() is called
every time tx_burst() is called. What if it keeps growing? Then dropping is
necessary and natural, as I mentioned above.

It doesn't make sense for SW to try to absorb every possible SW jitter; the cost is
too high. That is usually handled by increasing the queue depth. Keeping a steady
state is more important.

Rather, this patch helps reduce jitter. When I run a profiler, the most
cycle-consuming part of Tx is still freeing buffers. If we keep looping while the
CQE is valid, many buffers could be freed in a single call of mlx5_tx_complete()
at some point, causing a long delay. That would aggravate jitter.
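
To illustrate the behavioral change, here is a toy model (not the actual
mlx5_tx_complete(); all types and helpers are invented):

/* Toy model of the change: poll the completion queue once per call instead of
 * draining it in a loop, so the number of mbufs freed per call stays bounded. */
#include <stdint.h>

struct toy_cqe { uint8_t valid; uint16_t wqe_counter; };

struct toy_txq {
	struct toy_cqe *cq;	/* completion queue ring */
	uint16_t cq_ci;		/* CQ consumer index */
	uint16_t cq_mask;
	uint16_t elts_tail;	/* first not-yet-freed Tx entry */
};

/* Placeholder for freeing txq->elts[] up to the completed WQE counter. */
static void
free_elts_upto(struct toy_txq *txq, uint16_t wqe_counter)
{
	txq->elts_tail = wqe_counter;
}

static void
toy_tx_complete(struct toy_txq *txq)
{
	struct toy_cqe *cqe = &txq->cq[txq->cq_ci & txq->cq_mask];

	/* Before the patch this was a while-loop draining every valid CQE. */
	if (!cqe->valid)
		return;
	free_elts_upto(txq, cqe->wqe_counter);
	cqe->valid = 0;
	txq->cq_ci++;
}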

> > > Holding these completions un-reaped can theoretically cause resource stress on
> > > the corresponding mempool(s).
> > Can you make your point clearer? Do you think the "stress" can impact
> > performance? I think stress doesn't matter unless it is depleted. And app is
> > responsible for supplying enough mbufs considering the depth of all queues (max
> > # of outstanding mbufs).
> 
> I might be missing something, but # of outstanding mbufs should be
> relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs right?
> Why should the pool account for the entire TX queue depth (which can
> be very large)?
The reason is simple for an Rx queue. If the number of mbufs in the provisioned
mempool is less than the rxq depth, the PMD can't even initialize the device
successfully. The PMD doesn't keep a private mempool. So it is nonsensical to
provision fewer mbufs than the queue depth even if it isn't documented. It is obvious.

No mempool is assigned for Tx, and in this case the app isn't forced to prepare
enough mbufs to cover all the Tx queues. But the downside of that is significant
performance degradation. From the PMD's perspective, it just needs to avoid any
deadlock due to depletion. Even if freeing mbufs in bulk causes some resource
pressure on the app side, it is a fair trade-off for higher performance as long as
there's no deadlock. And as far as I can tell, most PMDs free mbufs in bulk, not
one by one, which is also good for cache locality.
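
For reference, the sizing convention used by the DPDK example applications looks
roughly like the sketch below: all Rx/Tx descriptors plus per-lcore burst and
mempool-cache headroom. The macro values here are illustrative, not mandated by
any PMD.

/* Sketch of mempool sizing in the spirit of the DPDK example apps: cover every
 * Rx/Tx descriptor plus per-lcore burst and cache headroom. Values are examples. */
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NB_RXD		512
#define NB_TXD		512
#define MAX_PKT_BURST	32
#define MEMPOOL_CACHE	256

static struct rte_mempool *
create_pktmbuf_pool(unsigned int nb_ports, unsigned int nb_lcores)
{
	unsigned int nb_mbufs = nb_ports * (NB_RXD + NB_TXD) +
				nb_lcores * (MAX_PKT_BURST + MEMPOOL_CACHE);

	return rte_pktmbuf_pool_create("mbuf_pool", nb_mbufs, MEMPOOL_CACHE,
				       0, RTE_MBUF_DEFAULT_BUF_SIZE,
				       rte_socket_id());
}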

Anyway, there are many cases depending on the packet processing mode -
fwd/rxonly/txonly - but I won't walk through all of them one by one.

> Is there a hard requirement documented somewhere that the application
> needs to account for the entire TX queue depths for sizing its mbuf
> pool?
If needed, we should document it, and that would be a good start for you to
contribute to the DPDK community. But think about the definition of Tx queue depth -
doesn't it mean that the queue can hold that many descriptors? Then the app should
prepare more mbufs than the queue depth it configured. In my understanding, there's
no point in having fewer mbufs than the total number of queue entries. If the
resource is scarce, what's the point of having a larger queue depth? It should have
a smaller queue.

> My question is with the proposed change, doesn't this mean that the
> application might need to allocate a bigger TX mbuf pool? Because the
> pmd can theoretically consume completions slower (as in multiple TX
> burst calls)?
No. Explained above.

[...]
> > > Perhaps an adaptive budget (based on online stats) perform better?
> > Please bring up any suggestion or submit a patch if any.
> 
> I was simply providing a review for the patch. I don't have the time
> to come up with a better patch unfortunately, but I still think its
> fair to raise a point.
Of course. I appreciate your time for the review. And keep in mind that nothing is
impossible in an open source community; I always like to discuss ideas with anyone.
I was just asking for more details about your suggestion if you wanted me to
implement it, rather than getting a one-sentence question :-)

> > Does "budget" mean the
> > threshold? If so, calculation of stats for adaptive threshold can impact single
> > core performance. With multiple cores, adjusting threshold doesn't affect much.
> 
> If you look at mlx5e driver in the kernel, it maintains online stats on
> its RX and TX queues. It maintain these stats mostly for adaptive
> interrupt moderation control (but not only).
> 
> I was suggesting maintaining per TX queue stats on average completions
> consumed for each TX burst call, and adjust the stopping condition
> according to a calculated stat.
In the case of interrupt mitigation, that could be beneficial because interrupt
handling is costly. But the beauty of DPDK is polling, isn't it?
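
Just to be clear about what we're weighing, an adaptive budget would look roughly
like the sketch below (an EWMA of completions consumed per burst). It is purely
illustrative, not part of the PMD, and maintaining the stat itself costs cycles on
every burst, which is exactly my concern.

/* Rough sketch of the discussed "adaptive budget" idea: an EWMA of completions
 * per tx_burst call drives how many CQEs the next poll may consume. Invented
 * names; fixed point is used to avoid floats in the datapath. */
#include <stdint.h>

struct comp_stats {
	uint32_t avg_q8;	/* EWMA of completions per burst, Q24.8 */
};

static uint16_t
adaptive_budget(struct comp_stats *s, uint16_t completed_now)
{
	/* avg = 7/8 * avg + 1/8 * sample, in Q24.8 fixed point. */
	s->avg_q8 = s->avg_q8 - (s->avg_q8 >> 3) + ((uint32_t)completed_now << 5);
	uint16_t budget = (uint16_t)(s->avg_q8 >> 8);
	/* Always allow at least one completion per call. */
	return budget ? budget : 1;
}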


And please remember to ack at the end of this discussion if you are okay with the
patch, so that it can get merged. One data point: single-core forwarding performance
of the vectorized PMD improves by more than 6% with this patch, and 6% is never small.

Thanks for your review again.

Yongseok

