[dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

Sagi Grimberg sagi at grimberg.me
Thu Jul 27 13:12:41 CEST 2017


>> Yes, I realize that, but can't the device still complete in a burst (of
>> unsuppressed completions)? I mean, it's not guaranteed that for every
>> txq_complete a signaled completion is pending, right? What happens if
>> the device has inconsistent completion pacing? Can't the SW accumulate a
>> backlog of completions if txq_complete processes a single completion
>> unconditionally?
> Speculation. First of all, the device doesn't delay completion notifications
> for no reason. An ASIC is not SW running on top of an OS.

I'm sorry, but this statement is not correct. It might be true in a
lab environment, but in practice there are lots of things that can
affect the device's timing.

> If a completion comes up late,
> it means the device really can't keep up with the rate of posting descriptors.
> If so, tx_burst() should generate back-pressure by returning a partial Tx
> count, and the app can then decide between dropping and retrying. Retrying on
> Tx means back-pressuring the Rx side if the app is forwarding packets.

I'm not arguing with that; I was simply suggesting that better heuristics
could be applied than "process one completion unconditionally".
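
(Just to spell out that back-pressure path for anyone following along: on the
application side it amounts to roughly the following. This is only a sketch,
not code from the patch, and the helper name is mine.)

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical app-side helper: on a partial Tx return, decide between
 * drop and retry. Here the leftovers are simply dropped; a forwarding
 * app could retry instead and let the back-pressure reach its Rx side. */
static inline void
app_send_or_drop(uint16_t port_id, uint16_t queue_id,
		 struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

	while (sent < nb_pkts)
		rte_pktmbuf_free(pkts[sent++]);
}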

> A more serious problem I expected was the case where THRESH is smaller than
> the burst size. In that case, txq->elts[] will be short of slots all the time.
> But fortunately, in the MLX PMD, we request at most one completion per burst,
> not one per THRESH packets.
>
> If there's some SW jitter in Tx processing, the Tx CQ can certainly grow.
> The question to myself was "when does it shrink?". It shrinks when the Tx
> burst is light (burst size smaller than THRESH) because mlx5_tx_complete() is
> called every time tx_burst() is called. What if it keeps growing? Then a drop
> is necessary and natural, as I mentioned above.
>
> It doesn't make sense for SW to absorb every possible SW jitter; the cost is
> too high. That is usually done by increasing the queue depth. Keeping a
> steady state is more important.

Again, I agree that jitter is bad, but with proper heuristics in place mlx5
can still keep jitter low _and_ consume completions faster than one per
tx_burst invocation.
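
To make "proper heuristics" a bit more concrete, the simplest form is a small
per-call budget instead of exactly one completion. Rough sketch below, with
hypothetical names; none of this is existing mlx5 code:

#include <stdint.h>

/* Hypothetical, stripped-down queue state; the real txq holds much more. */
struct sketch_txq {
	uint16_t cq_ci;		/* CQ consumer index */
};

/* Stand-in for "if the next CQE is valid, consume it and free the buffers
 * it covers, then return 1"; the real logic is omitted in this sketch. */
static int
sketch_poll_one_completion(struct sketch_txq *txq)
{
	(void)txq;
	return 0;	/* placeholder: pretend no completion is pending */
}

#define SKETCH_COMP_BUDGET 4

/* Drain at most SKETCH_COMP_BUDGET completions per tx_burst() call: SW can
 * catch up when completions arrive in a burst, while the per-call work
 * (and therefore the jitter) stays bounded. */
static inline void
sketch_txq_complete(struct sketch_txq *txq)
{
	unsigned int n;

	for (n = 0; n < SKETCH_COMP_BUDGET; ++n)
		if (!sketch_poll_one_completion(txq))
			break;
}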

> Rather, this patch helps reduce jitter. When I run a profiler, the most
> cycle-consuming part on Tx is still freeing buffers. If we allowed looping
> over valid CQEs, many buffers could be freed in a single call of
> mlx5_tx_complete() at some point, and that would cause a long delay. This
> would aggravate jitter.

I didn't dispute that this patch addresses an issue, but mlx5 is a driver
designed to run under applications that can behave differently from your
test case.
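
On the buffer-freeing cost itself: grouping completed single-segment mbufs by
mempool and returning them with one bulk call is one generic way to keep the
per-buffer cost down. A sketch (illustrative only, not the mlx5 code, and
assuming single-segment mbufs):

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define FREE_BATCH 64

/* Return completed, single-segment mbufs to their mempool in bulk.
 * rte_pktmbuf_prefree_seg() returns NULL when the mbuf is still
 * referenced and must not be freed here. */
static void
free_completed_mbufs(struct rte_mbuf **elts, unsigned int n)
{
	void *batch[FREE_BATCH];
	struct rte_mempool *pool = NULL;
	unsigned int cnt = 0;
	unsigned int i;

	for (i = 0; i < n; ++i) {
		struct rte_mbuf *m = rte_pktmbuf_prefree_seg(elts[i]);

		if (m == NULL)
			continue;
		/* Flush the batch when it is full or the mempool changes. */
		if (cnt == FREE_BATCH || (cnt != 0 && m->pool != pool)) {
			rte_mempool_put_bulk(pool, batch, cnt);
			cnt = 0;
		}
		pool = m->pool;
		batch[cnt++] = m;
	}
	if (cnt != 0)
		rte_mempool_put_bulk(pool, batch, cnt);
}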

> Of course. I appreciate your time spent on the review. And keep in mind that
> nothing is impossible in an open source community; I always like to discuss
> ideas with anyone. But I was just asking for more details about your
> suggestion if you wanted me to implement it, rather than a one-sentence
> question :-)

Good to know.

>>> Does "budget" mean the
>>> threshold? If so, calculation of stats for adaptive threshold can impact single
>>> core performance. With multiple cores, adjusting threshold doesn't affect much.
>>
>> If you look at the mlx5e driver in the kernel, it maintains online stats on
>> its RX and TX queues. It maintains these stats mostly for adaptive
>> interrupt moderation control (but not only).
>>
>> I was suggesting maintaining per-TX-queue stats on the average number of
>> completions consumed per TX burst call, and adjusting the stopping
>> condition according to the calculated stat.
> In the case of interrupt mitigation, it could be beneficial because interrupt
> handling is very costly. But the beauty of DPDK is polling, isn't it?

If you read my comment again, I didn't suggest applying stats to interrupt
moderation; I just gave an example of a use case. I was suggesting
maintaining online stats to adjust a threshold for how many completions to
process in a tx_burst call (instead of processing one unconditionally).
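
Concretely, the bookkeeping I have in mind is tiny. A hedged sketch with
hypothetical names and fields (none of this is existing mlx5 code):

#include <stdint.h>

/* Hypothetical per-TX-queue stat: EWMA of completions consumed per
 * tx_burst() call, kept in fixed point (scaled by 16). */
struct sketch_txq_stats {
	int32_t avg_comp_x16;
};

/* Update the average with the latest sample and derive how many completions
 * the next completion-processing call should be willing to handle. The EWMA
 * weight of 1/16 keeps this to a few adds and shifts per burst. */
static inline unsigned int
sketch_comp_budget(struct sketch_txq_stats *s, unsigned int consumed_last)
{
	int32_t sample = (int32_t)consumed_last * 16;

	s->avg_comp_x16 += (sample - s->avg_comp_x16) / 16;
	/* Never let the budget drop below one completion. */
	return (unsigned int)(s->avg_comp_x16 / 16) + 1;
}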

> And please remember to ack at the end of this discussion if you are okay with
> it, so that this patch can get merged. One data point: single-core forwarding
> performance of the vectorized PMD improves by more than 6% with this patch.
> 6% is never small.

Yeah, I don't mind merging it, given that I don't have time to come up with
anything better (or worse :))

Reviewed-by: Sagi Grimberg <sagi at grimberg.me>

