[dpdk-dev] [PATCH] net/mlx5: fix link status initialization

Nélio Laranjeiro nelio.laranjeiro at 6wind.com
Mon Apr 9 15:26:41 CEST 2018


On Mon, Apr 09, 2018 at 12:28:04PM +0000, Shahaf Shuler wrote:
> Monday, April 9, 2018 11:28 AM, Nélio Laranjeiro:
> > Subject: Re: [PATCH] net/mlx5: fix link status initialization
> > 
> > On Sun, Apr 08, 2018 at 01:09:27PM +0000, Shahaf Shuler wrote:
> > > Thursday, April 5, 2018 9:51 AM, Nélio Laranjeiro:
> > > > Subject: Re: [PATCH] net/mlx5: fix link status initialization
> > > >
> > > > On Thu, Apr 05, 2018 at 05:35:57AM +0000, Shahaf Shuler wrote:
> > > > > Wednesday, April 4, 2018 3:11 PM, Nélio Laranjeiro:
> > > > > > Subject: Re: [PATCH] net/mlx5: fix link status initialization
> > > > > >
> > > > > > On Wed, Apr 04, 2018 at 09:58:33AM +0000, Shahaf Shuler wrote:
> > > > > > > Wednesday, April 4, 2018 10:30 AM, Nélio Laranjeiro:
> > > > > > > > Subject: Re: [PATCH] net/mlx5: fix link status
> > > > > > > > initialization
> > > > > > > >
> > > > > > > > On Tue, Apr 03, 2018 at 07:48:17AM +0300, Shahaf Shuler wrote:
> > >
> > > [..]
> > >
> > > > > >
> > > > > > According to your analysis, this is only necessary when the LCS
> > > > > > is configured in the device.  Why not adding this call to
> > > > > > mlx5_dev_interrupt_handler_install() which is responsible to
> > > > > > install the LCS callback.
> > > > >
> > > > > I think it is good practice whether or not LSC is set.
> > > > > The link status should be initialized to the correct value after the probe.
> > > >
> > > > There is no guarantee the link will be accurate, at probe time the
> > > > link may be up so internal information has a status up with a speed with
> > this patch.
> > > > The application probes a second port, in the mean time the link of
> > > > the first port goes down, the interrupt is still not installed and
> > > > the internal status becomes wrong (still up whereas the port is down).
> > > >
> > > > Finally at start, the device installs the handler, but the link is
> > > > still down whereas internally it is up, the application will call
> > > > rte_eth_link_get_nowait() which will directly copy the wrong
> > > > internal status to the application.
> > >
> > > This is not correct.
> > > Using Verbs, the async_fd on which link status interrupts are reported is
> > created on probe.
> > > Even if the interrupt handler is not installed, interrupts still
> > > trigger on this fd. They will be processed when the interrupt handler
> > > will be installed as part of the port start.
> > > So in fact you have the whole trace on the link status changes waiting
> > > to be processed upon port start.
> > 
> > Right, but in such case, this patch still does not solves the issue.
> > Until the dev_start() is called, the link may be inconsistent with the real
> > status.
> > 
> > example:
> > 
> >  pci_probe --> link is up.
> >  leaving pci probe, the link goes downs --> internally the PMD has a link up.
> > 
> > Until the dev_start() is called any call to rte_eth_link_get_nowait() will copy
> > the internal PMD status which is not accurate.
> > 
> > From this point, the issue seems to fully come from
> > rte_eth_link_get_nowait() which should not make any copy until the port is
> > not started, until then the link may be inconsistent between the driver and
> > the device.
> 
> Right. However fixing it only on the ethdev layer is not enough.
>
> Assume the link was up from the beginning.
> For the case were the call for rte_eth_link_get_nowait happens only
> after the port start, the link will still be wrong as it will be
> reported as down (and no interrupt to fix it). 
>
> So to summarize I plan to have this patch (with modification of the
> wait_to_complete) along with another fix for ethdev. 
> Do we have an agreement? 

Sorry but no, as MLX5 is not the only PMD impacted by this issue and a
solution can be done for all of them (ixgbe, virtio, MLX5 I did not
check the others).

Your proposition does not solves the following case neither, when the
link goes down and the application stops the port due to that, if the
application waits using the same API to restart the port, it will never
occur as the interrupts handlers are no more installed and thus the
internal link will never be updated remaining indefinitely down.  This
also impact the three PMD above.

Commit b77d21cc2364 ("ethdev: add link status get/set helper functions")
has been integrated a little too fast introducing this bug.

All those issues can be fixed by adding another statement in the if of
the function rte_eth_link_get_nowait()

-       if (dev->data->dev_conf.intr_conf.lsc)
+       if (dev->data->dev_conf.intr_conf.lsc && dev->data->dev_started)
                rte_eth_linkstatus_get(dev, eth_link);
        else {
                RTE_FUNC_PTR_OR_RET(*dev->dev_ops->link_update);
                (*dev->dev_ops->link_update)(dev, 1);
                *eth_link = dev->data->dev_link;
        }

Doing this fixes all PMD having the same bug, and remove your specific
patch for MLX5 which becomes useless.

Regards,

-- 
Nélio Laranjeiro
6WIND


More information about the dev mailing list