[dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add

Long Li longli at microsoft.com
Tue Oct 13 19:14:32 CEST 2020


>Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while
>handling sub-device add
>
>On 09/10/20 20:30 +0000, Long Li wrote:
>> >Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error
>> >code while handling sub-device add
>> >
>> >On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
>> >> Hi,
>> >>
>> >> On 02/10/20 17:01 -0700, Long Li wrote:
>> >> > From: Long Li <longli at microsoft.com>
>> >> >
>> >> > When adding a sub-device, it's possible that the sub-device is
>> >> > configured successfully but later fails to start. This error
>> >> > should not be
>> >masked.
>> >>
>> >> Some of those errors are meant to be masked: -EIO, when the device
>> >> is marked as removed at the ethdev level (see eth_err() in
>rte_ethdev.c:819).
>> >>
>> >> > The driver needs to check the error status to prevent endless
>> >> > loop of trying to start the sub-device.
>> >>
>> >> If the ethdev layer error is due to the device being removed, and
>> >> failsafe loops on trying to sync the eth device to its own state,
>> >> then an RMV event should have been emitted but wasn't or it was
>> >> missed by failsafe.
>> >>
>> >> If the ethdev layer error is *not* due to the device being removed,
>> >> the error should be != -EIO, and sdev->remove should not be set, so
>> >> fs_err() should not mask it and it should be seen by the app.
>> >>
>> >> Can you provide the following details:
>> >>
>> >>  * What is the return code of rte_eth_dev_start() that is masked in your
>> >>    start loop?
>> >>
>> >>  * Is the device marked as removed in failsafe?
>> >>
>> >>  * Is the device marked as removed in ethdev?
>> >>
>> >>  * Was there an RMV event generated for the device? Whether yes or
>no,
>> >>    is it correct?
>> >>
>> >> Thanks,
>> >>
>> >
>> >Hello Li,
>> >
>> >I've found the previous mail thread [1] where you described how you
>> >got this error. In your description, you say that you try unplug then
>> >quick replug, before any event is processed?
>>
>> Hi Gaëtan,
>>
>> Sorry for getting back late. I had trouble with my email.
>>
>> I think the issue is that: when the failsafe driver tries to start the sub device,
>it may fail. The failsafe driver needs to deal with the failure. Here is another
>log that I captured:
>>
>> net_failsafe: Starting sub_device 0
>> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation
>> failure: Cannot allocate memory
>> net_mlx4: cannot initialize common RSS resources (queue 0): unable to
>> create Rx queue resources: Cannot allocate memory
>> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate
>> memory
>> net_failsafe: Starting sub_device 0
>> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation
>> failure: Cannot allocate memory
>> net_mlx4: cannot initialize common RSS resources (queue 0): unable to
>> create Rx queue resources: Cannot allocate memory
>> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate
>> memory
>>
>> It's called from fs_dev_start(). The MLX4 can't start, it's in a state being
>removed in ethdev. But this error is being masked by fs_err():
>>
>> static inline int
>> fs_err(struct sub_device *sdev, int err) {
>>         /* A device removal shouldn't be reported as an error. */
>>         if (sdev->remove == 1 || err == -EIO)
>>                 return rte_errno = 0;
>>         return err;
>> }
>>
>> So fs_dev_start() always return a success, returned to
>failsafe_eth_dev_state_sync(), which is repeated called from the alarm
>fs_hotplug_alarm(). This goes to an endless loop.
>>
>> I believe failsafe_eth_rmv_event_callback() has been called prior to this. It
>sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always
>returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-
>alarming itself.
>
>Ok, the issue here is that even with sdev->remove == 1, the device is not
>actually removed, but it should.
>
>I think we discussed it with Stephen off-list not too long ago actually.
>In his case the actual bug was in vdev_netvsc, but there was still this issue in
>failsafe. It is happening there:
>
>   void
>   failsafe_dev_remove(struct rte_eth_dev *dev)
>   {
>           struct sub_device *sdev;
>           uint8_t i;
>
>           FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE)
>                                         /* ^^^^^^^^^^^^ HERE is the bug */
>                   if (sdev->remove && fs_rxtx_clean(sdev)) {
>                           if (fs_lock(dev, 1) != 0)
>                                   return;
>                           fs_dev_stats_save(sdev);
>                           fs_dev_remove(sdev);
>                           fs_unlock(dev, 1);
>                   }
>   }
>
>A device with the remove flag will be cleaned up only if it is already in an active
>state. This is not correct, it should be removed in all cases -- even if it stayed
>stuck in PROBED state.
>
>Here is a fix I propose for this issue:
>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails.
>dpdk.org%2Farchives%2Fdev%2F2020-
>October%2F185921.html&data=04%7C01%7Clongli%40microsoft.com%7C
>0cfccf19cf3d42ee549008d86eba3d6e%7C72f988bf86f141af91ab2d7cd011db47
>%7C1%7C0%7C637381093902558861%7CUnknown%7CTWFpbGZsb3d8eyJWIjo
>iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C300
>0&sdata=j13foFEbNV7DqWxl2ahLloEZSbCO96lmPaqSfjU5KsQ%3D&r
>eserved=0
>
>Could you please test it and verify it solves your issue?

Thank you!

I have been testing this patch for 24 hours and it worked well.

>
>>
>> Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev-
>>remove is set, this will always return 0, regardless of err.
>>
>
>if sdev->remove == 1, there should not even be a call to sub-device start ops
>actually. The device is flagged for removal, next thing to happen is to remove
>it, even if the device reappeared on the bus in-between. The failsafe needs to
>consume the removed flag (close the device then remove it), then it will be
>ready to detect the new device and will probe it back from the beginning.
>
>> It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb.
>>
>> Thanks,
>>
>> Long
>>
>> >
>> >If that's the case, it seems a clear race condition, and an issue of
>> >missing the removal event of the device. I would not say yet that the
>> >bug is in failsafe, but it could be in ethdev.
>> >
>> >Can you please check whether the device removal event was properly
>> >generated in rte_ethdev? Failsafe (and any other hotplug support
>> >layer
>> >actually) will depend on it so it should be first checked to work.
>> >
>> >Thanks,
>> >
>> >[1]:
>>
>>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails
>.
>> >dpdk.org%2Farchives%2Fdev%2F2020-
>>
>>September%2F182977.html&data=02%7C01%7Clongli%40microsoft.co
>m
>>
>>%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011
>d
>>
>>b47%7C1%7C0%7C637378572568404734&sdata=DAUVxBJQtJx8mUSWlP
>I
>> >DtenhpMmdiMPpIczmxOdrCqE%3D&reserved=0
>> >
>> >--
>> >Gaëtan
>
>Regards,
>--
>Gaëtan


More information about the dev mailing list