[dpdk-dev] [PATCH] eal: fix to set the rte_device ptr's device args before hotplug

Gaetan Rivet grive at u256.net
Mon Feb 10 16:00:15 CET 2020


On 10/02/2020 15:34, Gaetan Rivet wrote:
> Hi Somnath,
> 
> Reformating the mails, to keep with the inner-posting (mixing top-posting and inner-posting
> makes it hard to follow). See the end of the mail.
> 
> 

[...]

>> Any updates on this ?
>> My thoughts on this are just as I'd suspected / suggested earlier the
>> parent device (06:02:0) gets probed once
>> during ovs-vswitchd bring up right after these 2 cmds
>>
>> ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
>> # Now start ovs-vswitchd (--no-ovsdb-server, since it is already started)
>> ovs-ctl --no-ovsdb-server --db-sock="$DB_SOCK" start
>>
>> After this point the pci probe was invoked on the parent device, but
>> without any devargs hence it is NULL.
>>
>> Now when the 'ovs add-port' cmd is issued to add a representor port
>> for example say 06:02:01 (using 06:02:00 as parent device)
>> ovs-vsctl add-port ovsbr0 vfrep1 -- set Interface vfrep1 type=dpdk
>> options:dpdk-devargs=0000:06:02.0,representor=[1]
>>
>> What happens is
>>
>> *before* pci_name_set()
>> p dev
>> $1 = (struct rte_pci_device *) 0x55bd6b0
>> (gdb) p /x *dev
>> $2 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, device = {next =
>> {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x0, driver = 0x0, bus =
>> 0x3a59a00, numa_node = 0x0,
>>      devargs = 0x0}, addr = {domain = 0x0, bus = 0x6, devid = 0x2,
>> function = 0x0}.......}
>>
>> *after* pci_name_set()
>>
>> p /x *dev
>> $3 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, device = {next =
>> {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x0, driver = 0x0, bus =
>> 0x3a59a00, numa_node = 0x0,
>>      devargs = 0x55a4ae0},addr = {domain = 0x0, bus = 0x6, devid = 0x2,
>> function = 0x0}......}
>>
>> As you can see, 'devargs' for the pci_device is now populated ..
>>
>> But if you go further down in pci_scan_one()
>>
>> pci_scan_one (dirname=0x7ffea330a5e0
>> "/sys/bus/pci/devices/0000:06:02.0", addr=0x7ffea330a5d0)
>>      at /root/dpdk-int_nxt/drivers/bus/pci/linux/pci.c:351
>> 351                                     if (!rte_dev_is_probed(&dev2->device)) {
>> (gdb) p dev2
>> $5 = (struct rte_pci_device *) 0x54de5e0
>> (gdb) p /x *dev2
>> $6 = {next = {tqe_next = 0x5307460, tqe_prev = 0x5539f80}, device =
>> {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x54e4f00, driver =
>> 0x3a6b7f0,
>>      bus = 0x3a59a00, numa_node = 0x0, devargs = 0x0}, addr = {domain =
>> 0x0, bus = 0x6, devid = 0x2, function = 0x0}
>>
>> we hit the pci_device ( 0x54de5e0) that was created during the first
>> time probe of the parent device ?
>> Since it is already probed, it goes to line 381 where it does frees
>> the just allocated 'pci_device'  inside pci_scan_one() via 'free(dev)'
>>
>> As you can see this pci_device does not have it's devargs set (rightly
>> so as initially there were no devargs for this device)
>>
>> But as shown in the stack trace in my previous reply, when
>> pci_find_device() walks the rte_pci_devices list , it finds this
>> earlier probed device (without devargs)
>>
> 
> Alright, I think you are right, this is what is happening.
> The issue IMO, is that the PCI scan is thus hitting an already existing device, but we have
> missed the case where the new device has more info that the previous one (linked devargs).
> 
>> pci_find_device (start=0x0, cmp=0x4b92d8 <cmp_dev_name>,
>> data=0x55a4af8) at /root/dpdk-19.11/drivers/bus/pci/pci_common.c:426
>> 426                             return &pdev->device;
>> (gdb) p pdev
>> $15 = (struct rte_pci_device *) 0x54de5e0
>>
>> (gdb) p /x *pdev
>> $16 = {next = {tqe_next = 0x5307460, tqe_prev = 0x5539f80}, device =
>> {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x54e4f00, driver =
>> 0x3a6b7f0,
>>      bus = 0x3a59a00, numa_node = 0x0, devargs = 0x0}, addr = {domain =
>> 0x0, bus = 0x6, devid = 0x2, function = 0x0},
>>
>> Hope this explains why the pci_device has NULL devargs at this point
>> and how my fix to set it at this point helps solve the problem?
>>
>> Please let me know your thoughts
>>
> 
> I think the proper fix is instead to have a clean update during the PCI scan.
> The proper way is to keep the old device, but override its fields as new info was found.
> 
> Something like calling pci_name_set(dev2); line 359 maybe. BSD should also be updated for consistency.
> 
> My issue with your patch is that I think this issue is very specific to the PCI bus and the capabilities
> of some devices on it. It would be better to have a fully compliant scan + probe ops considering
> the supported capabilities, instead of forcing this on all buses.
> 
> I'm wondering whether this can happen with an already existing devargs? If so, is it more correct to keep
> the new one, or ignore the probe, free the new devargs and report an error? If the former,
> please clean up properly the devargs using rte_devargs_remove() (before calling pci_name_set() of course).

Unfortunately, this could probably happen with some devargs. In particular, the rte_devargs_insert() function will test whether devargs name are matching before inserting a new devargs. There is an ambiguity with the BDF / DBDF notation for the PCI devices in particular, which could lead to devargs already existing for some devices, but with a new one being inserted.

The fix would be to have the bus parse the device, and compare the binary blob resulting from this parsing.
The current bus->parse() API might help, but is currently too limited (no way to know how many bytes long the result is).

This is fastidious to have to write this because some leeway was allowed to the PCI back in the day. For now at least, mind this possibility when writing your fix, those above considerations could maybe be worked out with future API evolutions.


More information about the dev mailing list