[dpdk-dev] [PATCH v2 3/4] net/failsafe: replace local sub-device with shared data

Gaëtan Rivet gaetan.rivet at 6wind.com
Wed Mar 6 11:46:32 CET 2019


On Tue, Mar 05, 2019 at 06:58:04PM +0100, Thomas Monjalon wrote:
> Hi,
> 
> 05/03/2019 18:38, Gaëtan Rivet:
> > >  fs_dev_remove(struct sub_device *sdev)
> [...]
> > > -		rte_eth_dev_close(PORT_ID(sdev));
> > > +		rte_eth_dev_close(edev->data->port_id);
> > 
> > Ok I see. I missed that during the first reading, the private_data is
> > zeroed on dev_close(), so ETH(sdev) becomes invalid here.
> 
> I don't follow you. What do you mean with this comment?
> 

PORT_ID(sdev) uses ETH(sdev).

ETH(sdev) will now point to `&rte_eth_devices[(sdev)->data->port_id]`,
so data->port_id will be zeroed on sdev close.

So once the sub-device has been closed, calling
rte_eth_dev_release_port(ETH(sdev)) is not possible anymore, justifying
the change (taking the reference to ETH(sdev) first, then using it after
shared data has been overwritten).

> > What happens when a primary process closes a device before a secondary?
> > Is the secondary unable to stop / close its own then? Isn't there some
> > missing uninit?
> 
> Is the secondary process supposed to do any closing?
> The device management should be done only by the primary process.
> 
> Note: anyway all this hotplug related code should be dropped
> from failsafe to be replaced by EAL hotplug management.
> 

I don't know, I've never used secondary process.
However, cursory reading the code of rte_eth_dev_close(), I don't see
a guard against calling it from a secondary process?

Reading code like

   rte_free(dev->data->rx_queues);
   dev->data->rx_queues = NULL;

within makes me think the issue has been seen at least once, where
shared data is freed multiple times, so I guessed some secondary
processes were calling it. Maybe they are not meant to, but what
prevents them from being badly written?

Also, given rte_dev_remove IPC call to transfer the order to the
primary, it seems that at least secondary processes are expected to call
rte_dev_remove() at some point? So are they only authorized to call
rte_dev_remove() (to manage hotplug), but not rte_eth_dev_close()? Is
there a specific documentation detailing the design of secondary process
and the related responsibilities in the lifetime of a device? How are
they synching their rte_eth_devices list if they are not calling
rte_eth_dev_close(), ever?

> > This seems dangerous to me. Why not instead allocating a per-process
> > slab of memory that would hold the relevant references and outlive the
> > shared data (a per-process rte_eth_dev private data...).
> 
> Which data do you think should be allocated per process?
> 
> 

[-------- SHARED SPACE --------------] [-- PER-PROCESS --------]
+--------------------------------------------------------------+
| +------------------+                +- rte_eth_devices[n] -+ |
| |rte_eth_dev_data  |<---------------+ data                 | | PRIMARY
| |                  |   +dev_priv-+  |                      | |
| |      dev_private +-->|         |  |                      | |
| |              ... |   |         |  |                      | |
| |          port_id |   |         |  |                      | |
| |                  |   |         |  |                      | |
| |                  |   |         |  |                      | |
| |                  |   |         |  +----------------------+ |
| |                  |   |         |  +- rte_eth_devices[n] -+ |
| |                  |   |         |  |                      | | SECONDARY
| |                  |   |         |  |                      | |
| |                  |   |         |  |                      | |
| |                  |   |         |  |                      | |
| |                  |   +---------+  |                      | |
| |                  |<---------------+ data                 | |
| +------------------+                +----------------------+ |
+--------------------------------------------------------------+

Here port_id is used within fail-safe to get back to rte_eth_devices[n].
This disappears once a device is closed, as all shared space is zeroed.

This means that sometimes ETH(sdev) and PORT_ID(sdev) is correct,
and at some point it is not anymore, once a sub-device has been
closed. This seems dangerous.

I was thinking initially that allocating a place where each sdev would
store their rte_eth_devices / port_id back-reference could alleviate the
issue, meaning that the fail-safe would not zero it on sdev_close(), and
it would remain valid for the lifetime of a sub-device, so even when a
sub-device is in DEV_PROBED state.

But now that I think about it, it could probably be simpler: instead of
using (ETH(sdev)->data->port_id) for the port_id of an sdev (meaning
that it is dependent on the lifetime of the sdev, instead of the
lifetime of the failsafe), the port-id itself should be stored in the
sub_device structure. This structure will be available for the lifetime
of the failsafe, and the port_id is correct accross all processes.

So PORT_ID(sdev) would be defined to something like (sdev->port_id), and
ETH(sdev) would be (&rte_eth_devices[PORT_ID(sdev)]). It would remain
correct even once the primary has closed the sub-device.

What do you think? Do you agree that the current state is dangerous, and
do you think the solution would alleviate the issue? Maybe the concern
is unfounded and the only issue is within fs_dev_remove().

-- 
Gaëtan Rivet
6WIND


More information about the dev mailing list