[dpdk-dev] rte_eth_bond: Problem with link failure and 8023AD

Kyle Larose klarose at sandvine.com
Thu Nov 23 17:04:13 CET 2017


Hello,

I've been testing a LAG implemented with the DPDK eth_bond PMD. As part of my fault-tolerance testing, I want to ensure that if a link flaps up and down continuously, the impact on service is minimal. What I found is that the LAG is rendered inoperable if one particular link is the one flapping. Details below.

Setup:

- 4x 10G X710 links in an 802.3ad LAG connected to a switch.

- Under normal operation, the LAG is steady, traffic is balanced, etc.

Problem:
If I take down the switch port corresponding to the "aggregator" link in the DPDK LAG, then bring it back up, every link in the LAG goes from distributing to not distributing and back to distributing. This causes an unnecessary loss of service.
A single link failure, regardless of whether it is the aggregator link, should not change the state of the other links. Consider what would happen if there were a hardware fault on that link, or its signal were bad: it could be stuck flapping up and down indefinitely. That would lead to a complete loss of service on the LAG, despite there being three stable links remaining.
Analysis:
- The switch shows the system id changing when the link flaps: it goes from 00:00:00:00:00:00 to the aggregator's MAC. This is not good. Why is it happening? By default we seem to be using the "AGG_BANDWIDTH" selection algorithm, which is broken: it takes a slave index and uses it directly as an index into the 802.3ad ports array, which is keyed by DPDK port number. It should translate the slave index into a DPDK port number via the slaves[] array (see the sketch after this list).
- Aside from the above, the default is supposed to be AGG_STABLE, according to bond_mode_8023ad_conf_get_default(). However, bond_mode_8023ad_conf_assign() does not actually copy out the selection algorithm, so it is left at 0, which happens to be AGG_BANDWIDTH (also covered in the sketch below).
- I fixed the above, but still faced two more issues:
  1) The system ID changes when the aggregator changes, which can lead to the same problem.
  2) When the link fails, it is "deactivated" in the LAG via bond_mode_8023ad_deactivate_slave(). There is a block in there dedicated to the case where the aggregator is disabled: it explicitly unselects each slave sharing that aggregator, which causes them to fall back to the DETACHED state in the mux machine -- i.e. they stop aggregating entirely until the state machine runs through the LACP exchange with the partner again.
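
To make the first two bugs concrete, here is a minimal, self-contained model of the fixes. The types and function names below are simplified stand-ins of my own, not the real driver definitions (those live in the bonding driver's private headers); only mode_8023ad_ports, active_slaves and agg_selection follow my reading of the 17.11 sources, so treat this as a sketch rather than a patch:

#include <stdint.h>

#define MAX_PORTS 32

/* Simplified stand-ins for the driver's structures (illustrative only). */
enum agg_selection { AGG_BANDWIDTH, AGG_COUNT, AGG_STABLE }; /* 0 == AGG_BANDWIDTH */

struct port {
        uint8_t aggregator_port_id; /* DPDK port id of this port's aggregator */
};

struct mode8023ad_conf  { enum agg_selection agg_selection; /* timers, etc. */ };
struct mode8023ad_state { enum agg_selection agg_selection; /* timers, etc. */ };

/* Keyed by DPDK *port id*, not by slave index. */
static struct port mode_8023ad_ports[MAX_PORTS];

struct bond_dev_private {
        uint8_t active_slave_count;
        uint8_t active_slaves[MAX_PORTS]; /* slave index -> DPDK port id */
};

/* Bug 1: the conf-assign step never copies the selection algorithm, so
 * the AGG_STABLE default set by bond_mode_8023ad_conf_get_default() is
 * lost and the zero-initialized state means AGG_BANDWIDTH. One line: */
static void
conf_assign(struct mode8023ad_state *mode4, const struct mode8023ad_conf *conf)
{
        /* existing timer/queue field copies go here */
        mode4->agg_selection = conf->agg_selection; /* the missing copy */
}

/* Bug 2: the selection loop indexes mode_8023ad_ports[] with the slave
 * index directly; it has to translate through the slaves[] array: */
static struct port *
aggregator_candidate(struct bond_dev_private *internals, uint8_t slave_idx)
{
        /* buggy version: return &mode_8023ad_ports[slave_idx]; */
        return &mode_8023ad_ports[internals->active_slaves[slave_idx]];
}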

Possible fix:
1) Change bond_mode_8023ad_conf_assign() to actually copy out the selection algorithm.
2) Ensure that all members of a LAG advertise the same system id, i.e. the LAG's MAC address (see the sketch after this list).
3) Do not detach the other members when the aggregator's link goes down.
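
For fix 2, the change amounts to stamping every slave's actor system id with the bond device's own MAC wherever the actor parameters are set up (slave activation, aggregator reselection). A hedged sketch: set_actor_system() is a hypothetical helper and struct actor_params is a stand-in for the driver's internal state, but ether_addr_copy() and bond_dev->data->mac_addrs are the real DPDK API and field:

#include <rte_ethdev.h>
#include <rte_ether.h>

/* Stand-in for the per-slave actor parameters (the real definition is
 * in the bonding driver's private headers). */
struct actor_params {
        struct ether_addr system; /* actor system id carried in LACPDUs */
};

/* Hypothetical helper, to be called wherever a slave's actor state is
 * (re)initialized. */
static void
set_actor_system(struct actor_params *actor, struct rte_eth_dev *bond_dev)
{
        /* Always the bond device's MAC, fixed at init -- never the MAC of
         * whichever slave happens to be the aggregator -- so the system id
         * the partner sees cannot change when the aggregator does. */
        ether_addr_copy(&bond_dev->data->mac_addrs[0], &actor->system);
}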

Note:

1) We should fix AGG_BANDWIDTH and AGG_COUNT separately.

2) I can't see any reason why the system id should equal the MAC of the aggregator. It is intended to represent the system to which the LAG belongs, not the aggregator itself; the aggregator is represented by the operational key. So it should be fine to use the LAG's MAC address, which is fixed at init, as the system id for all possible aggregators.
3) I think not detaching is the correct approach. Nothing in my reading of 802.1Q or 802.1AX's LACP specification implies we should do this. There is a blurb about changes in parameters leading to a change in aggregator forcing the unselected transition, but I don't think that needs to apply here; I'm fairly certain it is talking about changing the operational key, etc. A sketch of what the deactivate path could do instead follows.
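
To illustrate what fix 3 could look like, here is a pseudocode-level sketch of the deactivate path, reusing the simplified types from the first snippet. selection_logic() is the driver's existing selection function as I read it in 17.11 (declared here only so the sketch stands alone); whether it is safe to call it without first unselecting the peers is exactly the question I'm raising:

/* Existing driver function (17.11 signature, from my reading). */
void selection_logic(struct bond_dev_private *internals, uint8_t slave_id);

static void
deactivate_slave_sketch(struct bond_dev_private *internals,
                uint8_t failed_port_id)
{
        uint8_t i;

        for (i = 0; i < internals->active_slave_count; i++) {
                uint8_t pid = internals->active_slaves[i];

                if (mode_8023ad_ports[pid].aggregator_port_id != failed_port_id)
                        continue;

                /* Current behaviour, which this proposal drops:
                 *     mode_8023ad_ports[pid].selected = UNSELECTED;
                 * That sends the mux machine to DETACHED and stops
                 * distribution on links that are perfectly healthy. */

                /* Proposed: leave the slave selected and only re-run the
                 * aggregator selection so it attaches to a survivor. */
                selection_logic(internals, pid);
        }
}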


How does everyone feel about this? Am I crazy in requiring this functionality? What about the proposed fix: does it sound reasonable, or am I going to break the state machine somehow?

Thanks,

Kyle

