Tuesday, March 16, 2021

Interesting LACP failure yesterday

Topology

The issue was reported as: "everything works normally except for one container (which had just been started) can't get a DHCP address"

Both DHCP relays and both DHCP servers seem to be operating normally. PCAP at the DHCP server shows DISCOVERs and OFFERs, but no REQUEST or ACK associated with this MAC.

PCAP at the server's bond0 shows the same thing: Only DISCOVERs and OFFERs. But there is one interesting peculiarity: the DISCOVERs seem to come in back-to-back pairs, with only about 0.1 - 0.2 ms between them. Could there be a loop?

The switch shows the container MAC is learned via the server-facing LACP-based aggregate, so that's okay.

The bridge in the server shows the container MAC is learned via the bond0 interface ... toward the switch ... What?

The switch is not logging MACFLAP messages.

Then I noticed that the aggregate is operating with only one link ... according to the switch. But two links according to the server.

On the switch side, Gi2/0/x is not aggregated, but rather operating as a standalone member of the VLAN.

On the server side, both links are participating in the bond (according to /proc/net/bonding/bond0).

Some more PCAPs to validate cabling (LLDP messages) proves the cables are all in the expected holes.

Then I noticed that in /proc/net/bonding/bond0, the problem link shows speed: unknown and duplex: unknown. So... That's interesting?

PCAPs show no LACP PDUs coming from the server on the problem link. The switch was right to pull it from the aggregation.

I bounced the problem link from the switch side. No change.

Ultimately, what seems to have happened here, is the bonding driver on the server failed, splitting the aggregation, but only on the switch side.

Normal (unicast) server/container traffic was relatively unaffected due to flow hashing on the server side: The various containers (each with unique MAC/IP) hashed to one link or the other on egress from the server, so the switch L2 table has learned them via either the aggregate or the problem link 2/0/x.

In the opposite direction, the bond interface doesn't care which leg is used by the switch.

The only problem case was the DHCP discover (a broadcast frame): Because the switch flooded that frame to all ports, the Linux bridge heard it return via the other leg of the split aggregation. The Linux bridge re-learned the container MAC via bond0, and dropped the (Ethernet unicast) DHCP OFFER on ingress because "that destination is reachable via the ingress interface".

I disabled the problem link and everything started operating normally. After an hour or so, I re-enabled it to gather some more data, and LACP brought it into the bundle immediately. The bond driver is working now, apparently. Hilarious



No comments:

Post a Comment