Monday, August 16, 2021

What could possibly break a specific subnet across every switch simultaneously and leave other subnets on the same hardware unaffected? (post mortem, I don't want it to happen again)

The environment is a collection of UniFi switches. VLANs are not used. The system has been stable with this configuration for more than 4 years. No configuration changes were made before the system went down. No power or water incidents. The firewall never lost connectivity across a tunnel from another site, but its configuration page could not be accessed from the LAN, even though it could be pinged. Restarting the firewall did not resolve the problem.

Devices on one subnet and only one subnet repeatedly lost connectivity for a lengthy period of time.

Devices on other subnets, on the same physical wiring, had no connectivity issues at all.

For example:

A PC with an IP address of 10.1.1.10 is plugged into a VoIP phone that has an IP address of 10.10.1.15. The phone worked; the PC did not.

A PC with an IP address of 10.1.1.15, plugged into a network jack with a straight run to switch 1 (out of 5), had no connectivity. Change the IP address of the PC to 10.10.1.20 and it works flawlessly.
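To make the pattern in the two examples above explicit, here is a minimal sketch that groups the observed addresses by subnet using Python's standard `ipaddress` module. The /24 prefix length is an assumption (the subnet masks are not stated here), and the `group_by_subnet` helper and device labels are only illustrative names.

```python
import ipaddress
from collections import defaultdict

# Observations from the examples above: (device, IP address, had connectivity?)
OBSERVATIONS = [
    ("PC behind VoIP phone", "10.1.1.10", False),
    ("VoIP phone", "10.10.1.15", True),
    ("PC on straight run to switch 1", "10.1.1.15", False),
    ("Same PC after re-addressing", "10.10.1.20", True),
]

# Assumption: /24 masks. The actual masks are not given in the post.
PREFIX_LEN = 24


def group_by_subnet(observations, prefix_len=PREFIX_LEN):
    """Map each subnet (as a string) to a list of (device, ok) results."""
    groups = defaultdict(list)
    for device, ip, ok in observations:
        # strict=False lets us derive the network from a host address.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(net)].append((device, ok))
    return dict(groups)


if __name__ == "__main__":
    for subnet, results in group_by_subnet(OBSERVATIONS).items():
        working = sum(1 for _, ok in results if ok)
        print(f"{subnet}: {working} of {len(results)} observed devices working")
```

Under that assumption, every failing device falls in one subnet and every working device in the other, regardless of which physical port or switch it used.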

Rebooting the Windows server where DHCP lives brought the subnet back up immediately, then 3 minutes later it died again.

Rebooting the SonicWall got the subnet working again, then it died after about 3 minutes.
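If this happens again, it might help to measure that "about 3 minutes" precisely, since a consistent interval could be compared against timer values in the devices involved. The following is a minimal sketch, assuming timestamped probe results (for example, from a ping loop run on an affected PC) are collected separately. The `up_periods` function and the sample format are hypothetical, not part of any tool used here.

```python
from typing import Iterable, List, Tuple

# One sample: (seconds since logging started, did the probe succeed?)
Sample = Tuple[float, bool]


def up_periods(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (first_success, first_failure_after) for each up period.

    An up period that is still ongoing when the samples end is not
    reported, because its length is not yet known.
    """
    periods = []
    up_since = None
    for t, ok in samples:
        if ok and up_since is None:
            # Connectivity just came back.
            up_since = t
        elif not ok and up_since is not None:
            # Connectivity just dropped: close the current up period.
            periods.append((up_since, t))
            up_since = None
    return periods


def durations(periods: List[Tuple[float, float]]) -> List[float]:
    """Length in seconds of each completed up period."""
    return [end - start for start, end in periods]
```

Feeding it a log that spans several reboot-and-fail cycles would show whether the subnet dies after the same interval every time or only approximately.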

No excessive traffic spikes. Nothing that looked like a packet storm or a routing loop anywhere.

Eventually I rebooted every switch simultaneously using the Ubiquiti Gen 2 CloudKey, even the switch that has nothing on it but the subnet that never went down. After all of the switches finished restarting, the problem was gone completely and the subnet stayed up.

I've been digging through the device logs and see the switches losing connectivity with the CloudKey, but other than that I haven't found anything unusual yet.

It seems to be some combination of factors, but how do I identify the specific combination that caused the problem? I've seen hardware fail: you identify the failing component, fix or replace it, and life goes on. But there isn't any hardware that actually failed that I can identify as the problem. I've seen packet storms, loops and other faults, but those generate traffic spikes that stand out like a sore thumb. So if not the modem, firewall, server, specific switch or specific access point, what else could this be?


