Friday, April 2, 2021

Postmortem analysis - what was happening here?

Weird problem I don't quite understand:

Two gateways, different providers

Both gateways are connected to a switch

The switch is connected to two SonicWall NSA devices in primary/HA standby mode, both configured identically, with failover/load balancing set to use both gateways

Users started to report slow internet, and a VPN tunnel was resetting itself every minute or two

ping -t to both of the SonicWalls, steady connection, no packet loss

ping -t to the intermediary switch, steady connection, no packet loss

ping -t to 8.8.8.8 and to the primary gateway showed a consistent pattern of about 60 replies, then 8 timeouts, then another 60 replies, repeating ad infinitum.
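
To put numbers on a cycle like this instead of eyeballing the console, the ping output can be collapsed into runs of replies and timeouts. This is only a rough sketch: it assumes English-locale Windows `ping -t` output saved to a text file, and the function names (`parse_ping_lines`, `run_lengths`) are made up for illustration.

```python
from itertools import groupby


def parse_ping_lines(lines):
    """Classify ping output lines: True for a reply, False for a failure.

    Assumes English Windows ping output. A line counts as a reply only if it
    starts with "Reply from" and carries a TTL, because "Destination host
    unreachable" messages also begin with "Reply from".
    """
    results = []
    for line in lines:
        text = line.strip()
        if text.startswith("Reply from") and "TTL=" in text:
            results.append(True)
        elif text.startswith("Request timed out") or "unreachable" in text:
            results.append(False)
        # Anything else (banner, statistics, blank lines) is ignored.
    return results


def run_lengths(results):
    """Collapse consecutive identical outcomes into (outcome, count) pairs."""
    return [(ok, sum(1 for _ in group)) for ok, group in groupby(results)]


if __name__ == "__main__":
    # Synthetic sample shaped like the pattern described above.
    reply = "Reply from 8.8.8.8: bytes=32 time=14ms TTL=117"
    sample = [reply] * 60 + ["Request timed out."] * 8 + [reply] * 60
    for ok, count in run_lengths(parse_ping_lines(sample)):
        print("replies" if ok else "timeouts", count)
```

Running the same thing over captures taken from the PC, the SonicWall, and the switch at the same time would show whether the drops line up across all three vantage points.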

This pattern of packet loss was observed regardless of which host was pinging: a PC, the SonicWall, or the switch between the SonicWall and the gateways. Pinging from the switch to the gateway showed the same issue even though that traffic was not passing through the SonicWall at all.

As I isolated components one by one, I eventually discovered that it was being caused by one of the two SonicWalls. Unplugging that SonicWall fixed the link between the switch and the gateways.

Packet monitoring didn't show anything interesting.

What could possibly have been going on here? No packet storms were seen; something with the SonicWall was just causing the link between the switch and the gateways to drop on a set cycle. There was a period of flickering lights/brownouts, so that's my guess as to what triggered the problem, whatever it was, but I'm curious as to what was actually happening.


