Sunday, September 29, 2019

1-year-ago-me just saved my ass, aka "babbies first core switch failure"

tl;dr - Core switch died while I was halfway across the country. Due to how I built the network a year ago, there was no outage and I looked super cool and competent through my first major device failure.

Friday night I'm hanging out in the United Club lounge at the airport, waiting for my flight to start boarding, when my phone lights up with texts from SolarWinds saying that some stuff has gone down, including a bunch of stuff (like our Edge switches) that made no sense.

So I VPN in, and it works, which was weird because if our edge switches were both down then I wouldn't have been able to connect to the VPN at all. I couldn't get to some devices through the regular network, but I was still able to access them through our Cradlepoint Out-Of-Band cellular backup network, and everything looked like it was still passing traffic just fine.

Initially I was thinking this was a SolarWinds freakout, but after a couple of minutes of checking things I realized that one of the switches in our collapsed core (we have a pair of stacked C9300s that act as both Core and Distribution layer) had died.

But because I'd been neurotic about dual-homing all of our Access layer switches and server switches, and making sure that all other systems that connected to the core were as redundant as possible . . . no one noticed. There was some reduced bandwidth internally, but there was no downtime for anything and aside from us in the IT department, no one knew there was any sort of a problem.
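For anyone who wants to sanity-check their own dual-homing, here is a minimal Python sketch of the idea. Everything in it is made up for illustration: the switch names, the `UPLINKS` inventory, and the `isolated_by_single_failure` helper are hypothetical, not pulled from my network or from any monitoring tool. It just asks "if one core member dies, who gets cut off?"

```python
# Hypothetical inventory: each downstream switch mapped to the core stack
# members it has an uplink to. Names are invented for this example.
CORE_MEMBERS = {"core-1", "core-2"}

UPLINKS = {
    "access-1": {"core-1", "core-2"},
    "access-2": {"core-1", "core-2"},
    "server-sw-1": {"core-1"},  # deliberately single-homed to show a finding
}


def isolated_by_single_failure(uplinks, core_members):
    """Map each core member to the switches that lose every uplink if it fails."""
    at_risk = {}
    for failed in sorted(core_members):
        survivors = core_members - {failed}
        # A switch is cut off when none of its uplinks land on a surviving member.
        cut_off = sorted(
            name for name, members in uplinks.items() if not (members & survivors)
        )
        if cut_off:
            at_risk[failed] = cut_off
    return at_risk


if __name__ == "__main__":
    for failed, switches in isolated_by_single_failure(UPLINKS, CORE_MEMBERS).items():
        print(f"If {failed} dies, these go dark: {', '.join(switches)}")
```

An empty result means every switch in the inventory survives the loss of any single core member, which is the situation I happened to be in on Friday night.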

By this time I'd boarded my flight, but I opened a TAC case from the in-flight wi-fi and once I got back on site Saturday morning I was able to sort out what happened.

It turns out that one of the switches in the core stack had experienced a spontaneous reboot for unknown reasons, but then it stayed down because the "Manual Boot" option was set, so it sat there waiting for someone to tell it to boot. Once I was in the console and issued a boot command, it came back up and everything was hunky-dory. I turned off the manual boot option, cycled it again and we're good.
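If you want to catch this before a reboot does it for you, something like the following Python sketch could scan saved boot-settings output for stack members with manual boot turned on. The sample text is an assumption, loosely modeled on what a per-member boot summary looks like, and `manual_boot_members` is a hypothetical helper; check the format against what your own switches actually print before trusting it.

```python
import re

# Assumed sample output, NOT captured from a real device. The real layout
# may differ by platform and software version.
SAMPLE_SHOW_BOOT = """\
Switch 1
---------
BOOT variable = flash:packages.conf;
Manual Boot = no
Enable Break = yes

Switch 2
---------
BOOT variable = flash:packages.conf;
Manual Boot = yes
Enable Break = yes
"""


def manual_boot_members(show_boot_text):
    """Return the stack member numbers whose 'Manual Boot' line says yes."""
    flagged = []
    current = None
    for line in show_boot_text.splitlines():
        header = re.match(r"\s*Switch\s+(\d+)\s*$", line)
        if header:
            # Remember which stack member the following lines belong to.
            current = int(header.group(1))
            continue
        setting = re.match(r"\s*Manual Boot\s*=\s*(\w+)", line, re.IGNORECASE)
        if setting and setting.group(1).lower() == "yes" and current is not None:
            flagged.append(current)
    return flagged


if __name__ == "__main__":
    for member in manual_boot_members(SAMPLE_SHOW_BOOT):
        print(f"Switch {member} will wait at the bootloader after a reload")
```

How you actually clear the setting depends on the platform and software version, so look that up in the command reference for your hardware rather than taking it from this sketch.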

Lessons learned:

  • Out-Of-Band management networks are super duper awesome and I'm so glad that I put it in place.
  • High-availability is super duper awesome and I'm so glad that I insisted we spend the money on it, rather than cheaping out and crossing our fingers that nothing goes wrong.
  • Some Ethernet serial devices (console servers) might be worth it so I can get into the console remotely, rather than just the management interface.
  • Maybe I'm not as bad at my job as I'm always worried that I am.

