Wednesday, December 12, 2018

Well I finally did it...

I took down an entire building. (yay)

I am working as a wireless netadmin in higher ed. We are currently running an Aruba 6.x environment, and I am building out an 8.x environment side by side. My new controllers are on our production subnet (first mistake, I guess, but I don't have a full test environment).

It was already a busy day. I spent the majority of it with TAC trying to get rid of an inherited VLAN on my controllers, and removing it forced an almost complete rebuild of the environment. My boss left early to get to a doctor's appointment and it was nearing the end of the day. I figured I could probably leave right at 5, get home at a nice time and enjoy my night.

So I'm finishing the rebuild by adding my controllers back to the cluster and, because of our use of CoA, I needed to add two more VIPs, one for each member in the cluster. I have a dedicated subnet for this, but because I was getting incorrect information into RADIUS, I decided to try using the same subnet as my controllers for it...just as a test (second mistake).

So my controllers are on x.x.x.3 and x.x.x.4 with their VIP on .2. I figured I would just use .5 and .6 for the other two addresses to test. I ping the addresses and nothing responds, so I think I'm good to go (third mistake). I add two more SVIs on the controllers with .5 and .6 and give it a whirl. Same issue with RADIUS, so I pack up my stuff and get ready to leave.

...Then I get an e-mail from Airwave.

Triple digit number of APs reported as down.

What?

I log into the router, gateway is up. I do a source ping to the controllers from the gateway, all good. I ping some of the APs. Nothing. I log into the production controller, zero APs listed. I can't remember exactly what I said, but I think it was "Oh darn!" or something like that. I log into the master and I'm greeted with pages of APs "upgrading." Oh god. I look to my AOS8 controller and lo and behold, there they are.

What is going on?!

All the APs are now terminating to my test setup. I didn't change a profile, I didn't touch an lms-ip, so what could possibly be going on? I check the provisioning profile and it's pointing to the master, with the TFTP IP pointing to the production controller. All good there. I dig down into the system profile and I find it. The LMS-IP (in Aruba land, that's where the APs terminate their traffic) is set to .5, even though no controller was using that address. Well, at least not until I decided to make .5 an active interface. This was before my time, but our procedure is to boot all APs to the master and then provision them to the local controller.

So I immediately shut down the ports on my 8.x controllers and the APs downgrade and reattach themselves to the production controllers.

All in all, about 20 minutes of downtime but, lucky for me, no tickets generated.
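The underlying lesson is that a ping only shows nothing is answering on an address right now; it says nothing about whether some config already points at it. Below is a minimal Python sketch of a pre-flight check that searches exported config text for references to the addresses you plan to use. Everything in it is an assumption for illustration: the function name `find_ip_references`, the sample config lines, and the 192.0.2.x documentation addresses standing in for the real subnet. It is not the procedure actually used in this environment.

```python
import ipaddress
import re

# Matches dotted-quad candidates that are not part of a longer number or address.
IPV4_PATTERN = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ip_references(config_text, candidates):
    """Map each candidate IP to the (line number, line) pairs that mention it."""
    wanted = {str(ipaddress.ip_address(ip)) for ip in candidates}
    hits = {ip: [] for ip in wanted}
    for number, line in enumerate(config_text.splitlines(), start=1):
        for token in IPV4_PATTERN.findall(line):
            try:
                normalized = str(ipaddress.ip_address(token))
            except ValueError:
                continue  # looked like an address but is not a valid one
            if normalized in wanted:
                hits[normalized].append((number, line.strip()))
    return hits


# Hypothetical config export; the real one would come from the controllers.
sample_config = """\
ap system-profile "default"
    lms-ip 192.0.2.5
interface vlan 10
    ip address 192.0.2.3 255.255.255.0
    description uplink to 192.0.2.50
"""

for ip, lines in sorted(find_ip_references(sample_config, ["192.0.2.5", "192.0.2.6"]).items()):
    status = "REFERENCED" if lines else "no references found"
    print(ip, status, lines)
```

An address that shows up as referenced deserves a closer look before it goes on an interface, even if it does not respond to ping. A clean result only covers the configs that were actually exported and searched, so it is a sanity check rather than a guarantee.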

First networking job. Gave myself a heart attack, but now I can laugh about it and tell all you fine folks about it. Hope you enjoyed the story.


