The company I work for now is the only global company I've ever worked for [in IT anyway], so I'm only familiar with my own company's processes etc. I'm curious how other folks in similar environments handle the struggle I'm about to discuss below.
Like everyone else, we periodically have to upgrade the Nexus 5Ks that connect the servers in our DCs. A typical server has its primary link connected to Switch01 and its redundant [secondary] link connected to Switch02.
We fill out a form to request the business and technical approvals for the upgrade. We populate a spreadsheet listing all of the hosts connected to these particular switches. Then we assign tasks to each server group [Windows/Linux/Unix/AS400 etc.] to check their hosts and make sure both NIC connections are online. Assuming we get all the approvals AND the various teams check their servers and give us the thumbs up, we schedule the upgrade for whatever weekend.
There are inherent issues with our process as it stands. The way I get the list of servers on any given switch is to dump the interface descriptions, which is not very scientific. I'm almost certain the server teams only check that both NICs are connected. What isn't necessarily being verified is whether they are connected to both switches, as they should be. We found a case where a server was connected to the same switch twice.
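To make that check a bit more systematic, here is a rough Python sketch of how the description dumps from both switches could be cross-checked. It is only an illustration under assumptions of mine: the input is already parsed into (interface, description) pairs per switch, every description starts with the server's hostname, and the names `host_from_description` and `audit_redundancy` are made up for this example. It also trusts the descriptions to be accurate, which is exactly the weak point of the current process.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input shape: switch name -> list of (interface, description).
SwitchPorts = Dict[str, List[Tuple[str, str]]]


def host_from_description(description: str) -> str:
    """Assumption: the first word of the description is the hostname."""
    parts = description.split()
    return parts[0].lower() if parts else ""


def audit_redundancy(ports: SwitchPorts, expected: List[str]) -> Dict[str, List[str]]:
    """Return hosts that miss a switch or have several links on one switch."""
    seen = defaultdict(lambda: defaultdict(list))  # host -> switch -> interfaces
    for switch, entries in ports.items():
        for interface, description in entries:
            host = host_from_description(description)
            if host:  # skip ports with an empty description
                seen[host][switch].append(interface)

    problems: Dict[str, List[str]] = {}
    for host, by_switch in seen.items():
        issues = [f"no link on {sw}" for sw in expected if sw not in by_switch]
        for switch, interfaces in by_switch.items():
            if len(interfaces) > 1:
                issues.append(
                    f"{len(interfaces)} links on {switch}: {', '.join(interfaces)}"
                )
        if issues:
            problems[host] = issues
    return problems


if __name__ == "__main__":
    # Made-up sample data, not taken from a real switch.
    sample = {
        "Switch01": [
            ("Eth101/1/1", "web01 primary"),
            ("Eth101/1/2", "db01 primary"),
            ("Eth101/1/3", "db01 secondary"),
        ],
        "Switch02": [("Eth102/1/1", "web01 secondary")],
    }
    for host, issues in audit_redundancy(sample, ["Switch01", "Switch02"]).items():
        print(host, "->", "; ".join(issues))
```

With the sample data, db01 is flagged both for having no link on Switch02 and for having two links on Switch01, which is the same kind of miscabling we found, while web01 passes.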
So, in a nutshell, I'm curious what tools or processes other folks use to ensure that when they take a switch down, everything fails over to the other switch without too many issues.
One process we are adding is a failover "test" the weekend before the actual upgrade. We would shut down one FEX at a time and see whether any hosts go offline. If they do, we can have them back up within seconds and then diagnose why the failure happened.
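The comparison step of that test could be sketched roughly like this in Python. It is a sketch under assumptions, not our actual tooling: the reachability probe is passed in as a function (a ping or TCP check in practice), and `snapshot` and `lost_during_test` are names I made up for the example.

```python
from typing import Callable, Dict, Iterable, List


def snapshot(hosts: Iterable[str], is_up: Callable[[str], bool]) -> Dict[str, bool]:
    """Record the up/down state of every host using the supplied probe."""
    return {host: is_up(host) for host in hosts}


def lost_during_test(before: Dict[str, bool], during: Dict[str, bool]) -> List[str]:
    """Hosts that answered before the FEX shutdown but not while it was down."""
    return sorted(
        host for host, up in before.items() if up and not during.get(host, False)
    )


if __name__ == "__main__":
    hosts = ["web01", "db01", "app01"]  # made-up host names
    # Stand-in probes for the example; a real probe would ping or open a socket.
    before = snapshot(hosts, lambda h: h != "app01")
    during = snapshot(hosts, lambda h: h == "web01")
    print(lost_during_test(before, during))
```

Taking a snapshot before the shutdown matters because a host that was already down (app01 in the example) should not be blamed on the failover test. Only db01 is reported here.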