Friday, April 17, 2020

First Career Core L2/L3 Network Replacement - Datacenter/Campus Shutdown and Startup

Hey All,

I'll be performing the first collapsed core shutdown of my career. It's a hybrid campus and datacenter mix in a small/mid-size environment with a NetApp, UCS, and VMware stack. I wanted to reach out to see what the community would recommend watching out for during the shutdown and startup while replacing the L2/L3 core. Leaning on the experience of the community would help me out a lot, especially while the network is so critical in the COVID-19 world we live in today.

I'm the only senior member of my team with enough knowledge of all the systems to perform this maintenance window, and I don't have a change control or peer review partner to lean on at the moment because some of my team's headcount hasn't been filled. That has put a lot of workload on me, which to be honest has taught me a lot: I've improved my storage, virtualization, networking, systems, and Ansible/automation skills to be able to support a datacenter/campus.

I'll be replacing a 6504 VSS stack and Nexus 5K distribution with Nexus 9Ks and reconnecting the access layer and datacenter (UCS and NetApp) port-channels. These 6504s have been running a bit too long :) and with EoL next year I'm glad to replace them and get updated code. At the same time I'll be removing VTP from the switches and re-configuring the VLANs on the access layer (the exact same VLANs, to minimize changes).
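Since the goal is to end up with the exact same VLANs after VTP is removed, one way to check that is to diff the VLAN table captured before and after the change. The following is only a minimal sketch: it assumes the captures are saved `show vlan brief` text where each VLAN row starts with the VLAN ID followed by the name, and the sample tables and VLAN names are made up for illustration.

```python
import re

# Matches a VLAN row such as "10   USERS   active   Gi1/0/1".
# Assumption: column 1 is the VLAN ID and column 2 is the VLAN name.
VLAN_ROW = re.compile(r"^(\d{1,4})\s+(\S+)")


def parse_vlans(show_vlan_text):
    """Return {vlan_id: name} for every VLAN row found in the text."""
    vlans = {}
    for line in show_vlan_text.splitlines():
        match = VLAN_ROW.match(line)
        if match and 1 <= int(match.group(1)) <= 4094:
            vlans[int(match.group(1))] = match.group(2)
    return vlans


def diff_vlans(before, after):
    """Compare two {vlan_id: name} maps and report what changed."""
    shared = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "extra": sorted(set(after) - set(before)),
        "renamed": sorted(vid for vid in shared if before[vid] != after[vid]),
    }


# Hypothetical captures, for illustration only.
OLD_CORE = """
VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
1    default                          active    Gi1/0/1
10   USERS                            active
20   SERVERS                          active
30   VOICE                            active
"""

NEW_CORE = """
VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
1    default                          active    Eth1/1
10   USERS                            active
20   SERVER                           active
99   NATIVE                           active
"""

if __name__ == "__main__":
    print(diff_vlans(parse_vlans(OLD_CORE), parse_vlans(NEW_CORE)))
```

An empty result in all three lists would suggest the VLAN database carried over unchanged. It says nothing about trunk allowed lists or port-channel membership, which would need their own checks.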

My current plan is as follows, for a converged infrastructure in a non-certified "FlexPod":

  1. Make sure all documentation for IPAM, management addresses, and non-domain/Active Directory passwords is in a location that stays accessible while the network is down. Export the password vault and all SSH private keys to an encrypted offline file. Also keep TFTP copies of the startup/running configs for all Cisco devices and UCS, plus the VMware config export, RVTools export, NetApp sysconfig, and Palo Alto firewall export.
  2. Shut down all VMs in VMware, including the domain controllers (hosting DNS, DHCP, and AD).
  3. Shut down the ESXi hosts on Cisco UCS, but leave the UCS FIs running.
  4. Leave NetApp cluster (A700 and FAS8200) running.
  5. Replace the core L2/L3 network with the new Nexus 9Ks, using a logical migration of the configuration (the part I'm most confident in, as most of my experience is in networking).
  6. Once L2/L3 is back online, start up the ESXi hosts/blades via the UCS FI management HTTPS interface; they mount their NFSv3 datastores from the NetApp (which hasn't been shut off).
  7. Connect to each ESXi host's management HTTPS interface with the offline local ESXi admin account and start up the domain controllers (bringing up DNS, DHCP, and AD services).
  8. Once AD/DNS is running, bring up the rest of the environment/services and test everything with the whole team.
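Step 1 only helps if the offline copies really exist before anything is powered off, so a pre-flight check could be scripted along these lines. This is a rough sketch, not a definitive tool: the file names in `REQUIRED_ARTIFACTS` are hypothetical placeholders for the exports listed in step 1, and the directory would be wherever the offline copy lives (a laptop or USB drive, for example).

```python
from pathlib import Path

# Hypothetical file names, one per export mentioned in step 1.
# Replace them with the real names used for the offline copies.
REQUIRED_ARTIFACTS = [
    "ipam_export.csv",
    "password_vault_export.enc",
    "cisco_running_configs.tar",
    "ucs_backup.xml",
    "vmware_config_export.tgz",
    "rvtools_export.xlsx",
    "netapp_sysconfig.txt",
    "paloalto_config_export.xml",
]


def missing_artifacts(offline_dir, required=REQUIRED_ARTIFACTS):
    """Return the required files that are absent or empty in offline_dir."""
    root = Path(offline_dir)
    problems = []
    for name in required:
        path = root / name
        # A zero-byte file usually means an export or copy silently failed.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Assumption: the offline copies sit in the current directory.
    for name in missing_artifacts("."):
        print(f"MISSING OR EMPTY: {name}")
```

The check only confirms that each file is present and non-empty. Opening the encrypted vault export and one or two configs by hand, on the machine that will be used during the outage, is still worth doing.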

That's my rough/high-level playbook. I'll be labeling all fiber/Cat6 mgmt/data ports ahead of time so that re-patching is easy, fast, and stress free.
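If the labels are kept in a simple cable map (label, old port, new port), the map can be checked for mistakes before the window, such as two cables assigned to the same new port. A minimal sketch follows; the labels and port names in `CABLE_MAP` are invented for illustration and do not describe the real environment.

```python
from collections import Counter


def find_patching_conflicts(cable_map):
    """Check a list of (label, old_port, new_port) tuples for duplicates."""
    label_counts = Counter(label for label, _, _ in cable_map)
    new_port_counts = Counter(new_port for _, _, new_port in cable_map)
    return {
        "duplicate_labels": sorted(
            label for label, count in label_counts.items() if count > 1
        ),
        "duplicate_new_ports": sorted(
            port for port, count in new_port_counts.items() if count > 1
        ),
    }


# Hypothetical cable map, for illustration only.
CABLE_MAP = [
    ("UCS-FI-A-1", "N5K-1 Eth1/1", "N9K-1 Eth1/1"),
    ("UCS-FI-B-1", "N5K-2 Eth1/1", "N9K-2 Eth1/1"),
    ("NETAPP-A-e0a", "N5K-1 Eth1/5", "N9K-1 Eth1/5"),
    ("ACCESS-SW1-UP1", "6504-1 Te1/1", "N9K-1 Eth1/5"),  # clashes with the row above
]

if __name__ == "__main__":
    print(find_patching_conflicts(CABLE_MAP))
```

Both lists coming back empty means every label is unique and no new port is claimed twice; it does not prove that each cable is mapped to the right port.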

I have a couple of Fluke fiber cleaning kits to use while disconnecting all the SMF/MMF fiber, and I have extra SMF/MMF patch cords and Cat6 Ethernet patch cords as well. At this point I feel like I'm ready and have nothing else to prepare, so I figured talking to the community would be the best course of action right now.

I'd be extremely grateful for any feedback, insight, and advice on the plan I've laid out, and for your experience doing similar work.


