Friday, April 17, 2020

My banter of the day. Upgrading Cisco Nexus switch.

Hi all,

I would like to share with you the adventure I have had over the last 3 weeks in my attempt to perform a firmware upgrade on a pair of Nexus 5596UP switches.

Background: We have 2 Nexus 5596UP switches with an L3 card installed and multiple vPCs going through them. No FEX devices, but 3750s are hanging off those switches, among other things.

Those switches run EIGRP, BGP, IPv6, PIM and MSDP, to name a few. They had around 650 days of uptime and were both complaining about TCAM exhaustion, among other things. Anyway, we hit a bug where AS paths would not be prepended when routes were redistributed via BGP, so I thought I would take on the challenge of upgrading them.

I write the change, capture the output of show run, ver, spanning tree, vpc, license, arp, arp for all VRFs, ip route and ip route for all VRFs, and begin to upgrade switch 2.
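For anyone wanting to repeat that pre-change capture, here is a rough sketch of how the snapshots could be named and compared after the upgrade. The command strings are my best guess at the full NX-OS forms of the shorthand above, and the helper names (`snapshot_filename`, `diff_snapshots`) are made up for illustration; how the output is actually collected from the switch is left out on purpose.

```python
import difflib

# Assumed full forms of the checks listed above; verify them on your platform.
PRECHECK_COMMANDS = [
    "show running-config",
    "show version",
    "show spanning-tree",
    "show vpc",
    "show license usage",
    "show ip arp",
    "show ip arp vrf all",
    "show ip route",
    "show ip route vrf all",
]


def snapshot_filename(hostname, command, stage):
    """Build a file name such as sw2_pre_show_vpc.txt (hypothetical naming scheme)."""
    slug = command.replace(" ", "_").replace("-", "_")
    return f"{hostname}_{stage}_{slug}.txt"


def diff_snapshots(before, after, label):
    """Return unified-diff lines between pre- and post-change command output."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"{label} (pre)",
            tofile=f"{label} (post)",
            lineterm="",
        )
    )
```

An empty list from `diff_snapshots` means the two captures are identical. Outputs like ARP tables and uptime will always differ a little, so the diff is a starting point for eyeballing, not a pass/fail check.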

I issue the command "install all kickstart bootflash:///kickstart-whatever.bin system bootflash:///whatever.bin" <cr>

It does the calculations, tells me that I am running Layer 3 services and therefore downtime HAS to happen, and I press "y". The switch initiates a reboot. Note that I was doing this remotely, via a console server.

I can see in the console that the reboot process has initiated, and then nothing. I wait 5, 10, 15 minutes and there is no output from the console. I then decide to physically visit the DC.

Entering the comms room, I can see that the switch is powered on, but no LEDs are blinking besides the PSU and the STAT. No console output. After a couple of power cycles, I called Cisco to raise a P1 case. While waiting for the case to be created, I searched for "Cisco nexus not booting" on my phone.

I came up with FN 64094 - Nexus 5596/UCS FI 6296 - System Fails to Boot After a Power Cycle - Workaround Provided
https://www.cisco.com/c/en/us/support/docs/field-notices/640/fn64094.html

I was thinking "sh*t".

TAC entered the WebEx chat and asked me to provide console output. I said there was none, and that I had already power cycled twice with no luck. I then pasted the FN URL in the chat. The engineer responded with something like:

"Please give me a minute to read this"

3 minutes later:

"I will send an RMA unit."

Note that this was a Saturday in a COVID-19 lockdown! Although a bit stressful, this was the most exciting thing that had happened to me in days! I arranged things with the courier and the DC provider, and the device came. I then unboxed it, put the current firmware on it, tried to copy the whole config across (note that this does not work and manual intervention is required), and noticed that we were missing features due to licensing.

I then decide to label all switchports with the label printer and power down the faulty switch. Next I figure out how to unrack that beast: remove the ears from the broken one, put them on the RMA unit, rack the RMA unit, plug all the cables in and power it on. Note that all ports are in shut state at this point because the config is not fully applied yet.

I store the old switch in the racks somewhere, but I cannot store the box since we do not have a cage in that DC, and all the ops staff have left (it is 22:00 local time). So I put it in a corner somewhere, hoping they do not dispose of it.

The next day I report this to my line manager and explain that I need the license installed before I can finish the config. The license arrives, I install it, finish copying and verifying the config, and go physically to the DC. I get there, do a "no shut" on all the ports, and so far so good!

The next morning I get some email alerts showing that the switch has crashed overnight due to:

Reason: Reset performed due to component Error

System version: 7.3(3)N1(1)

Service: SUNNYVALE ASIC FAILURE

I sent the tech-support output to Cisco, which then dispatched another RMA. Note that this was during the Easter weekend: the device was supposed to be sent on Thursday afternoon, but arrived on Monday at around 21:00. So I raised another emergency change to swap the RMA switch with another RMA switch.

I am now in the same place I was before the upgrade. I am in the process of raising a change to preemptively patch Switch1 so it does not hit that field notice. After that I should be able to upgrade the 2 switches successfully.

TL;DR: I tried to upgrade the firmware on a Nexus switch, the device crashed, Cisco sent an RMA which crashed the next day and they sent an RMA for their RMA.


