Tuesday, April 27, 2021

One of Our Buildings Suddenly Went Offline This Weekend

Hi all,

We had a network outage in one of our buildings on the weekend just gone. I wasn't on-call so my colleague had to deal with it.

The building's access layer switches all connect back to a distribution switch stack (a stack of two Cisco Cat 9200L units; yes, I know the 9200L is an access layer switch, but there's barely any load on them). From this switch stack, a cross-stack EtherChannel connects back to our "core", a Cisco C-6509 VSS chassis pair spread across our two main on-site server rooms, using a Layer 2 MEC. Luckily, I recently built a new syslog server, so we do have some logs to help show what happened during this outage. It happened on Sunday 25th April at 4:16am. Here are the syslogs for the switch stack and the core side:

https://github.com/smartiedude/Issues/blob/main/2021-04-25--Syslog.txt-switch-stack1.txt

https://github.com/smartiedude/Issues/blob/main/2021-04-25--Syslog--core-6509-side.txt

I've also attached a GIF of the topology to help you visualize it:

https://github.com/smartiedude/Issues/blob/main/Drawing2.gif

Looking at the switch stack side logs, I can see that both stack members reloaded: Chassis 2, followed by Chassis 1. I have no idea why this happened. There are a few things I don't understand that I was hoping you might be able to help me make sense of:

  • Why did the switch stack suddenly decide to reload on its own at 4am?
  • In the core side logs, the two interfaces on the "core" C-6509 VSS chassis in both server rooms went into an 'error disabled' state. Why is this? There are two log entries on the core side at 04:17:57 that indicate this was because of a "channel-misconfig error", but I don't understand why a switch stack member going down at the other end would suddenly be classified as a misconfig error.
  • Why did STP start flapping on Po22 on the core side? I was under the impression that if one of the Po members dies then STP should remain stable, because Po22 and all its members are considered one individual link.

My colleague didn't quite understand what happened or what caused the outage at the time he was called. All he told me was that he logged into the core side and brought the two downward facing interfaces back up by 'shut', 'no shut' to get them out of err-disabled state on the core side (which you can see in the logs because I've got command archiving being logged too so I can see what commands anyone entered on the CLI) and it all started working again. He didn't know that both chassis had actually reloaded on the downstream building side switch stack until I showed him in the logs afterwards.
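
For anyone unfamiliar with the 'shut', 'no shut' recovery, it would have looked roughly like the sketch below. This is illustrative only and not copied from the logs: the interface name is a placeholder I've made up for this example, and the same steps would be repeated for the second err-disabled member.

```
! Sketch only: <member-interface> is a placeholder, not the real port name.
! List the ports currently in err-disabled state.
show interfaces status err-disabled
!
! Bounce the affected port to clear the err-disabled state.
configure terminal
 interface <member-interface>
  shutdown
  no shutdown
 end
```

Bouncing the port only clears the err-disabled state; it doesn't explain why the port was disabled in the first place, which is what I'm trying to understand.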

Any info, advice or experience is welcomed.

Thank you my friends.


