Wednesday, February 10, 2021

IRF fails to come up after power issue

Hi folks,

I am looking to you guys for a bit of insight again. Forgive the length, but I’ve just pulled a double shift working on this issue and I need to process some info, if you’d be willing to bear with me.

Earlier today our site on the West Coast had two blackout events back to back. The first one depleted our UPSes before the power came back on. The switch stack started to boot, and then a second blackout hit. Because the UPS batteries were still depleted, the switches lost power right away. After the next reboot, only switch 1 of the 2 in the stack came back online in the fabric.

The 2 switches are HPE FlexFabric 5945s running in IRF using BFD-MAD. The first switch became pingable as soon as OSPF came up, but the second remained down and unreachable. The IRF on the first switch did not list the second slot, so I had my on-site contact reboot the second switch again. Still nothing. At this point I got HPE on the line.

The HPE tech walked me through some checks and confirmed the config looked good to him. We discovered that we could SSH to the 2nd switch's management IP if my on-site tech plugged into it directly (he didn’t have a serial cable...). So I was controlling his computer and the 2nd switch via a cell/Zoom support session, and the 1st switch using my normal method over the S2S VPN.

We rebuilt the IRF config on both switches from scratch. We reseated cables and restarted switches, and each time we got the same result. The IRF port shows the right config but never goes from DOWN to UP, even though we know the physical ports pass traffic if we drop them from the IRF config. So the cables and ports seem OK. We even get LLDP changes initially on the MAD port, but nothing on the IRF ports.

As a last-ditch effort, the HPE guy dropped BFD-MAD on both switches and saved the config. The second switch had originally had an incorrect PVID on the BFD-MAD port, which we had fixed. His theory was that the fixed config on the second switch was being overwritten by the first switch when the IRF merged. (We get PVID mismatch errors all of a sudden from LLDP on the BFD-MAD port in this case.)

He figuratively threw up his hands and asked me to collect the diagnostic-information for both switches and FTP the logs to him.

Has anyone run into a similar issue before? An IRF port refusing to come up and pass traffic, despite config rebuilds and cable reseating? I feel like these switches are gaslighting me or something, because I’ve not run into an issue this stubborn with them before. I’m not even certain the hardware wasn’t damaged by the power issue, but I feel like I’m grasping at straws.

Thanks for listening. If you have any thoughts that may help diagnose or resolve this, I would be infinitely grateful to you for your expertise.


