Tuesday, July 31, 2018

Weirdest Networking Issue I've Ever Seen

Wanted to bring this up to the community because I'm at a total loss for how to proceed. Has anyone else seen an issue remotely like this?

Background: Running two Cisco 4500-X switches as a VSS to provide connectivity to the wiring closets in one of our buildings and also to other nearby buildings. There are primary and secondary layer 3 links to this VSS from our core sites using EIGRP to provide redundant connections to this particular location. The primary uplink connects to the active 4500 while the secondary connects to the standby 4500.

Scenario: We first noticed the issue shortly after the initial installation, when two-way voice traffic was not working properly on VoIP phones. After extensive troubleshooting we discovered that shutting down one of the links in the port-channel to these downstream switches fixed the issue. As soon as the EtherChannel bundle was restored, the problem resurfaced. It made no difference which link we shut down: the one going back to the active 4500 or the one going back to the standby.

We also soon discovered that the secondary route would not function properly when the primary failed. The secondary still shows up as an EIGRP neighbor and weirdly enough I can still ping/ssh to other devices in our network but hardware routing seems to fail completely and devices can't actually connect to the internet. Problem is fixed as soon as the primary route is restored.

Troubleshooting: We've replaced cables, tested fiber, replaced transceivers, etc. We checked the configuration multiple times but found it is essentially identical to other locations that are working just fine. One of our other buildings is so similar it even has the same floorplan and an identical network design, and the configurations match; this other location has never had a problem. Before anyone suggests I double-check this: the config is not the issue.
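
For anyone who wants to run the same kind of site-to-site comparison on their own gear, here is a minimal sketch of a normalized config diff. It is an illustration only, not the exact process we used: the `SITE_SPECIFIC` patterns and the function names `normalize` and `config_diff` are hypothetical, and it assumes both running configs have already been saved as plain text.

```python
import difflib
import re

# Hypothetical patterns for values expected to differ between sites.
# Extend this list for your own environment (descriptions, VLAN names, etc.).
SITE_SPECIFIC = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"^hostname \S+", re.MULTILINE), "hostname <HOST>"),
]


def normalize(config_text):
    """Mask site-specific values and drop blank lines and '!' separators."""
    for pattern, placeholder in SITE_SPECIFIC:
        config_text = pattern.sub(placeholder, config_text)
    lines = []
    for line in config_text.splitlines():
        stripped = line.rstrip()
        if stripped and stripped.strip() != "!":
            lines.append(stripped)
    return lines


def config_diff(text_a, text_b, name_a="site-a", name_b="site-b"):
    """Return unified-diff lines between two normalized configs."""
    return list(
        difflib.unified_diff(
            normalize(text_a),
            normalize(text_b),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )
```

An empty result means the two configs differ only in the masked values. Anything left in the diff is a real difference worth a second look.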

We broke VSS functionality on these switches and tried rebuilding it from the ground up. We switched the active and standby around. We tried replacing the active with a spare 4500. We tried replacing the standby with a spare 4500. Nothing has worked. The only thing I haven't tried is replacing both switches at once, in case the hardware on both happens to have failed. We only have one spare 4500, so I have yet to do a full replacement to see if that fixes the issue. While there is a limited lifetime warranty on all 4500-Xs, I'd rather exhaust every potential solution before going through the RMA process. Plus, if I performed a full replacement and the issue was still there, I'd feel like a real asshole.

Has anyone ever seen an issue like this? I've been in networking over ten years and never encountered any problem like it.


