Tuesday, December 29, 2020

Stackwise Problems on Catalyst 3850s

Hi, I'm wondering whether anyone has seen a problem I've just run into at work and might be able to offer some advice.

We have many Catalyst 3850 stacks in our HQ that we use for access-layer connectivity. We have been upgrading from IOS-XE 16.6.7 to 16.12.4 without issue, having performed over 350 upgrades on this switch model, including 50+ logical stacks. Yesterday we noticed that some APs dropped unexpectedly soon after the last upgrade of the day and traced it back to a 5-switch stack, where a single stack member (neither the master nor the standby) had been removed from the stack. When we consoled to the switch, it was in ROMMON mode.

We disconnected the switch from the stack, copied over the .bin file again, unpacked the file and updated the boot parameters, then rebooted it, and it came up fine on its own in Install mode, as expected. We powered off the stack completely, reconnected the 5th switch's stacking cables and powered it on again, only to find that we now had the master and the 5th switch in the stack, but the other three were now showing as Provisioned, with no MAC address. Again, those switches were sitting in ROMMON even though they had successfully booted and joined the provisioned stack previously. The adjacent stack ports were showing as down, and of course the other stack members were missing from the stack entirely.

We were pretty confused by this point, but we went ahead and manually recovered the other three switches, expecting them all to now boot correctly (as the 5th one did). That actually worked for a moment, but then we saw errors in the logs referring to losing connection with the standby switch (PEER_REDUNDANCY_FAILURE or similar, I'm typing from memory here). A stack member would go from READY to REMOVED, eventually return to INITIALIZING and back to READY, only for a different stack member to move from READY to REMOVED. While this was occurring, a new Standby would be elected and go through the HA Sync process. The result was essentially a cascading failure in which the stack election process repeated over and over, with different individual stack members repeatedly dropping out of and rejoining the stack, almost as if the stack cables were damaged.
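
For anyone trying to reproduce this in a lab, one rough way to quantify the flapping is to tally how often each member goes from READY to REMOVED. The Python sketch below assumes the transitions have already been noted down by hand as (time, member, state) tuples. The sample data and the function name count_removals are made up for illustration and are not taken from our logs.

```python
from collections import Counter

# Hypothetical observations: (seconds since start, stack member number, state).
# Illustrative values only; these are NOT the real log entries from the incident.
observations = [
    (0, 2, "READY"),
    (30, 2, "REMOVED"),
    (90, 2, "INITIALIZING"),
    (150, 2, "READY"),
    (160, 4, "READY"),
    (180, 4, "REMOVED"),
    (240, 4, "INITIALIZING"),
    (300, 4, "READY"),
    (330, 2, "REMOVED"),
]


def count_removals(events):
    """Count READY -> REMOVED transitions per stack member."""
    last_state = {}       # most recent state seen for each member
    removals = Counter()  # member number -> number of drop-outs
    for _, member, state in sorted(events):  # process in time order
        if last_state.get(member) == "READY" and state == "REMOVED":
            removals[member] += 1
        last_state[member] = state
    return removals


removals = count_removals(observations)
for member, count in sorted(removals.items()):
    print(f"switch {member}: dropped out {count} time(s)")
```

If the drop-outs are spread across several members rather than concentrated on one, that would match what we saw and points away from a single bad switch or cable.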

By this point we were getting pretty late into our unplanned working time. After testing with a completely new set of stack cables, and testing with only two switches in a stack and finding the same issue, we gave up, replaced the stack completely with spare switches, and downgraded back to 16.6.7. This time the provisioned stack formed successfully and stayed online, and we spent the rest of the night redeploying configs and testing services.

For tech info - Stack members are numbered and have correct priorities configured (15 down to 11). Stack ports would show as down but then come back up, which seemed erroneous as we saw it with multiple switches and multiple stack cables. We checked and rechecked the IOS packages, cleaned and redeployed the files, verified the boot parameters and changed out the stack cables themselves. Despite having this software revision on hundreds of devices by now, this particular stack just would not behave, and we eventually gave up trying to fix it, swapped all the switches out and deployed on the 16.6.7 code.

Has anyone seen this happen with StackWise 3850s on 16.12.x? Other than the switch platform itself being particularly slow to boot and the log messages, there wasn't really much to go on to explain this stack re-election behavior. We are planning to try to recreate the issue in our lab and escalate it to Cisco via our Cisco partner, but we also know that there are many anecdotal reports of odd behavior with stacks, so we might not get anywhere.

I'd appreciate any insight or similar experiences that might help us understand the most likely cause.


