Last week I removed an SNMP community profile context name from a tenant, leaving only a community string name enabled. That resulted in a null context name being pushed down to the switches, which triggered a bug. Eight hours and a TAC case later, I found out that this was the change that caused snmpd to start dumping core on all my leaves, sending them into reboot loops. Essentially, snmpd would core, the switch would reboot as a result, and when it came back up it would still have the same broken SNMP policy, so the cycle would start all over again. We had to enable no-hap-reset to stop the automatic reboots, and only then were we able to stabilize the fabric. There is an open bug for this, CSCvf89664, and upgrading to 3.1 has apparently fixed the issue. I was on 3.0(1k), and the bug was not open when we did our initial bug scrub. Initial feedback from our account team, among other things, is to get a better process in place for reviewing bugs as they are opened against our running version and before we make changes. That is very reasonable advice, but I'm not sure I can see all the severe bugs I would care about from my CCO account.
Needless to say, people on my team are freaked out, because the narrative becomes "make a simple change, entire fabric starts flapping." I don't want to be walking on eggshells whenever I put in an RFC with our CAB for a simple thing. Maybe this is the reality we need to acclimate ourselves to with central policy servers like the APICs. Two questions: a) have others running ACI experienced similar issues, where the blast radius of a change like this is so enormous? and b) what is the cooler, calmer perspective on keeping ACI around despite challenges like this?