Monday, April 30, 2018

Cisco WLC causing strange behavior

I think, anyway.

Backstory - I'm in the middle of a migration between 2 wireless networks at a client - a large manufacturing facility. When I started, they had 2 5508s in a ghetto-HA setup with about 75 APs. I pulled one of the controllers from that group, set up a new SSID on it for their new network and started adding APs. At present, about 80% of their network is connecting to the new SSID on the new APs. The old network is still in use in some areas for at least a few more weeks.

Last wednesday, the controller on the old network started going offline briefly. It lasted about 90 seconds, all APs dissociate and the network interface in the controller does not respond to ping for about 60-90 seconds, then comes back up for 5 -10 seconds, then goes offline again for about another 20-30 seconds. This happens at least once per day, but it's not predictable. It doesn't generate any specific logs that suggest an error, that I've noticed, but I'll reply to this post with an excerpt of the logs I captured.

The really strange thing is that when this happens, at the exact same time, the other controller (we'll call it "new") - which shouldn't be talking to the "old" controller as I removed them from the RF group etc, also goes down for about 30 seconds. It responds to ping and does not lose its APs, but it does not pass any traffic for that time.

The "old" controller is running software 8.0.140 - The APs are 1130s, 1140s, 2602, 2702

The "new" controller is running software 8.3.133 and the APs are all 3802s and 2702s.

I've checked the stats on the switch ports that they're connected to and see nothing strange. Both WLCs have redundant GB uplinks to their switches.

Wireshark isn't showing any strange broadcasts or anything from either controller when it happens, and I don't notice anything else that might cause it. Wired network seems to be unaffected.

Anyone have a guess? Neither controller is under support currently, and new controllers aren't scheduled to be purchased until June. I'm guessing that the 2 WLCs are talking to each other for something and that's why it impacts both, but what it is and why is still a mystery.

The following is an excerpt from the message log on the "old" controller when the outage occurs.

*spamApTask0: Apr 25 08:07:30.232: %LWAPP-6-CAPWAP_SUPP_VER: spam_lrad.c:1874 Discarding discovery request in LWAPP from AP 00:1a:30:c2:ec:e0 supporting CAPWAP *spamApTask6: Apr 25 08:07:29.530: %DTLS-5-ESTABLISHED_TO_PEER: openssl_dtls.c:766 DTLS connection established to 10.1.100.124 *spamApTask0: Apr 25 08:07:27.399: %LWAPP-6-CAPWAP_SUPP_VER: spam_lrad.c:1874 Discarding discovery request in LWAPP from AP 00:1c:b1:07:04:b0 supporting CAPWAP *spamApTask0: Apr 25 08:07:27.399: %LOG-4-Q_IND: capwap_ac_sm.c:7983 The system detects an invalid AP(00:1c:b1:07:04:b0) event (Capwap_configuration_update_request) and state (Capwap_dtls_teardown) combination *spamReceiveTask: Apr 25 08:07:27.369: %CAPWAP-4-INVALID_STATE_EVENT: capwap_ac_sm.c:7983 The system detects an invalid AP(00:1c:b1:07:04:b0) event (Capwap_configuration_update_request) and state (Capwap_dtls_teardown) combination *spamApTask3: Apr 25 08:07:27.362: %CAPWAP-3-ECHO_ERR: capwap_ac_sm.c:7224 Did not receive heartbeat reply; AP: 00:1c:b1:07:04:b0 *spamApTask3: Apr 25 08:07:27.362: %LOG-4-Q_IND: spam_apf.c:290 The sytem is unable to find AP 00:a2:ee:57:5c:d0 entry in the database, could not process delete WLAN ALL MNs message *spamReceiveTask: Apr 25 08:07:26.169: %LWAPP-4-AP_DB_ERR1: spam_apf.c:290 The sytem is unable to find AP 00:a2:ee:57:5c:d0 entry in the database, could not process delete WLAN ALL MNs message *spamReceiveTask: Apr 25 08:07:26.169: %LOG-4-Q_IND: capwap_ac_sm.c:7983 The system detects an invalid AP(f8:0b:cb:b1:a2:00) event (Capwap_configuration_update_request) and state (Capwap_dtls_teardown) combination *spamReceiveTask: Apr 25 08:07:26.168: %CAPWAP-4-INVALID_STATE_EVENT: capwap_ac_sm.c:7983 The system detects an invalid AP(f8:0b:cb:b1:a2:00) event (Capwap_configuration_update_request) and state (Capwap_dtls_teardown) combination *spamApTask3: Apr 25 08:07:26.165: %CAPWAP-3-ECHO_ERR: capwap_ac_sm.c:7224 Did not receive heartbeat reply; AP: 00:a2:ee:57:5c:d0 *spamApTask1: Apr 25 08:07:26.162: %CAPWAP-3-ECHO_ERR: capwap_ac_sm.c:7224 Did not receive heartbeat reply; AP: f8:0b:cb:b1:a2:00 *spamApTask1: Apr 25 08:07:26.162: %LOG-4-Q_IND: spam_apf.c:290 The sytem is unable to find AP 00:21:d8:92:a7:00 entry in the database, could not process delete WLAN ALL MNs message *spamReceiveTask: Apr 25 08:07:25.971: %LWAPP-4-AP_DB_ERR1: spam_apf.c:290 The sytem is unable to find AP 00:21:d8:92:a7:00 entry in the database, could not process delete WLAN ALL MNs message *spamApTask0: Apr 25 08:07:25.962: %CAPWAP-3-ECHO_ERR: capwap_ac_sm.c:7224 Did not receive heartbeat reply; AP: 00:21:d8:92:a7:00 *spamApTask0: Apr 25 08:07:25.723: %LWAPP-6-CAPWAP_SUPP_VER: spam_lrad.c:1874 Discarding discovery request in LWAPP from AP 00:25:45:26:46:00 supporting CAPWAP *spamApTask0: Apr 25 08:07:25.723: %LOG-4-Q_IND: spam_apf.c:290 The sytem is unable to find AP 00:a2:ee:44:35:c0 entry in the database, could not process delete WLAN ALL MNs message *spamReceiveTask: Apr 25 08:07:23.966: %LWAPP-4-AP_DB_ERR1: spam_apf.c:290 The sytem is unable to find AP 00:a2:ee:44:35:c0 entry in the database, could not process delete WLAN ALL MNs message *spamApTask0: Apr 25 08:07:23.962: %CAPWAP-3-ECHO_ERR: capwap_ac_sm.c:7224 Did not receive heartbeat reply; AP: 00:a2:ee:44:35:c0 


No comments:

Post a Comment