Tuesday, February 6, 2018

Aruba says it's the network: Wireless authentication timeouts

Hey, everyone. I've got a problem I've been working on for a bit and thought I'd toss it out here for any additional insight the community could provide. We have a basic hub-and-spoke network with all of our sites coming off the central data center across WAN links. All of the sites are set up the same way, but I'm seeing an issue at one particular site. We are seeing wifi authentication timeouts from all client types at this site (Chromebooks, iPads, iPhones, Windows 10 laptops), and on the user side it can take anywhere from 0 seconds to 5 minutes for a client to connect to wifi. We are using 802.1X with certificate and PEAP authentication (to tie a user name to a client device). I've looked at our Clearpass (authentication server) and see timeouts occurring at all sites, but I've really only heard complaints from this particular site. I've visited the site myself and it's the only site where I've been able to replicate the issue personally. If I try to connect, about half the time I get a "network not available" message, and then I try to reconnect and it works after one or several more tries.

I worked with Aruba tech support (we have their wifi and Clearpass authentication server in production for wireless) and we did a lot of troubleshooting. We compared the configs of sites that work with the local controller at this site. I've replaced the local controller at the site. I've done iperf3 TCP and UDP tests across the WAN link to this site and compared those results to several of my other sites. I've compared switch and router configs from all sites. I've looked at routing tables, pings, MTRs, jitter, and latency. I've set up a simpler wifi SSID at the site that only uses PEAP, to take certificates out of the mix, and I still have the issue on the simpler SSID. I've done simultaneous packet captures on the local router, the wireless LAN controller, both routers in my data center (which the packets need to cross to reach the authentication server) and, finally, Clearpass itself. After comparing these with Aruba support, we have been able to narrow the problem down to RADIUS UDP responses not getting back to the controller, after which the client's authentication session times out. From what I saw, the packets all reach the Clearpass server, but when we see the timeout, the Clearpass server just never sends the response back. These packets go over UDP 1812, and what we should see is an access-request from the controller (on behalf of the client) and then an access-challenge from Clearpass. This goes back and forth until Clearpass responds with either an access-accept or an access-reject. When we see the timeout happen, Clearpass fails to send anything back at some point during this process. So it might receive an access-request and send an access-challenge back, then there is another round of each, and then the controller sends one more access-request and no access-challenge comes back for that packet. Finally, that session times out and a log entry is generated in Clearpass for that session.
Aruba just keeps saying this is a network issue (of course).
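
To make the request/response pairing above easier to check across several captures, here is a minimal Python sketch that flags access-requests which never got an answer. It is not something Aruba or Clearpass provides. It assumes the RADIUS UDP payloads have already been exported from each capture into simple records, and the `Datagram` record, the function names, and the 5-second timeout are all made up for illustration. The matching rule (same identifier, reverse direction between the same two endpoints) and the packet codes follow RFC 2865.

```python
import struct
from collections import namedtuple

# RADIUS packet codes from RFC 2865.
ACCESS_REQUEST = 1
RESPONSE_CODES = {2: "Access-Accept", 3: "Access-Reject", 11: "Access-Challenge"}

# One captured UDP datagram. Hypothetical record for this sketch:
# ts is seconds, src/dst identify the endpoints (an IP string, or an
# (ip, port) tuple for stricter matching), payload is the UDP payload.
Datagram = namedtuple("Datagram", "ts src dst payload")


def parse_header(payload):
    """Return (code, identifier) from the 20-byte RADIUS header, or None."""
    if len(payload) < 20:
        return None
    code, identifier, length = struct.unpack("!BBH", payload[:4])
    if length < 20 or length > len(payload):
        return None
    return code, identifier


def unanswered_requests(datagrams, timeout=5.0):
    """Return sorted (timestamp, (src, dst, identifier)) for each
    access-request that saw no reply within `timeout` seconds."""
    pending = {}      # (requester, server, identifier) -> first request time
    unanswered = []
    for d in sorted(datagrams, key=lambda d: d.ts):
        # Expire requests that have waited longer than the timeout.
        for key, ts in list(pending.items()):
            if d.ts - ts > timeout:
                unanswered.append((ts, key))
                del pending[key]
        header = parse_header(d.payload)
        if header is None:
            continue
        code, ident = header
        if code == ACCESS_REQUEST:
            # A retransmission keeps the timestamp of the first attempt.
            pending.setdefault((d.src, d.dst, ident), d.ts)
        elif code in RESPONSE_CODES:
            # A reply travels in the reverse direction with the same identifier.
            pending.pop((d.dst, d.src, ident), None)
    # Anything still pending when the capture ends was never answered.
    unanswered.extend((ts, key) for key, ts in pending.items())
    return sorted(unanswered)
```

Running this over the capture taken on Clearpass itself and over the one taken on the controller, then comparing the two lists, would show whether a reply was never generated or was generated and lost on the way back.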

Finally, I did something to "fix" the issue, at least temporarily. I failed all of the APs at this site over to our backup controller, which is located in the data center. When I did this, we stopped seeing the issue at the site and stopped hearing complaints from users. This does nothing to change the topology; the only thing I can think of that changes is where the AP GRE tunnel drops the client traffic off. When the APs terminate on the local controller at the school, all wifi client traffic terminates there onto the local switch. When I fail the APs over to the WLC at the data center, wifi traffic gets dropped off there, clients get a DHCP IP from the data center DHCP pool, and the authentication traffic goes straight from the WLC in the data center over to the Clearpass server (in the same DC).

Here is a diagram showing the differences: DIAGRAM

This made me think our WAN provider might be dropping UDP traffic or something but again, I've done iperf3 tests on UDP 1812 and not dropped any traffic during those tests. The traffic would be tunneled through GRE across the WAN in the "working" configuration and just sent raw as UDP 1812 over the WAN link when I am seeing the issue. This theory contradicts what I'm seeing with Clearpass failing to send responses when I see requests on the packet capture hitting Clearpass, however.

I'm stumped on what to even try next. Any ideas? Sorry for the long post! I just wanted to include all of the details. Thanks in advance for any replies to this thread!


