Thursday, May 9, 2019

Since Friday have experienced multiple power/network failures. 95% back, can't get *some* Macbooks onto Wireless

Hi all -

Really strange one here that has me starting to pull my hair out. A disclaimer: I am not a network admin, but I am in contact with our network admins, so I can relay things back and forth with them.

The short story is that on Friday we had two unexpected power-downs while our UPS was down for maintenance. We lost all power to networking/server gear. I had to replace one Cisco 3750 in a 4 port Stackwise stack, but that was the only piece of gear we lost.

Sunday we had a building-wide power outage, but everything failed over to UPS, then generator, as it was supposed to.

Tuesday we had a switching loop develop on one of the floors that somehow took our network down completely.

It's been a bad week.

The one outlying issue since the initial outages on Friday (it was already happening on Monday, before the entire network dropped on Tuesday) is that I have a group of MacBooks that are unable to receive an address on the floor where we had to replace the switch. Only MacBooks and one iPhone appear to be affected. The models are the 2015 MBP and the 2018 MBP w/Touchbar, and the affected iPhone is the iPhone 7s, I think.

Edit: This has only been an issue since the outage. These MacBooks were always able to connect to wireless properly prior to the outage.

The MacBooks themselves return a 169.x.x.x address, but I don't think DHCP is the culprit because these MacBooks ALL connect on other floors, where the exact same infrastructure is in place. Additionally, I manually set an IP address and confirmed I had network connectivity where I sit, but as soon as I go to the floor with the issue, I lose all connectivity.
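For anyone who wants to check this in a script rather than by eye: a self-assigned address normally falls in the IPv4 link-local range 169.254.0.0/16, which a client picks for itself when it gets no DHCP answer. A minimal sketch using Python's standard `ipaddress` module follows; the function name `is_self_assigned` is just something I made up for illustration.

```python
import ipaddress


def is_self_assigned(addr: str) -> bool:
    """Return True if addr is an IPv4 link-local (169.254.0.0/16) address.

    Hypothetical helper name; it only wraps the stdlib ipaddress check.
    """
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and ip.is_link_local


if __name__ == "__main__":
    # Example addresses are made up for illustration.
    for candidate in ("169.254.10.20", "10.0.0.5"):
        print(candidate, is_self_assigned(candidate))
```

If the affected machines show an address in that range, it means the client gave up waiting on DHCP, not that a server handed out a bad lease.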

Our infrastructure is Meraki for our APs, with the majority of our switches being Cisco.

The troubleshooting I have done has spanned the last 3 days. I am at my limit for testing and am making no headway.

Here's the list I've done (and I'm probably forgetting some steps):

  • Tested across multiple macOS versions (Mojave and High Sierra)
  • Removed the network list and Apple AirPort .plist files from the system configuration on the MacBook.
  • Ran multiple packet captures. Filtering on bootp, I see nothing but DHCP Discover messages when I first connect on the faulty floor, but when I move to the other floors, the entire DHCP process flows properly.
  • Tested with a Windows PC - unable to replicate the issue, connects just fine
  • Removed any network management that we had in place by Jamf from the MacBooks having issues
  • Deleted and re-added the RADIUS server certificates
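
To make the capture comparison less manual, here is a rough sketch of how the "Discover only" pattern could be spotted programmatically. It assumes you already have the raw BOOTP/DHCP payloads (the UDP payload bytes) exported from a capture; the function names `dhcp_message_type` and `handshake_status` are hypothetical, and this is not tied to any particular capture tool.

```python
# DHCP message type is carried in option 53, after the 236-byte BOOTP
# header and the 4-byte magic cookie (RFC 2131 / RFC 2132).
DHCP_MAGIC = b"\x63\x82\x53\x63"
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def dhcp_message_type(payload: bytes):
    """Return the DHCP message type name from a raw BOOTP payload, or None."""
    if len(payload) < 240 or payload[236:240] != DHCP_MAGIC:
        return None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break
        length = payload[i + 1]
        if code == 53 and length == 1 and i + 2 < len(payload):
            return MESSAGE_TYPES.get(payload[i + 2])
        i += 2 + length
    return None


def handshake_status(types):
    """Summarize how far the Discover/Offer/Request/Ack exchange got."""
    seen = set(types)
    if "ACK" in seen:
        return "complete"
    if "OFFER" in seen:
        return "offer seen, no ACK"
    if "DISCOVER" in seen:
        return "discover only"
    return "no DHCP traffic"
```

On the faulty floor this would report "discover only", and on the working floors "complete". That pattern suggests either the Discovers never reach the DHCP server or the replies never make it back to the client.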

It appears that the RADIUS servers are authenticating the connection, because in the network connection control panel I see that 802.1x is authenticated. It's immediately after that that the machines seem to lose all traffic and connectivity and return the 169.x.x.x address.

I feel like this issue is like a black hole. I can observe the effects, I can run all the tests in the world to confirm it is there, but I. Can. Not. find the root cause of it.

I'm out of things to test, because I think the things I need to look at are things I don't even know about. Please help save my sanity.

Edit: one other thing of note. We serve 3 SSIDs from the WAPs; two of them get DHCP from Windows DHCP servers, and one is served from another location.


