Thursday, March 14, 2019

What would have been the quickest way to diagnose the Google DNS problems last night, and what would be the best configuration to resolve and prevent them?

Got a call around 0400 UTC claiming every computer in the building is completely down! There is no internet anywhere! After quickly remoting in to my servers and a few desktops, I determined that the claims were exaggerated. But there were symptoms:

I have two facilities, A and B, about 45 minutes apart as the crow flies. Both are running 24/7 with overnight having only a skeleton crew so no on-site IT.

Buildings A and B both have an in-house DNS server that keeps local records and forwards other requests to the 8.8.8.8 and 8.8.4.4 servers.

I remote into a couple of machines at building B, no problems whatsoever.

Building A can connect to some sites but not others, usually receiving timeout errors, but there was an occasional weird message I've never seen about a protocol not being added to the host. Websites that did load were either perfectly fine or very slow. FQDNs that allowed ping responded without problem. Traceroutes didn't give any indication as to what may be wrong. At the time, downdetector wasn't indicating any problems with the big sites (eBay and Google, which loaded; Netflix and Reddit, which didn't). Double-checked from home and had no problems there.

Since even when the problem isn't DNS it is always DNS, I use nslookup and query google servers directly and have no problems resolving anything.
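A single nslookup that succeeds does not say much about intermittent timeouts or slow answers. As a rough sketch of how the same check could be repeated and timed against one specific resolver, the snippet below builds a minimal A-record query by hand and measures the round trip over UDP. The function names (build_query, time_query, probe) and the 2-second timeout are my own assumptions for illustration, not part of any tool mentioned above.

```python
import random
import socket
import struct
import time


def build_query(name, qtype=1, query_id=None):
    """Build a minimal DNS query packet (RFC 1035); qtype 1 is an A record."""
    if query_id is None:
        query_id = random.randint(0, 0xFFFF)
    # Header: ID, flags (only "recursion desired" set), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


def time_query(resolver, name, timeout=2.0):
    """Return the round-trip time in seconds, or None on timeout or socket error."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)  # assumed timeout, tune to taste
        start = time.monotonic()
        try:
            sock.sendto(packet, (resolver, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # socket.timeout is a subclass of OSError
            return None
        elapsed = time.monotonic() - start
    # Count the reply only if its ID matches the query that was sent
    return elapsed if reply[:2] == packet[:2] else None


def probe(resolver, names, rounds=5):
    """Query every name `rounds` times; return a list of times (None = no answer)."""
    return [time_query(resolver, n) for _ in range(rounds) for n in names]


if __name__ == "__main__":
    for server in ("8.8.8.8", "8.8.4.4"):
        print(server, probe(server, ["netflix.com", "reddit.com"], rounds=3))
```

Running it against 8.8.8.8 and 8.8.4.4 and then against the ISP's resolvers would show whether one set drops or delays answers that the other returns promptly. It only checks that a matching reply came back, not what the reply contains.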

Checked the firewall, no problems. Rebooted it for kicks and giggles just in case something weird was going on, still having problems.

Call up my fiber provider to check the circuit. They see nothing wrong with the link, but they have a note saying that there are a lot of reports of problems with Google's DNS. The tech sets his own DNS servers to point to Google and replicates the problems. He remotes into a box off premises somewhere, switches to the Google DNS and replicates the problems there as well. Problem found. I update my DHCP servers at A (but not B since they aren't having any problems) to point to the ISP's DNS servers, and everything is working normally again.

What steps should I have taken to diagnose the DNS servers as being at fault sooner? From the tests I ran, hostnames were resolving, but performance was really slow. I wasn't seeing excessive ping times, just pages that failed to load or really slow performance on some websites but not others. What should I have been looking for to spot that the problem was with Google?
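One way to frame "what to look for" is to compare timeout rate and median lookup time per resolver rather than trusting a single successful lookup. The sketch below is only an illustration of that comparison: it takes lists of measured lookup times (None for a lookup that got no answer) and flags resolvers that exceed thresholds. The function names and the thresholds (5% loss, 200 ms median) are assumptions I picked for the example, not established cutoffs.

```python
from statistics import median


def summarize(samples):
    """Summarize lookup times in seconds; None entries count as timeouts."""
    answered = [s for s in samples if s is not None]
    sent = len(samples)
    timeouts = sent - len(answered)
    return {
        "sent": sent,
        "timeouts": timeouts,
        "loss": timeouts / sent if sent else 0.0,
        "median_ms": median(answered) * 1000 if answered else None,
    }


def suspects(results, max_loss=0.05, max_median_ms=200.0):
    """Return resolvers whose loss or median latency exceeds the assumed thresholds."""
    flagged = []
    for resolver, samples in results.items():
        stats = summarize(samples)
        if not stats["sent"]:
            continue  # nothing measured for this resolver, so nothing to judge
        too_lossy = stats["loss"] > max_loss
        # A resolver that never answered has no median and is treated as slow
        too_slow = stats["median_ms"] is None or stats["median_ms"] > max_median_ms
        if too_lossy or too_slow:
            flagged.append(resolver)
    return flagged
```

Feeding it measurements for the Google resolvers alongside the ISP's resolvers, taken from the same machine at the same time, would separate "this resolver is misbehaving" from "the link is misbehaving": if every resolver shows loss, the circuit or firewall is the more likely culprit.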

Also, are the Google DNS servers not as reliable and amazingly awesome as I have been led to believe over the years? What would be a better configuration for me to use to prevent this from happening again?


