Friday, February 5, 2021

DNS: direct nslookup resolves, dig does not

My company makes a device that is installed at remote customer sites, on networks we do not control. The device communicates with our cloud service over HTTPS. Typically simple stuff.

Occasionally we run into network issues, as our customers' networks are often locked down hard and their IT staff aren't always very experienced. We get called, and there's a period of blaming our device for the problem before we figure out which bit of configuration they got wrong.

But today we ran into one that I haven't been able to figure out. DNS names are not resolving. Our software, written in Python 2 (yes, we have a Python 3 port complete, but the upgrade hasn't made it to this site), gives a "Name or service not known" error.
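For anyone who wants to reproduce the symptom outside our application, here is a minimal sketch of the same kind of lookup. It is not our actual code: the `resolve` helper is a name made up for illustration, and it is written for Python 3. It goes through the system resolver (and therefore through /etc/resolv.conf), which is the path that fails on this site.

```python
import socket


def resolve(hostname, port=443):
    """Resolve hostname through the system resolver.

    Returns (addresses, error): a sorted list of unique IP addresses and
    None on success, or an empty list and an error description on failure.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # On glibc, "Name or service not known" is reported as EAI_NONAME.
        return [], "{} (code {})".format(exc.strerror, exc.errno)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos}), None


# Example (hypothetical hostname from this post):
# print(resolve("my.company.com"))
```

Running this on an affected device and on a healthy one should show whether the failure is in name resolution itself rather than somewhere in our HTTPS code.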

The device runs a slightly older Debian (9.0, stretch), with connman installed. Connman replaces /etc/resolv.conf with a symlink to /var/run/connman/resolv.conf, which lists a local DNS resolver on 127.0.0.1 as the nameserver. Connman's resolver is bound to this address on the correct port (53).
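A quick way to double-check that setup on the device is to follow the symlink and list the nameservers the file actually contains. The sketch below is an illustration in Python 3; `parse_nameservers` and `describe_resolv_conf` are names I made up, and the parser only handles the plain `nameserver` lines and comment styles of resolv.conf.

```python
import os


def parse_nameservers(text):
    """Return the addresses from the 'nameserver' lines of resolv.conf text."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        # resolv.conf comments start with '#' or ';'.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def describe_resolv_conf(path="/etc/resolv.conf"):
    """Return (real path after following symlinks, list of nameservers)."""
    with open(path) as handle:
        return os.path.realpath(path), parse_nameservers(handle.read())


# Example:
# print(describe_resolv_conf())
```

On a device set up as described, I would expect the real path to be the connman file and the only nameserver to be 127.0.0.1; anything else would suggest something has rewritten the file.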

Connman is correctly configured with their two DNS servers. I can use nslookup to query both of these servers directly for our domain (we'll call it "my.company.com"), and both answer.

dig +trace my.company.com (querying through the connman resolver) returns a SERVFAIL after about 4 seconds. To me this means connman's forwarding mechanism is responding, but the forward is timing out.
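To compare the two paths with exactly the same query, one option is to send a bare UDP DNS query to each server in turn and record the response code and how long it took. This is a rough sketch in Python 3 using only the standard library, not a replacement for dig or nslookup: the helper names are made up, it sends a plain A query with recursion desired and no EDNS options, and it does not parse the answer section.

```python
import socket
import struct
import time

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(hostname, txid=0x1234, qtype=1):
    """Build a minimal DNS query packet (qtype 1 = A, class IN)."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(struct.pack("!B", len(label)) + label for label in labels)
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Extract the response code name from a raw DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, "RCODE{}".format(code))


def query(server, hostname, timeout=5.0):
    """Send one UDP query to server:53; return (result, elapsed seconds)."""
    start = time.monotonic()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(hostname), (server, 53))
        response, _ = sock.recvfrom(4096)
        result = rcode_of(response)
    except socket.timeout:
        result = "TIMEOUT"
    except OSError as exc:
        result = "ERROR: {}".format(exc)
    finally:
        sock.close()
    return result, time.monotonic() - start


# Example: 127.0.0.1 is the connman resolver; 192.0.2.53 is a placeholder
# for one of the customer's DNS servers, not a real address from this site.
# for server in ("127.0.0.1", "192.0.2.53"):
#     print(server, query(server, "my.company.com"))
```

If the direct query returns NOERROR quickly while the one to 127.0.0.1 returns SERVFAIL after several seconds, that would support the reading that the local resolver answers but its forwarded query never gets a reply.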

This same device works fine on hundreds of other sites, so we immediately suspect something odd with this network. But what? An nslookup query to their DNS servers works fine.

And this customer has TWO of our devices exhibiting the same problem.

How might I further debug the issue? Any ideas on what might be going wrong?


