Friday, April 27, 2018

ISP and Sonicwall each blaming the other for dropped link

One of my managed service clients uses a local fiber ISP for their primary internet access. My company uses Sonicwall at all our client sites, generally with no issues.

Sometime near the beginning of this year, the fiber link stopped routing traffic. The link to the Sonicwall TZ400 was live, and an IP address was pulled, but none of our traffic reached beyond the ISP's gateway. This of course caused the Sonicwall to fail over to the backup link from another (much slower) ISP, where it stayed until I manually disabled the failover configuration and tried sending pings and traceroutes across the problem connection. Suddenly they started going through and everything was fine, so I switched everything back and wrote it off as an ISP hiccup.

Three days later, the same thing happened again. Same fix again.

At this point, I contacted the ISP, who (predictably) blamed our equipment, despite the fact that we could ping their gateway. After some back and forth with them, we got the link back up and running.

Over the next weekend, the same thing happened again.

Long story short, the ISP finally did some troubleshooting, and is absolutely adamant that the problem is NOT on their end. Their rationale is "we have hundreds of clients on this same equipment, same configuration, and you're the only one experiencing this issue." The issue persists to this day, recurring about every 3 days or so, though sometimes it runs as long as a week and sometimes as short as 1 day.

Here's what has NOT permanently fixed the issue so far:

  • Disconnecting or rebooting the Sonicwall doesn't even bring the link back up.
  • Getting a new static IP from the ISP, or a sticky DHCP address, brought the link up, but it dropped again after a few days.
  • Replacing the Sonicwall with another model changed nothing.
  • Factory resetting and redoing the Sonicwall configuration from scratch changed nothing.
  • Following the recommendations in this article, as suggested by the ISP, changed nothing.

Here's what brings the link back up (for a few days):

  • Changing the MAC of the Sonicwall so it acquires a new address via DHCP. The new IP routes traffic just fine, after which I can switch back to the old IP and it will work fine, too.
  • Sending pings or traceroutes across the problem connection. They'll fail to reach beyond the ISP's gateway, but a few minutes afterwards the link will start routing traffic again.
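
For anyone who wants to check for the same symptom, here is a rough sketch of the test I keep repeating by hand: ping the ISP's gateway, ping something beyond it, and compare the two. It's only a sketch, with some assumptions. It assumes a Linux host behind the firewall whose traffic is forced out the fiber WAN (failover disabled), and it uses the Linux ping flags -c and -W. The function names are my own, not anything from Sonicwall or the ISP.

```python
import subprocess


def ping(host, count=3, timeout_s=2):
    """Return True if at least one ICMP echo reply comes back.

    Assumes the Linux (iputils) ping, where -c is the packet count and
    -W is the per-reply timeout in seconds. Other platforms use other flags.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(gateway_ok, beyond_ok):
    """Turn the two ping results into a short label for the link state."""
    if gateway_ok and beyond_ok:
        return "routing"        # normal operation
    if gateway_ok:
        return "gateway-only"   # the symptom described in this post
    if beyond_ok:
        return "gateway-silent" # gateway ignores ICMP but traffic still passes
    return "down"               # nothing answers at all


def check_link(gateway, beyond_host):
    """Ping the gateway and a host beyond it, then classify the result."""
    return classify(ping(gateway), ping(beyond_host))
```

Calling check_link() with the ISP gateway's address and any reliable outside host returns "gateway-only" during the kind of outage described here. Keep in mind that on this link the pings themselves seem to bring routing back after a few minutes, so the check isn't a passive observer.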

When I arrived to replace the Sonicwall for troubleshooting, the link was down, and remained down after I physically replaced the device. Only when I started sending pings across the link did it come back up.

I brought Sonicwall support into this during one of the outages. After spending a good hour capturing ARP traffic and verifying that they could, in fact, reach the ISP's gateway, they said they would need to work with the ISP to figure out what's happening to the traffic on the ISP's end. They confirmed that my config is good, and that there's nothing on the Sonicwall that they're aware of that could be causing this issue.

The ISP continues to insist it's not their end, but is willing to talk to Sonicwall directly about this to try and get to the bottom of it. Right now I'm just waiting for the link to fail again before I get everyone on the phone.
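
While waiting, it would help to have exact outage times to hand to both vendors instead of "about every 3 days." Here is a minimal sketch of how that could be pulled out of a log of periodic checks. It assumes I've been recording (timestamp, routed) samples somewhere, where routed is True when traffic got past the ISP's gateway. The function name and the sample format are my own invention for illustration.

```python
from datetime import datetime


def outage_windows(samples):
    """Collapse time-ordered (datetime, bool) samples into outage windows.

    Each sample's bool is True when traffic routed beyond the gateway.
    Returns a list of (start, end) tuples, where start is the first failed
    sample of a run and end is the first good sample after it. If the link
    is still down at the last sample, end is None.
    """
    windows = []
    start = None
    for ts, routed in samples:
        if not routed and start is None:
            start = ts                    # outage begins
        elif routed and start is not None:
            windows.append((start, ts))   # outage ends
            start = None
    if start is not None:
        windows.append((start, None))     # still down at end of log
    return windows
```

The gaps between successive start times give the recurrence interval, which is the number the ISP would need in order to look for anything on their side that runs on a similar schedule.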

I'm not new at this, and everything I can see tells me that the issue is 100% on the ISP side, but they have a good point: if it's their end, why are we the only ones with a problem?

I'm out of ideas. Has anyone else run into an issue like this before?


