Friday, December 29, 2017

Need help troubleshooting "the Internet."

So, we have a number of remote workers. They each have a little VPN router (Fortigate), a VoIP Phone, and a desktop provided by us.

Several of our users started having really bad connectivity problems around 2-3 months ago, and they persist to this day. During certain parts of the day, the web-based apps on their desktops slow to a crawl, their phone calls sound garbled and cut in and out (so jitter + loss, basically), and the other apps they use become slow and unresponsive.

I started mapping the users who frequently have the problems, and the thing is they aren't in the same region, they don't have the same ISP, and they don't even have the same "type" of Internet (2-3 are on DSL, another 4-5 are on cable, and one supposedly even has "fiber" Internet to their house).

The Internet pipe that their VPN routers talk through back at the data center is not saturated during these incidents. We have no link saturation in our backbone either. During one of the episodes we monitored the whole end-to-end path between the VPN head end and the VoIP server delivering the audio packets: nothing was saturated, and we could follow the path via NetFlow to confirm that yes, the traffic is going this way.

EDIT: and yes we have QoS and confirmed that the correct counters were incrementing, proving the traffic was treated as EF all the way to our edge, and the traffic coming back from the remote user was also marked EF when it arrived here.
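
For reference, here is how I think about the EF marking, as a minimal Python sketch rather than anything from our actual configs. EF is DSCP 46, and DSCP sits in the top six bits of the old ToS byte, so the value on the wire is 0xB8. The helper names (`dscp_to_tos`, `make_ef_udp_socket`) are made up for illustration, and the `IP_TOS` socket option is assumed to behave the way it does on Linux. Something like this could send test packets marked the same way as the voice traffic.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding code point (RFC 3246)


def dscp_to_tos(dscp: int) -> int:
    """Return the ToS / Traffic Class byte for a 6-bit DSCP value (ECN bits left at zero)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    # DSCP occupies the upper six bits of the byte, so shift past the two ECN bits.
    return dscp << 2


def make_ef_udp_socket() -> socket.socket:
    """Hypothetical helper: a UDP socket whose outgoing packets are marked EF."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Assumption: IP_TOS is honored by the local OS (true on Linux).
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(DSCP_EF))
    return sock


print(hex(dscp_to_tos(DSCP_EF)))  # 0xb8
```

Of course the ISPs in between are free to ignore or rewrite the marking, so this only shows what we send, not how it gets treated out there.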

All health checks on the VPN head end and the individual endpoint routers come back clean. No memory or CPU spikes, no syslog errors... nothing.

When it's happening and you remote into their router, it's horrendously slow. Even in the CLI, the characters you type show up something like a minute after you've typed them, and the SSH session will just drop so you have to reconnect.

We all feel pretty strongly that the problem is "out there on the Internet" somewhere between them and us. But there doesn't seem to be any kind of evidence proving this. Like, I wouldn't feel confident opening a ticket with our provider on that Internet circuit, because we have no real evidence showing there's a problem... and for all we know it could be dirty fiber at one random Internet exchange somewhere out there. I don't know how we can isolate the problem.

Is there some kind of tool or method we should use to get more data?
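
To make the question concrete, this is roughly the kind of data I'd like to have per user: loss and jitter for small batches of probes, timestamped, so bad periods can be lined up across users and ISPs. What follows is a minimal Python sketch, assuming some probe already gives us a list of round-trip times in milliseconds with `None` for every probe that got no reply. The function name and the simple jitter definition (mean absolute difference between consecutive RTTs) are my own assumptions, not taken from any particular tool.

```python
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of probes; None marks a probe that got no reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    lost = sent - len(replies)
    loss_pct = 100.0 * lost / sent if sent else 0.0

    # Simplified jitter: mean absolute change between consecutive replies.
    # Replies on either side of a lost probe are treated as consecutive.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter_ms = sum(deltas) / len(deltas) if deltas else 0.0

    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": loss_pct,
        "avg_rtt_ms": sum(replies) / len(replies) if replies else None,
        "max_rtt_ms": max(replies) if replies else None,
        "jitter_ms": jitter_ms,
    }


# Made-up sample batch: four probes, the third one lost.
print(summarize_probes([20.0, 22.0, None, 30.0]))
# {'sent': 4, 'lost': 1, 'loss_pct': 25.0, 'avg_rtt_ms': 24.0, 'max_rtt_ms': 30.0, 'jitter_ms': 5.0}
```

Running something like this every minute from each remote desktop, against both a target inside the VPN and one outside it, would at least show whether the bad periods line up with the tunnel or with the user's raw Internet connection.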

We usually can't ping their outside public IP, since the modem delivered by their ISP usually blocks inbound pings anyway, and a traceroute to their outside public IP shows no unusual latency at any hop either. Like wtf is going on?


