Saturday, December 14, 2019

VoIP troubleshooting, what are we missing?

My team and I (small WISP) have been troubleshooting some over-the-internet VoIP issues for a couple of customers over the past few weeks, and we are running out of ideas as to what the problem could be and what else to test. I'm looking for a sanity check here as well as ideas.

Also, I should throw it out there that most will read this and think, wow, this guy went way deeper than he should have. But we truly care about our customers and the experience they have with our network and our team. For us, it’s worth the time investment.

Scenario:

A customer called and mentioned that people on the other end of the line cannot hear them. Sure enough, if they call me, or if I call them, their audio is very choppy. They can hear me fine, though. Their phones are hosted with Nextiva and they get a 50x50Mbps connection from us. They rarely peak at even 20Mbps up/down, so there is plenty of headroom.

What we’ve done so far:

We’ve run some Ping Plotter tests to their CPE’s management IP, as well as to their firewall’s WAN IP (public IP /30, routed to the internet, no firewalls in the data path on our end, Cisco routers only). Packet loss occasionally reaches 0.1%, but most of the time it’s 0.0% according to PP. Ping interval is 0.5 seconds.
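One caveat on these numbers: a probe every 0.5 seconds samples the path far more sparsely than a voice stream does. Here is a rough sketch of how many packets a short outage is guaranteed to hit, assuming the phones send one RTP packet every 20 ms (a common default, not something we have confirmed for this setup):

```python
def packets_hit(outage_ms: int, interval_ms: int) -> int:
    """Minimum number of evenly spaced packets that land inside an outage."""
    return outage_ms // interval_ms


PING_INTERVAL_MS = 500  # Ping Plotter interval from the tests above
RTP_INTERVAL_MS = 20    # assumed packetization time, not measured

for outage_ms in (100, 200, 400, 1000):
    pings = packets_hit(outage_ms, PING_INTERVAL_MS)
    voice = packets_hit(outage_ms, RTP_INTERVAL_MS)
    print(f"{outage_ms} ms outage: at least {pings} pings, {voice} voice packets lost")
```

Under that assumption, a 200 ms outage can fall entirely between two pings yet still covers at least 10 voice packets, so near-zero ping loss does not by itself rule out short bursts of loss.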

We did notice a few errors on the CPE’s (UBNT PowerBeam iso 400 ac) ethernet interface so we had the cable replaced/certified just to be sure.

The customer’s MSP (we have a great relationship with them) has swapped the firewall for 3 different models to rule that end out. They’ve also verified there are no errors or packet loss on the LAN end. We’ve even plugged a phone directly into their switch to eliminate internal wiring.

We also got the vendor, Nextiva, involved to validate their config on the firewall and network. In addition, we’re running a Ping Plotter trace to the Nextiva IP that the phones connect to; it’s a ruler-flat 25ms, never any packet loss. Worth mentioning, though: the router/interface where our upstream (HE) and Level 3 peer drops around 80% of the packets in that trace. HE confirmed our suspicion that this is indeed control plane policing, which we’re not worried about since the traces to the Nextiva IP are perfect.

FWIW, Nextiva’s highest tier of support is clueless. They told us there must be firewall rules on the switch (flat network, L2-only switch) that are causing the problems 🤨.

We ran a call quality simulator/test tool from the customer’s computers to RingCentral and it came back with a perfect score in every area: 1-2ms of jitter, near-perfect MOS. Obviously this is a different data path, but it does suggest that it’s likely not an on-net issue on our end.
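For reference on what a jitter figure measures: RTP tools commonly report interarrival jitter as defined in RFC 3550, a running estimate smoothed with a gain of 1/16. A minimal sketch with made-up timestamps in milliseconds follows; I don't know whether the test tool computes its number exactly this way.

```python
def interarrival_jitter(send_ms, recv_ms):
    """RFC 3550 style running jitter estimate from send and arrival times."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # Change in transit time between consecutive packets.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        # Exponential smoothing with gain 1/16, as in the RFC.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Hypothetical packets sent every 20 ms; the third arrives 5 ms late.
sent = [0, 20, 40]
received = [25, 45, 70]
print(interarrival_jitter(sent, received))
```

Because the estimate is heavily smoothed, a low jitter number over a whole test can coexist with occasional large delay spikes.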

In addition, we made concurrent phone calls to our office: one from the customer’s desk phone and a PSTN call from an app (Microsoft Teams) on my mobile connected to their WiFi. The call on the Nextiva phone cut out and the Teams call was crystal clear. Also, if the customer makes a call from the Nextiva app, it’s clear both ways.

We have a tool that actively scans our network to look for interface errors on our route/switch/wireless network and it’s 100% clean. I even did a show interface on every link along the data path to confirm no output drops, CRC errors, queuing, etc...

So at this point we thought, OK, this has to be on Nextiva’s end. Until... two other customers complained about similar issues, and they’re on different towers with different VoIP providers. Now, it’s possible these are just coincidental, though we very rarely get any support calls at all (our customers are all business customers), so the fact that multiple people are calling in has us thinking it’s something to do with our network.

Theories/suspicions/interesting observations:

We monitor our connectivity to popular sites like Google, Facebook, etc... HE has been having saturation issues with Google recently; latency will go from 8ms to 200ms for a few hours. They acknowledge this and are working on it. The thing that doesn’t add up here is that the Ping Plotter traces to Nextiva are perfect.

Ping Plotter claims that even though loss shown at some routers may just be control plane policing, it may still indicate an issue. I wonder if our upstream, or maybe even Level 3, is prioritizing ping, which would skew the results?
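The usual way to read per-hop loss in a trace is that loss at an intermediate hop only matters if it carries through to the hops after it; loss that shows up at one router and then disappears is typically that router rate-limiting its own replies. A minimal sketch of that check, where the hop values and the `tolerance` parameter are made up for illustration:

```python
def loss_carries_forward(hop_loss_pct, hop_index, tolerance=1.0):
    """True if every later hop shows roughly the same loss or more."""
    later = hop_loss_pct[hop_index + 1:]
    threshold = hop_loss_pct[hop_index] - tolerance
    return bool(later) and all(loss >= threshold for loss in later)


# Hypothetical trace: 80% loss at the third hop, clean afterwards.
policed = [0.0, 0.0, 80.0, 0.0, 0.0]
print(loss_carries_forward(policed, 2))   # False: looks like rate limiting

# Hypothetical trace: loss starts at the third hop and persists.
real_loss = [0.0, 0.0, 5.0, 5.0, 6.0]
print(loss_carries_forward(real_loss, 2))  # True: worth chasing
```

This only covers the policing half of the question. It says nothing about whether ICMP is being treated differently from voice traffic, which is something only a capture of the actual call can answer.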

I’m going to do a Wireshark capture on the customer’s port when they make a call to see if the calls are maybe hitting a different IP address than the one provided by Nextiva. Given their piss-poor support, I’m skeptical that the address they gave us is correct. I’m wondering if maybe the SIP traffic goes to one IP and the audio stream to another.
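In SIP, the audio destination is negotiated in the SDP body (the `c=` and `m=audio` lines), so it can legitimately differ from the signaling address. Here is a minimal sketch of pulling it out of an SDP body copied from a capture. The sample SDP uses documentation-range addresses and is made up, not taken from Nextiva, and the function name is my own.

```python
def audio_endpoint(sdp: str):
    """Return (ip, port) of the first audio stream in an SDP body, or None."""
    session_ip = None
    audio_ip = None
    audio_port = None
    section = "session"  # lines before the first m= are session-level
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            if audio_port is not None:
                break  # the first audio section has ended
            media, port = line[2:].split()[:2]
            section = media
            if media == "audio":
                audio_port = int(port.split("/")[0])
        elif line.startswith("c="):
            # Format: c=<nettype> <addrtype> <address>
            addr = line[2:].split()[2].split("/")[0]
            if section == "session":
                session_ip = addr
            elif section == "audio":
                audio_ip = addr
    if audio_port is None:
        return None
    # A media-level c= line overrides the session-level one.
    return (audio_ip or session_ip, audio_port)


SAMPLE_SDP = """v=0
o=- 1 1 IN IP4 192.0.2.10
s=-
c=IN IP4 198.51.100.7
t=0 0
m=audio 49170 RTP/AVP 0 8
a=rtpmap:0 PCMU/8000
"""

print(audio_endpoint(SAMPLE_SDP))
```

If the address this returns differs from the IP we have been tracing to, the Ping Plotter results so far would not describe the path the audio actually takes.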

Anyhow, if you got this far, congratulations! Any input/ideas/questions are all appreciated.


