Saturday, February 29, 2020

Intermittent packet loss only to certain IPs over point-to-point

I have been trying to figure this one out for almost a week and it's one of the most confusing things I've ever seen.

We have Comcast ENS connecting 5 sites to one another, and starting early this week, all of a sudden we are seeing severe packet loss (20-50%+ depending on destination/time) to and from certain hosts at our main site. The Comcast router is an ISR4431, and our core switch is a Catalyst 6500. Pings from a remote Cisco device to these hosts show fairly consistent, evenly spaced loss (e.g. !!!!!.!!!!!.!!!!!.!!!!!), with a single ICMP packet dropped at a regular interval, anywhere from every other packet to one in every 10-15. The frequency of dropped packets also generally increases with ICMP packet size, so it seems to be dropping either after so many bytes or after so many milliseconds.
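To compare runs across destinations and packet sizes, the pattern can be summarized with something like the following. This is a minimal Python sketch, assuming the ping result line is pasted in as a string where `!` is a reply and `.` is a timeout; the helper name `summarize_ping` is hypothetical and made up for illustration.

```python
def summarize_ping(pattern: str) -> dict:
    """Summarize a Cisco-style ping result string ('!' = reply, '.' = timeout)."""
    # Keep only the result marks so stray whitespace or line breaks are ignored.
    marks = [c for c in pattern if c in "!."]
    # Zero-based positions of the dropped probes.
    drops = [i for i, c in enumerate(marks) if c == "."]
    # Distance, in probes, between consecutive drops.
    gaps = [b - a for a, b in zip(drops, drops[1:])]
    sent = len(marks)
    return {
        "sent": sent,
        "lost": len(drops),
        "loss_pct": round(100 * len(drops) / sent, 1) if sent else 0.0,
        "gaps": gaps,
        # "Periodic" here just means every gap between drops is identical.
        "periodic": len(gaps) >= 2 and len(set(gaps)) == 1,
    }


if __name__ == "__main__":
    # The example pattern from the post: one drop every sixth probe.
    print(summarize_ping("!!!!!.!!!!!.!!!!!.!!!!!"))
```

Running this for each destination and each datagram size would show whether the gap between drops shrinks as the packets get larger, which is what the byte-count versus time-interval question comes down to.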

The strangest thing is that the packet loss affects only half or fewer of the hosts at our main site, with basically 0 loss to the other hosts. The hosts that exhibit packet loss are consistently the same ones regardless of which remote site the traffic comes from, and there is no loss between one remote site and another, so we feel that we have isolated the problem to something at our main site. Comcast did L2 tests between their switches and found no drops, errors, etc., including by the BUM filter.

We unfortunately do not have direct access to the ISR4431 as it is managed by a third party, but I did get them to send us a show tech-support output. I found some posts about a Catalyst switch having stuck-open TCP sessions where they were seeing the same kind of intermittent packet loss only to certain hosts, but doing a "show tcp brief" on the 4431 and our 6500 only showed a couple of open sessions each.

We are kind of at a loss here, and aside from rebooting the ISR (which we can do, but have to schedule an outage window) we don't know what could be causing this. Routing? OSPF routes are /24, and different hosts in the same subnet/VLAN are showing either no packet loss or quite a lot, so I don't think it's that. Some sort of ARP issue? Any assistance/ideas would be awesome.


