I'm trying to pinpoint why I'm seeing intermittent bursts of overrun errors on an ASR1002. The bursts are quite large as well, up to around 400,000 in a short 10-minute window.
The ASR1002 has the ESP-10 and 2 x 10Gb line cards. It just has one 10Gb uplink and one 10Gb link to our supplier, on which we have a few hundred customers.
Anything you look at online says that overrun errors are caused by the router receiving too much traffic that it can't process in time. We've been monitoring the traffic very carefully and we are averaging around 1.1Gbps to 1.5Gbps through the device.
Now it's possible there is some very bursty traffic causing this, but we haven't been able to spot it with SolarWinds or with NetFlow enabled on the router. To be honest, even if it was a sudden microburst, I don't think it would take it above the router's capacity, and all the customers which hang off the router have designated bandwidths, so they could only burst up to their allocation. Still, it's possible and something we are looking at.
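A rough sketch of why an averaged graph can hide a short burst. All the numbers here are made up for illustration: the 300-second polling interval, the 100 ms burst length and the 1.3Gbps baseline are assumptions, not measurements from this router.

```python
def averaged_rate_gbps(baseline_gbps, burst_gbps, burst_seconds, poll_seconds):
    """Average rate a poller would report for one interval containing one burst."""
    if not 0 <= burst_seconds <= poll_seconds:
        raise ValueError("burst must fit inside the polling interval")
    # Total gigabits moved in the interval: baseline outside the burst,
    # burst rate during it.
    total_gbits = (baseline_gbps * (poll_seconds - burst_seconds)
                   + burst_gbps * burst_seconds)
    return total_gbits / poll_seconds


# Hypothetical values: 1.3 Gbps baseline, one 100 ms burst at 10 Gbps
# line rate, averaged over a 300 s polling interval.
reported = averaged_rate_gbps(1.3, 10.0, 0.1, 300.0)
print(f"poller would report about {reported:.4f} Gbps")
```

With those assumed values the reported average moves from 1.3Gbps to roughly 1.303Gbps, so a burst like that would be invisible on a 5-minute graph.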
As we couldn't find what was responsible, our next thoughts were:
Faulty optics,
Faulty line cards,
Faulty fibre,
and finally a faulty router.
They've all been replaced and still we see the overrun errors. It happens maybe 4 times a day (some days not at all) and over a day we can run up to around 1 to 2 million overrun errors.
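To put those figures side by side, here is a quick back-of-the-envelope calculation using only the numbers quoted above (400,000 errors in a 10-minute burst, about 4 bursts a day, 1 to 2 million errors a day). It is just arithmetic, not output from the router.

```python
def per_second(count, window_seconds):
    """Average events per second over a window."""
    return count / window_seconds


SECONDS_PER_DAY = 24 * 60 * 60

# Rate inside one burst: 400,000 overruns in 10 minutes.
burst_rate = per_second(400_000, 10 * 60)

# Rate if the daily totals were spread evenly over 24 hours.
daily_low = per_second(1_000_000, SECONDS_PER_DAY)
daily_high = per_second(2_000_000, SECONDS_PER_DAY)

# Four bursts of the largest size seen, for comparison with the daily total.
four_bursts = 4 * 400_000

print(f"in-burst rate:  {burst_rate:.1f} overruns/s")
print(f"daily average:  {daily_low:.1f} to {daily_high:.1f} overruns/s")
print(f"4 large bursts: {four_bursts:,} overruns")
```

Four bursts of that size add up to 1.6 million, which sits inside the 1 to 2 million daily range, so the daily totals are consistent with almost all of the errors arriving in the bursts rather than trickling in between them.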
I've got a Cisco TAC case ongoing, but they haven't been much help in finding the cause. They basically keep reading off the Cisco literature and advising that the router is hitting capacity, even though the event manager script they enabled to capture traffic on the router didn't actually find anything of significance.
So I've bounced it back to them several times and it's still in their hands.
One thing I have noticed, which may not be of significance, is that when I compare this ASR1002's 10Gb link to our provider with our others, this one has 'route cache' counters incrementing. Not a huge amount, about 5 a second, but I don't see this happen on any of our other ASR routers, which have identical setups and similar throughputs.
All my reading on the route cache doesn't really point me to an issue but I can't figure out why this one would be incrementing. We've gone down the line that maybe our 10Gb provider is having issues and it's causing buffers to fill up between our ASR1002 and their equipment which then causes the overruns.
The last direction we are looking into is whether someone is sending a certain type of traffic through the ASR at certain times of day which is causing issues. This is why we've enabled NetFlow, to try to see if there is a pattern to what data is going through the ASR when the overrun events occur.
So far no pattern that we can see. We see some high amounts of ESP traffic going through, but nothing crazy or of concern.
Looking to see if any of you guys/gals have experienced anything similar. Thanks
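One way to line the NetFlow data up against the errors is to save timestamped copies of the interface output and work out in which window the overrun counter jumped. The sketch below is a minimal example of that idea. It assumes the counter appears in the usual "N overrun" form on the input errors line, and the sample values, timestamps and threshold are all made up.

```python
import re

# Matches the overrun counter in text such as:
#   "412 input errors, 0 CRC, 0 frame, 412 overrun, 0 ignored"
OVERRUN_RE = re.compile(r"(\d+)\s+overrun")


def overrun_count(show_interface_text):
    """Extract the overrun counter from saved interface output."""
    match = OVERRUN_RE.search(show_interface_text)
    if match is None:
        raise ValueError("no overrun counter found")
    return int(match.group(1))


def burst_windows(samples, threshold):
    """Return (start, end, delta) for each pair of consecutive samples
    where the counter grew by at least `threshold`.

    `samples` is a list of (timestamp, counter) tuples in time order.
    """
    windows = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta >= threshold:
            windows.append((t0, t1, delta))
    return windows


# Hypothetical samples taken one minute apart.
samples = [("09:00", 100), ("09:01", 100), ("09:02", 5100), ("09:03", 5100)]
for start, end, delta in burst_windows(samples, threshold=1000):
    print(f"{delta} overruns between {start} and {end}")
```

The windows it reports are the ones worth pulling NetFlow records for. The shorter the sampling interval, the tighter the window.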