Friday, March 2, 2018

Detection of broken forwarding planes

Hi all,

We have experienced a problem where one of our core router was causing high latency. The vendor couldn't find anything. We eventually found the problem in one of the hardware fabric modules that handle the forwarding plane. This issue was already present in our network for 2 days, we didn't notice because there were no links that failed, no errors, no logs, nothing. Of course Google's traffic controller noticed higher latency in their applications and steered traffic away from the affected links, which helped us pinpoint the problem.

What systems do you currently use to track the health of your network? I'm thinking about something with these capabilities: * External probe appliances with 10G connectivity (our cores don't have 1G). We want to monitor around 150 nodes, so thinking of the same amount of probes. * Automatic correlation to pinpoint a failing link/node. * Ad-hoc analysis: start latency test, traceroute from all or some probes in the network to a specific IP. (we sometimes get complaints from residential users that experience higher latency to gaming servers, etc. Such a solution would cut down the time to analyze these issues)

Was looking into ThousandEyes, but I'm somewhat lost in their product portfolio. Thanks for sharing your experiences.



No comments:

Post a Comment