Thursday, July 1, 2021

Automatic rerouting/failover around upstream ISP issues

I'm looking for recommendations on ways to automate "failover" of traffic between carrier-redundant internet circuits when the issue is further upstream in our primary ISP's network.

Our current setup: an enterprise network with its own ASN and PI prefixes, two internet circuits on different carriers, and an all-BGP edge environment. Path selection follows an "active/passive" design: we set a higher local preference internally on the primary for outbound path selection, and we attach BGP communities via outbound route maps toward each carrier to influence their local preference for inbound path selection.
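To make the outbound half of that design concrete, here is a minimal Python sketch of how local preference produces "active/passive" behavior. It is a toy model, not real BGP best-path selection (which has many more tie-breakers), and the `Path` class, the `select_outbound` function, and the local preference values 200 and 100 are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    carrier: str       # hypothetical label for the upstream carrier
    local_pref: int    # higher value wins for outbound traffic
    session_up: bool   # whether the BGP session is established


def select_outbound(paths):
    """Return the carrier of the usable path with the highest local preference."""
    usable = [p for p in paths if p.session_up]
    if not usable:
        return None
    return max(usable, key=lambda p: p.local_pref).carrier


# Illustrative values only: primary preferred over secondary.
paths = [Path("primary", 200, True), Path("secondary", 100, True)]
print(select_outbound(paths))
```

In this model the secondary is chosen only when the primary session goes down; the quality of the path behind an established session plays no part in the decision.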

Today we had an issue with our primary carrier: our circuit and their metro area were fully operational, but they had an issue with their interstate backbone that led to roughly 50% packet loss and a 90% throughput reduction, a bad time for our users. Because the BGP neighborship stayed up and the default route was still advertised to us, our routers were blind to the issue, so no rerouting to our secondary occurred. Another unrelated incident was tying up our on-call resources, so we were slow to notice the problem; by the time we identified what was happening and got traffic rerouted over to the other carrier, we had enough users blowing a gasket to turn this into a Big Deal™. It took about 1 hr from onset to workaround (it would have been faster if we had a pre-set runbook for manual reroute).
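The kind of signal that would have caught this is end-to-end loss measured through the primary carrier, rather than BGP session state. As a rough sketch of the decision logic only, assuming probe results (success or failure) are already being collected toward targets beyond the carrier's backbone: the `LossMonitor` class, the window size, and both thresholds below are hypothetical choices, and how probes are sent or how routing is actually changed is deliberately left out.

```python
from collections import deque


class LossMonitor:
    """Sliding-window loss tracker with separate fail and recover thresholds."""

    def __init__(self, window=20, fail_above=0.2, recover_below=0.02):
        # All three defaults are illustrative assumptions, not recommendations.
        self.results = deque(maxlen=window)
        self.fail_above = fail_above
        self.recover_below = recover_below
        self.failed_over = False

    def record(self, probe_ok):
        """Record one probe result and return True while failover should be active."""
        self.results.append(probe_ok)
        # Make no decision until a full window of samples is available.
        if len(self.results) < self.results.maxlen:
            return self.failed_over
        loss = self.results.count(False) / len(self.results)
        if not self.failed_over and loss > self.fail_above:
            self.failed_over = True
        elif self.failed_over and loss < self.recover_below:
            self.failed_over = False
        return self.failed_over


monitor = LossMonitor(window=10)
for i in range(10):
    state = monitor.record(i % 2 == 0)  # simulate roughly 50% probe loss
print(state)
```

Using a lower threshold for recovery than for failure keeps the decision from flapping while the primary path is only marginally healthy.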

Are there any common, reliable (and ideally free, but I don't want to be a cb here) solutions that automatically identify upstream ISP issues like this and adjust routing accordingly, so that we can respond to such incidents more rapidly? We're running Cisco ASRs on our edge, if that makes a difference.


