Saturday, October 2, 2021

Detecting and mitigating BGP peer black holes

We're a small regional ISP and data center. We have several upstream bandwidth providers and networks we peer with. One of the bandwidth providers we peer with on a 10G link recently had a power failure, and their link went down. No big deal: BGP handles that just fine.

Two days later we started to see 35% of our traffic dropping. After about 10 minutes of investigating, it became clear that traffic we sent to them, and traffic arriving through them on its way into our network, was being accepted and then dropped, creating a traffic black hole.

Because the BGP sessions weren't flapping, flap protection didn't kick in, and because no link was actually down, BGP didn't bypass the link.
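To make the gap concrete, here is a minimal Python sketch of the condition described above: the BGP session looks healthy while the data plane is dropping packets. It is not something we run, and it is not a feature of any router. The names `PeerHealth`, `loss_ratio`, and `is_black_hole` are hypothetical, the 30% threshold is an assumption, and the probe results are assumed to come from some external check (for example, pings to a target reached through that provider).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PeerHealth:
    """Hypothetical snapshot of one upstream peer."""
    session_established: bool        # control plane: is the BGP session up?
    probe_results: Sequence[bool]    # data plane: True = probe got a reply


def loss_ratio(probe_results: Sequence[bool]) -> float:
    """Fraction of probes that got no reply (0.0 if there were no probes)."""
    if not probe_results:
        return 0.0  # no evidence either way; treated as "no loss seen"
    lost = sum(1 for ok in probe_results if not ok)
    return lost / len(probe_results)


def is_black_hole(peer: PeerHealth, loss_threshold: float = 0.3) -> bool:
    """True when the session is up but probes through the peer are failing.

    A peer whose session is down is NOT reported here, because BGP already
    withdraws its routes in that case. The threshold is an assumed value.
    """
    return peer.session_established and loss_ratio(peer.probe_results) >= loss_threshold


# Three illustrative cases (made-up probe data).
healthy = PeerHealth(True, [True] * 10)
black_holed = PeerHealth(True, [True] * 6 + [False] * 4)
link_down = PeerHealth(False, [False] * 10)

for name, peer in [("healthy", healthy), ("black_holed", black_holed), ("link_down", link_down)]:
    print(name, is_black_hole(peer))
```

Only the second case is flagged: the session is established, yet 4 of 10 probes are lost. The third case is left to BGP itself, since a downed session already removes the routes.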

1) There has to be an elegant way of handling this without manual intervention, right? Massive networks with hundreds of similar providers can't be managing the quality of those peering relationships manually.

2) Are there routing table rules that can detect these situations and lower the affected route's weight so it doesn't get used?

TIA!


