Wednesday, October 3, 2018

Reconvergence issue

I have a peculiar but very impactful issue. We have dual L3 MPLS clouds for redundancy and very low BGP timers for fast failure detection. We started to see that whenever the primary MPLS circuit went down at any site, the CE at that site would flush the routes and fail over to the other MPLS cloud in about 10-15 seconds, but our other sites kept sending traffic toward the downed circuit, basically sending traffic into a black hole. Only 3-5 minutes later would the rest of the sites flush the failed site's routes and use the backup MPLS cloud to reach it. This also affects any routing update: if we remove a route from being advertised, it also takes 3-5 minutes to update everywhere else.

We did some after-hours troubleshooting and eventually saw that the local PE/CE flushed the routes right away when the BGP hold timer expired. Our SP was extremely skeptical about who was to blame, but they saw that the site's route was not being removed on their PEs in a timely manner. Their "solution" was to implement BFD to improve convergence. Now I am the skeptical one, because BFD only speeds up failure detection on the link; it does nothing to ensure that BGP routing updates get propagated. Or am I wrong?
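To put rough numbers on that skepticism, here is a minimal back-of-the-envelope sketch. It assumes, as a simplification, that the time for a remote site to stop using the failed path is just local failure detection plus the time the withdrawal takes to reach the remote site. The function name and the BFD interval (300 ms with a multiplier of 3) are hypothetical values chosen for illustration, not anything the SP proposed.

```python
def remote_convergence_s(detection_s, propagation_s):
    """Seconds until a remote site stops using the failed path.

    Simplified additive model (an assumption, not a measurement):
    local failure detection + withdrawal propagation to the remote site.
    """
    return detection_s + propagation_s


# Figures from the post: local failover in ~10-15 s, remote sites in 3-5 min.
observed_detection_s = 15   # upper end of the local failover time
observed_remote_s = 180     # lower end of the 3-5 minute window

# Delay implied for the withdrawal to reach the remote sites
propagation_s = observed_remote_s - observed_detection_s

# Hypothetical BFD detection time: 300 ms interval x 3 multiplier (assumed)
bfd_detection_s = 0.3 * 3

with_bfd_s = remote_convergence_s(bfd_detection_s, propagation_s)

print(f"implied propagation delay: {propagation_s} s")
print(f"remote convergence with BFD: {with_bfd_s:.1f} s")
```

Under these assumptions BFD removes only the detection portion, so the remote sites would still take close to three minutes to stop sending traffic to the failed circuit.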

Has anyone dealt with this issue before?

siteX ------------CE<-bgp->PE--- (MPLS Cloud) ----PE<-bgp->CE------------DC.site

10.x.x.x/24........failure.......................................(still sees route)


