Tuesday, July 10, 2018

Internet Routing Table Size and Fail-over and BOOM! x-post from /r/Juniper

Okay, so I have two edge routers (MX480s with RE-S-1800x4 and MPCE Type 3 3D), each connected to two providers. I am receiving a full routing table from all of my providers. One is Level3; the others are regional providers and the data center. Level3 is where the majority of our traffic defaults to.

Today was fail-over testing day and it did not go as smoothly as expected.

Taking one of the routers offline (the router connected to Level3) went just fine, but bringing it back up caused about 5 minutes of downtime.

Inbound traffic from the Internet died at the hop before my routers. (I'm assuming my router either didn't have a route back or was looping internally.)

I waited about 20 minutes between taking the router offline and bringing it back online to avoid triggering any upstream route-flap dampening.
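Here's a rough sanity check on that 20-minute wait. It's a minimal sketch that assumes the upstreams use RFC 2439-style exponential decay with commonly cited default values (15-minute half-life, suppress at 3000, reuse at 750, 1000 penalty per flap). Those numbers are assumptions, and the providers' real settings may differ.

```python
import math

# ASSUMED RFC 2439-style dampening parameters; the upstream's real values may differ.
HALF_LIFE_MIN = 15.0
SUPPRESS_THRESHOLD = 3000.0
REUSE_THRESHOLD = 750.0
PENALTY_PER_FLAP = 1000.0


def decayed_penalty(penalty: float, minutes: float, half_life: float = HALF_LIFE_MIN) -> float:
    """Exponential decay: the penalty halves every `half_life` minutes."""
    return penalty * math.pow(0.5, minutes / half_life)


def penalty_after_flaps(flaps: int, gap_minutes: float) -> float:
    """Penalty immediately after the last of `flaps` flaps spaced `gap_minutes` apart."""
    penalty = 0.0
    for i in range(flaps):
        if i:
            # Let the accumulated penalty decay before the next flap adds to it.
            penalty = decayed_penalty(penalty, gap_minutes)
        penalty += PENALTY_PER_FLAP
    return penalty


if __name__ == "__main__":
    # One withdrawal, then the router comes back 20 minutes later.
    print(f"{decayed_penalty(PENALTY_PER_FLAP, 20):.0f}")  # prints 397
    # Rapid flapping (1 minute apart) is what actually crosses the suppress threshold.
    for n in range(1, 5):
        p = penalty_after_flaps(n, 1)
        print(n, round(p), "suppressed" if p > SUPPRESS_THRESHOLD else "ok")
```

Under these assumptions a single down/up cycle never gets close to the suppress threshold, and after 20 minutes the penalty has already decayed below the reuse threshold. Dampening is therefore an unlikely explanation for the outage when the router came back.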

Outbound traffic from the internal network to the Internet was looping between my routers.

I checked the routing table, and both routers had active, best routes to my destinations. I did NOT get to check the details of these routes or the actual forwarding table. My best guess is that routes were being shared between my routers BEFORE they were installed in the forwarding table. I suspect this because I have had this same setup for years, but the last time we did forced fail-over testing, the routing table was only 500k routes versus the 700k it is now. Also, during this time the routing engine CPU was pegged at 100%.
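To show why "advertised before installed" would produce a loop like this, here is a toy model. It's a simplified sketch of my hypothesis, not a model of Junos internals. It assumes router A (the Level3 side) still has an older FIB entry pointing at router B, and that B prefers Level3-learned routes (for example via local-preference). The router names, `PREFIX`, and the `snapshot`/`trace` helpers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

PREFIX = "203.0.113.0/24"  # hypothetical destination (documentation range)


@dataclass
class Router:
    name: str
    fib: dict = field(default_factory=dict)  # prefix -> next hop ("upstream" or a peer router name)


def trace(routers: dict, start: str, prefix: str, max_hops: int = 8) -> list:
    """Follow FIB next hops hop by hop; the path ends in 'upstream', 'DROP', or 'LOOP'."""
    path = [start]
    current = start
    for _ in range(max_hops):
        nh = routers[current].fib.get(prefix)
        if nh is None:
            return path + ["DROP"]
        if nh == "upstream":
            return path + ["upstream"]
        if nh in path:
            return path + [nh, "LOOP"]
        path.append(nh)
        current = nh
    return path + ["LOOP"]


def snapshot(wait_for_fib: bool) -> dict:
    """Hypothetical forwarding state shortly after router A comes back up.

    A's RIB already prefers the Level3 path, but its FIB still holds the older
    entry learned from B over iBGP because programming ~700k routes lags behind.
    """
    a = Router("A", fib={PREFIX: "B"})
    b = Router("B", fib={PREFIX: "upstream"})
    if not wait_for_fib:
        # A advertised its new best path before installing it; B (assumed to
        # prefer Level3-learned routes) now forwards toward A.
        b.fib[PREFIX] = "A"
    return {"A": a, "B": b}


if __name__ == "__main__":
    print(trace(snapshot(wait_for_fib=False), "A", PREFIX))  # ['A', 'B', 'A', 'LOOP']
    print(trace(snapshot(wait_for_fib=True), "A", PREFIX))   # ['A', 'B', 'upstream']
```

In this model the loop only lasts until A finishes programming its FIB. A bigger table and a pegged RE CPU would both stretch that window, which fits the change from 500k to 700k routes.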

I started Googling and found this BGP knob: https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/delay-route-advertisements-edit-protocols-group-family-unicast.html

My question: has anyone seen this kind of behavior before? Does my assumption make sense? I would test it further, but customers would be upset. Has anyone used this knob or something similar?


