Thursday, November 5, 2020

Advice on moving to a redundant, multiple-uplink router strategy.

Here is how our network currently looks: https://imgur.com/a/LkEDJNT

We've basically got one Linux-based router with a four-port NIC connected to four different switches; the four ports are bridged into a single interface on the router. We also have a set of switches for internal traffic that are not shown here (and far more servers than shown, but the graph gets too messy).

We've got full BGP tables from two providers, and then the third is an exchange where we are peering with participants.

We are upgrading one of the connections to a 10GbE link and want to move to a redundant router setup. Our goals are to eliminate SPOFs, have some ability to engineer traffic to balance costs across providers, keep things easy to manage (we do not have a lot of routing knowledge, and our bus-factor problem gets worse every time we add new complications to our routing setup), keep costs relatively low (we are running Linux routers), and build something that will not cause us problems when we need to scale.

There seem to be a few ways that we can go:

  1. each router is connected to each of the three providers. This would mean two connections to each provider. This doesn't scale well, and might be cost prohibitive (depending on what the transit providers would charge us for the additional connections).

  2. we divide the transit connections between the routers. If we lose one, the traffic would then go to the other. Can we do this without iBGP? For example, by configuring two static routes on our servers to handle this?

  3. we do the same thing as #2, but with iBGP and a cross-connect between the routers. If I'm not mistaken, we'd need all of our servers to run some internal routing daemon. We'd also probably want our switches to be able to do L3 OSPF, which we currently do not have available (we have L3 RIP capability in the switches).

  4. we have a router on standby, with VRRP to fail-over if the hardware dies, moving the connections to the other router. This would only solve the SPOF for our hardware, and not for the links.
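
One rough way to compare the options above is to enumerate every single router or link failure and count how many providers are still reachable afterwards. The sketch below does that under loud assumptions: the provider names (`transit_a`, `transit_b`, `exchange`) and router names (`r1`, `r2`) are hypothetical labels, the split of providers in option 2/3 is a guess (the post does not say which router would get the exchange), and option 4 is modelled as both routers sharing the same physical links. It ignores switches, cost, and routing convergence entirely, so treat it as a thinking aid rather than a verdict.

```python
# Hypothetical labels for the two transit providers and the exchange.
PROVIDERS = ("transit_a", "transit_b", "exchange")

# Each design maps a router to the provider links that terminate on it.
# "shared" means both routers use the same physical links (option 4:
# the standby takes over the active router's connections).
DESIGNS = {
    "current": {"shared": False, "routers": {"r1": PROVIDERS}},
    "option_1": {"shared": False,
                 "routers": {"r1": PROVIDERS, "r2": PROVIDERS}},
    # Assumed split; the post does not specify it.
    "option_2_3": {"shared": False,
                   "routers": {"r1": ("transit_a", "exchange"),
                               "r2": ("transit_b",)}},
    "option_4": {"shared": True,
                 "routers": {"r1": PROVIDERS, "r2": PROVIDERS}},
}


def single_failures(design):
    """List every single router failure and single link failure."""
    routers = design["routers"]
    failures = [("router", name) for name in routers]
    if design["shared"]:
        # One physical link per provider, regardless of router.
        links = sorted({p for ps in routers.values() for p in ps})
        failures += [("link", p) for p in links]
    else:
        # Each router has its own link to each of its providers.
        failures += [("link", (r, p)) for r, ps in routers.items() for p in ps]
    return failures


def reachable(design, failure):
    """Return the set of providers still reachable after one failure."""
    kind, target = failure
    up = set()
    for router, providers in design["routers"].items():
        if kind == "router" and target == router:
            continue  # this router is down, none of its links count
        for provider in providers:
            if kind == "link":
                if design["shared"] and target == provider:
                    continue
                if not design["shared"] and target == (router, provider):
                    continue
            up.add(provider)
    return up


def worst_case(design):
    """Fewest providers left reachable after any single failure."""
    return min(len(reachable(design, f)) for f in single_failures(design))


for name, design in DESIGNS.items():
    print(name, "worst case providers reachable:", worst_case(design))
```

In this toy model a router failure in the current setup leaves no provider reachable, while every two-router option keeps at least one. It says nothing about how traffic actually fails over, which is where the iBGP, static-route, and VRRP questions come in.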

It seems like these are the options. Can you suggest others?

From reading posts here, it seems like #3 is really the only viable way to go, but it introduces a lot of additional undesirable factors (switch replacement, complicating management with needing routing daemons on all of our servers, iBGP and OSPF cognitive load).


