Thursday, November 5, 2020

Load balancing via the network - VRFs, route-leaking, PBR, and NAT

Hi everyone,

I'm in the early stages of network design for a new application that my team will be hosting. Due to the latency needs of the application, we are opting to implement the application-specific load balancing (for lack of a better term) within the network infra itself rather than use a dedicated solution like an L7 load balancer (HAProxy, F5, etc.). We have come up with a few ways to accomplish this, but there is disagreement over which is the best solution to move forward with. I'm hoping for some clarity on these design decisions, as I am possibly missing something here.

For the network infrastructure we will be going with a spine/leaf architecture using eBGP between all switches and servers. Customers (and various service providers) will have cross connects with us onto a set of leaf switches (customer leaf) peering via BGP and the application servers will be connected to another set of leaf switches (application leaf) with routing on the host, peering via BGP using FRR.
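Just to make the "routing on the host" piece concrete, each application server would run something along these lines in FRR (the ASNs and addresses below are placeholders for illustration, not our real ones):

  ! frr.conf on an application server - illustrative values only
  router bgp 65101
   neighbor 10.0.1.1 remote-as 65001          ! eBGP session up to the application leaf
   !
   address-family ipv4 unicast
    redistribute connected                    ! advertise the server's locally held service IP(s)
   exit-address-family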

The basic premise is that we want all of our customers to connect to a single IP address, with the actual destination being spread out across a number of servers connected to the fabric. There is an element of network automation here which involves dynamically moving customers between application servers based on application configuration; that may or may not be relevant to this conversation (I can expand upon this more if needed, but I don't think it matters for the design decisions). So far we have discussed 3 ways to accomplish this at the network level and are in disagreement on which is the best solution.

 

Solution 1: VRFs spanning the entire fabric with route leaking at customer leaf

The idea would be to have each customer peering in its own VRF (2x BGP sessions per VRF - likely a couple hundred unique customers connecting at max), as well as each application server peering in its own individual VRF (one BGP session each, with these application servers being the destinations of the "load balancing" - say 20 to start). The application servers themselves will all advertise the same IP address into the fabric using a dummy device, which will be isolated into its own VRF by the application leaf switches. We would then span these VRFs across the entire fabric using 802.1q on the links between the spine/leaf switches, with each VLAN being associated with a VRF.
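To make that a bit more concrete, here is roughly the shape of it if the leaves were Linux/FRR boxes (all names, VLAN IDs, ASNs and addresses below are invented for illustration - our real gear and numbering will differ). Every application server holds the same /32 service IP on a dummy interface and advertises it to its leaf, and the leaf terminates that server's BGP session inside a per-server VRF which is then trunked across the fabric on its own VLAN:

  # On every application server: the shared service IP on a dummy interface
  ip link add svc0 type dummy
  ip link set svc0 up
  ip addr add 192.0.2.10/32 dev svc0           # same /32 on all servers, advertised to the leaf by FRR

  # On the application leaf: one VRF per server, carried on its own 802.1q subinterface
  ip link add APP-SRV-07 type vrf table 1007
  ip link set swp10 master APP-SRV-07          # port facing this application server
  ip link add link swp51 name swp51.107 type vlan id 107
  ip link set swp51.107 master APP-SRV-07      # per-VRF VLAN toward the spine

  ! FRR on the application leaf - this server's session lives only in its VRF
  router bgp 65001 vrf APP-SRV-07
   neighbor 10.0.1.21 remote-as 65101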

At the customer leaf switch we would leak the appropriate routes between the customer VRF and the VRF belonging to the application server that we want them to actually connect to, and things would route cleanly through the fabric within that specific application server's VRF.
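In FRR terms the leak at the customer leaf could be as simple as importing the chosen server's VRF into the customer's VRF (and the customer's VRF back into the server's for return traffic). Again, every name and ASN here is a placeholder:

  ! On the customer leaf: leak routes between CUST-A and APP-SRV-07
  router bgp 65002 vrf CUST-A
   address-family ipv4 unicast
    import vrf APP-SRV-07                      ! customer learns the shared service IP from this server's VRF
   exit-address-family
  !
  router bgp 65002 vrf APP-SRV-07
   address-family ipv4 unicast
    import vrf CUST-A                          ! server VRF learns the customer's prefixes for the return path
   exit-address-family

Moving a customer to a different application server would then just be a matter of re-pointing those two import statements, which is what the automation would drive.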

Pros:

  • Pure routing through the entire fabric
  • Setting ourselves up to easily deal with scenarios where peers that we connect to for other services wish to advertise the same networks to us on each BGP session (currently a pain in our infra)
  • I'm fairly certain this would have the lowest latency as it is simply BGP routing end to end

Cons:

  • Potential limitation on the number of VRFs we can use, tied to VLAN limitations
  • As I understand it, using lots of VRFs can lead to memory issues on your switches

Disclaimer: This is my solution and I am biased towards it as it's my own, so really pick it apart if I'm overlooking something obvious. I think that using pure routing through the entire infrastructure is elegant and a breath of fresh air compared to the years of NAT/PBR cruft that our current network has accumulated, which has turned into an administrative nightmare.

 

Solution 2: DNAT on ingress

The second solution would be to either scrap VRFs completely or to only use VRFs for the customers we peer with. Without using VRFs, we would simply peer into the default table and then DNAT based on source IP address to the correct application server that we want the customer to connect to. If we wanted each customer in an isolated VRF, we would leak the routes between the customer VRF and the default table at the customer leaf switches, where the DNAT would then take place.
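Just to show the shape of the rule (our customer leaves almost certainly won't be Linux boxes doing nftables, so this is purely illustrative, with made-up addresses), the source-based DNAT would be something like one rule per customer mapping the shared IP to whichever server they're assigned to:

  # Illustrative only: customer 203.0.113.0/24 hitting the shared 192.0.2.10 gets sent to server 10.0.1.21
  nft add table ip nat
  nft 'add chain ip nat prerouting { type nat hook prerouting priority -100 ; }'
  nft add rule ip nat prerouting ip saddr 203.0.113.0/24 ip daddr 192.0.2.10 dnat to 10.0.1.21

The automation would rewrite that one rule whenever a customer is moved to a different application server.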

Pros:

  • Simpler configuration, as a single change only needs to be made on the customer leaf switch

Cons:

  • I'm really opposed to using NAT, as I think it's ugly and always seems to make the network more complex than pure routing (the NAT in our current infra is a huge mess)
  • Added latency where the NAT translation takes place

Disclaimer: I really dislike NAT. I understand its importance and where it can be used, but I personally think it should only be used as an absolute last choice in any design decision. This bias may just be because of how much of a mess our current infra is, with hundreds of NAT rules on every firewall/router which always end up causing issues as people add new rules to make something work without considering how NAT rules are processed, breaking 10 rules after it. Unless I can come up with some compelling reasons why this is a bad decision, this is the way I'm being pushed to go, and I'm very unhappy about it. Maybe you guys can change my mind and warm me up to this type of design.

 

Solution 3: VRFs only on leaf switches combined with PBR

This was the original solution we came up with, and I'm not even sure it's viable. The idea would be to only implement VRFs on the leaf switches and to leak routes between customer/application VRFs and the default table as needed, with things being routed through the infrastructure on the default table. Each application server would still be in its own VRF and would advertise the same IP address into the fabric. To get customer traffic to the correct server, we would use PBR on the application leaf switches to send traffic into the correct VRF. Thinking more about this now, I'm not even sure that this would work unless all application servers were connected to the same leaf switch, since the PBR is happening all the way at the edge and traffic from the spine could potentially arrive at an application leaf switch that doesn't contain the correct application VRF...
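For illustration only (again assuming an FRR-style box, which may not match whatever the leaves end up being), the match-and-steer on an application leaf would look roughly like a PBR map that matches the customer's source prefix and forces the lookup into that server's VRF:

  ! On an application leaf: steer this customer's traffic into the right per-server VRF
  pbr-map STEER seq 10
   match src-ip 203.0.113.0/24
   set vrf APP-SRV-07
  !
  interface swp51
   pbr-policy STEER                            ! applied on the spine-facing interface where customer traffic arrives

But as noted above, that only helps if the leaf applying the policy actually hosts the APP-SRV-07 VRF, which is exactly the part I'm unsure about with multiple application leaves.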

Pros:

  • Eliminating spanning the VRFs across the entire fabric reduces the complexity of the configuration between the spine/leaf switches

Cons:

  • I'm not sure how much (if any) latency mixing in PBR would add vs using pure BGP
  • As stated above, I don't even know if this is a viable design choice given that there would be multiple application leaf switches

 

I know that the above is a lot to take in and want to thank anyone who is still reading this far. Leaving aside all of the interesting automation considerations that we are looking at to tie the application mappings to network changes (which I can explain further if it helps to understand what we are trying to accomplish), what are your thoughts on the above three designs? Am I being too biased towards my own design, hating too much on NAT where it may be a sensible option in this case, or do you guys see any glaring issues that I'm missing?

 

Thank you for any advice/help!


