Wednesday, July 18, 2018

Fortigate VRRP issue (kinda long)

I'm still investigating this, but I thought I'd post here to see if anyone has an idea of what's going on.

I have two Fortigates at the edge of my network. They are BGP peered with providers on the WAN side, and the internal network is pretty simple. Here are some details about the setup:

  • There is a VRRP address configured on the inside, which any servers that talk to the internet use as a default gateway.

  • They are not configured as an HA "cluster" with FGCP, but have some HA features enabled, such as session synchronization.

  • They are intentionally set up to allow asymmetric routing, and this works fine: if I move our public IPs to the secondary with BGP but leave the gateway address on the primary, everything works as intended.

  • They each use a redundant interface to connect to a pair of leaf switches (not the same pair).

So the problem is, when I change the VRRP priority on either unit to move the gateway IP to the secondary, shit breaks. I can no longer reach the internet through that gateway, and the gateway IP doesn't respond to pings. It's like a black hole. If I check the ARP table on the server and trace the gateway MAC, it goes through the correct switches and reaches the intended unit.
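For anyone repeating the ARP check, here is a minimal sketch of how to work out which MAC the server should be holding for the gateway. It assumes the units are answering with the standard VRRP virtual MAC (00:00:5e:00:01 followed by the VRID for IPv4); if the virtual MAC feature is not in use, the ARP entry would be the unit's physical MAC instead. The function names and the VRID of 1 are placeholders, not taken from my config.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Return the well-known VRRP virtual router MAC for a given VRID."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    # IPv4 groups use 00:00:5e:00:01:XX, IPv6 groups use 00:00:5e:00:02:XX,
    # where XX is the VRID in hex.
    family = 0x02 if ipv6 else 0x01
    return f"00:00:5e:00:{family:02x}:{vrid:02x}"


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and unify separators so ARP output compares cleanly."""
    digits = "".join(c for c in mac.lower() if c in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def is_vrrp_mac(arp_mac: str, vrid: int) -> bool:
    """True if the MAC from the ARP table is the IPv4 virtual MAC for this VRID."""
    return normalize_mac(arp_mac) == vrrp_virtual_mac(vrid)


if __name__ == "__main__":
    # Hypothetical VRID; substitute the one configured on the interface.
    print(vrrp_virtual_mac(1))
```

If the MAC in the server's ARP table matches, the server is at least aiming at the virtual router and not at a stale physical address.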

A packet capture on the unit shows the ICMP echo requests arriving, but no replies. I have verified there are no firewall or routing issues. When using "debug flow trace" to troubleshoot, there is no output when filtered on the VRRP IP or on the machine I'm pinging from. It's as if the traffic is being dropped by the interface before the kernel sees it.

However, where it gets really strange is when I change the priority back to move the IP back to the primary unit. The secondary immediately starts responding to pings, but only for a couple of seconds: basically long enough for the gratuitous ARPs to propagate and move the IP back over. And then of course, everything continues to work like normal once the IP is back on the primary.
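For reference, this is roughly how I understand the election that a priority change triggers. It is only a sketch of the standard VRRP rule (highest priority wins, ties go to the higher interface address), and it assumes preemption is enabled and both units are up and exchanging advertisements. The unit names, priorities, and addresses are made up for illustration.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VrrpRouter:
    name: str
    priority: int          # 1-254 for backups; 255 is reserved for the address owner
    address: IPv4Address   # primary IP of the interface running VRRP


def elect_master(routers: list[VrrpRouter]) -> VrrpRouter:
    """Pick the router that should hold the virtual IP.

    Highest priority wins; on a tie, the higher interface address wins.
    """
    if not routers:
        raise ValueError("need at least one router")
    return max(routers, key=lambda r: (r.priority, r.address))


if __name__ == "__main__":
    # Hypothetical values, not my real config.
    primary = VrrpRouter("primary", 200, IPv4Address("192.0.2.2"))
    secondary = VrrpRouter("secondary", 100, IPv4Address("192.0.2.3"))
    print(elect_master([primary, secondary]).name)

    # Raising the secondary's priority above the primary's moves the gateway IP.
    promoted = VrrpRouter("secondary", 250, IPv4Address("192.0.2.3"))
    print(elect_master([primary, promoted]).name)
```

Whichever unit wins is the one that should send the gratuitous ARPs and start answering for the gateway IP.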

Am I missing something stupid/obvious? Or does this sound like a more complex issue with the network?


