Wednesday, June 9, 2021

6 remote sites but only 5 remain online at any given time

Hi all - first post here!

I've been troubleshooting a strange error for the past week which I'm calling an exclusive token ring issue. I have 6 remote sites, only 5 of which come online at any given time. If I reboot the offline one, it will come online and then another random site will slowly die. More details below...

I have a Palo Alto FW acting as the route point for 8 different VLANs. I'm sending 7 of those VLANs down a trunk with the 8th acting as a native VLAN. Next stop is a KG-175X encryptor (with an IP on the native VLAN), which is shipping each of those 7 VLANs out to each of my 6 remote sites using a manual PPK SA (with an appropriate PPK chain and multicast group configured).

Next stop for each remote site is a KG-175G encryptor (with an IP on the native VLAN) with the same manual PPK SA (with the same PPK chain and multicast group configured), which then ships all 7 VLANs down to a 2960x (with a management address on the native VLAN) which distributes them out to machines on each VLAN at each remote site.

I've purposely omitted detail on my black side infrastructure as I'm pretty confident there's no issue there. The multicast group on my black side/core infrastructure looks good - I see each encryptor's black side IP join the group with the appropriate RP, etc.

However, on my red side, only 5 of the 6 sites are up at any given time. If I reboot the offline site's KG and switch, that site will come online, and then another site will slowly die. It feels like some kind of weird PPK/multicast limit, but according to the manufacturer's documentation, I should be able to push something like 64 VLANs through those encryptors without issue.
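One way to pin down the "one site comes up, another slowly dies" pattern is to log reachability per site and look at when each state change happens. The sketch below is only an illustration: the `find_transitions` helper, the sample format, and the site names are hypothetical, and it assumes the samples come from some periodic check (for example, a ping to each site's switch management address).

```python
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, site name, reachable?) - hypothetical sample format
Sample = Tuple[float, str, bool]


def find_transitions(samples: Iterable[Sample]) -> List[Tuple[float, str, str]]:
    """Return (timestamp, site, 'up' or 'down') each time a site changes state."""
    last: Dict[str, bool] = {}
    events: List[Tuple[float, str, str]] = []
    # Sort by timestamp so out-of-order samples are handled consistently.
    for ts, site, reachable in sorted(samples):
        if site in last and last[site] != reachable:
            events.append((ts, site, "up" if reachable else "down"))
        last[site] = reachable
    return events


if __name__ == "__main__":
    # Made-up samples: site2 recovers at t=60, site1 drops at t=120.
    demo = [
        (0, "site1", True), (0, "site2", False),
        (60, "site1", True), (60, "site2", True),
        (120, "site1", False), (120, "site2", True),
    ]
    for event in find_transitions(demo):
        print(event)
```

If the "down" event for one site consistently follows the "up" event for another by a similar interval, that interval can be compared against timers on the encryptors, switches, and firewall.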

Intra-VLAN pings between machines at the local site work. Today, I hooked up a tap to various ports on the 2960x - I can see my inter-VLAN pings coming in to the offline site, but none of the 6 machines (all Windows 10) attached to the various VLANs on that switch will respond to them, even with static addresses set. As a kicker, machines will randomly work at various times throughout the day, then die again without warning, as mysteriously as they came alive. The switch at the offline site has learned the MAC address of the Palo Alto VLAN interface serving as the gateway for each of the VLANs, but it's almost as if the machines themselves haven't, and so they don't bother sending any inter-VLAN traffic out.
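To test the suspicion that the machines don't know their gateway's MAC, one option is to capture `arp -a` output on an affected Windows 10 machine and check whether the gateway IP has an entry. The sketch below is a minimal parser for that output; the `gateway_arp_entry` helper and all IP/MAC values in the sample are hypothetical placeholders, not values from this network.

```python
import re
from typing import Optional, Tuple

# Matches entry lines of Windows "arp -a" output, e.g.
#   10.1.20.1             00-1b-17-00-01-10     dynamic
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+(\w+)\s*$"
)


def gateway_arp_entry(arp_output: str, gateway_ip: str) -> Optional[Tuple[str, str]]:
    """Return (mac, entry_type) for gateway_ip, or None if there is no entry."""
    for line in arp_output.splitlines():
        match = ARP_LINE.match(line)
        if match and match.group(1) == gateway_ip:
            return match.group(2).lower(), match.group(3)
    return None


if __name__ == "__main__":
    # Made-up sample output; addresses are placeholders.
    sample = """
Interface: 10.1.20.15 --- 0x5
  Internet Address      Physical Address      Type
  10.1.20.1             00-1b-17-00-01-10     dynamic
  10.1.20.255           ff-ff-ff-ff-ff-ff     static
"""
    print(gateway_arp_entry(sample, "10.1.20.1"))
    print(gateway_arp_entry(sample, "10.1.20.254"))
```

A `None` result while the site is offline would point toward ARP requests or replies not making it across the tunnel, rather than toward a problem on the Windows machines themselves.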

I know this is long, but I'm kind of at the end of my rope here - has anyone seen anything like this or does anyone have any ideas?


