We are a municipality with 16 city buildings connected to a central WAN core switch via private fiber. We recently migrated our WAN routing from OSPF to BGP. Things got a little messy and we identified the issue, but whenever I attempt to fix it, things get ... weird. Let me explain...
Each remote site has an L3 core switch where the local building VLANs reside, and a point-to-point transit VLAN back to our WAN core, which it peers with over BGP.
The remote site configurations are nearly identical across the board. Each has a local VLAN 1, a local VLAN 50, and the point-to-point VLAN to the WAN core, whose VID is different at each site. The local VLANs are not tagged on the uplink to the WAN core; only the WAN VLAN is. The local data and VoIP subnets are redistributed into BGP, and the remote site core switches are doing IP routing.
For example:
City Hall Core
VLAN 1 (Data) - 10.20.1.0/24, tagged on trunk to distribution switch
VLAN 50 (VOIP) - 172.20.1.0/24, tagged on trunk to distribution switch
VLAN 201 (CITY HALL TO WAN) - 10.255.254.1/29, tagged on trunk to WAN core
DPW Core
VLAN 1 (Data) - 10.20.2.0/24
VLAN 50 (VOIP) - 172.20.2.0/24
VLAN 202 (DPW TO WAN) - 10.255.254.9/29
Then, on the WAN core, we have only the PTP VLANs:
VLAN 201 - 10.255.254.2/29, tagged on trk1
VLAN 202 - 10.255.254.10/29, tagged on trk2
We also have a VLAN 1 here, say 192.168.0.0/24, which is used for some servers and firewalls.
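To sanity-check the transit addressing above, here is a minimal Python sketch using the standard `ipaddress` module. It only confirms that both ends of each point-to-point VLAN sit in the same /29 and that no two transit subnets overlap. The `TRANSIT_LINKS` table and the function names are my own illustration, not something pulled from the switches, and it only covers the two example sites.

```python
import ipaddress

# Transit addresses copied from the examples above: (site side, WAN core side).
# Only two of the 16 sites are listed; the table layout is an assumption.
TRANSIT_LINKS = {
    201: ("10.255.254.1/29", "10.255.254.2/29"),   # City Hall
    202: ("10.255.254.9/29", "10.255.254.10/29"),  # DPW
}

def transit_subnet(site_addr, core_addr):
    """Return the subnet shared by both ends, or raise ValueError if they differ."""
    site = ipaddress.ip_interface(site_addr)
    core = ipaddress.ip_interface(core_addr)
    if site.network != core.network:
        raise ValueError(f"{site_addr} and {core_addr} are not on the same subnet")
    return site.network

def overlapping_links(links):
    """Return pairs of VLAN IDs whose transit subnets overlap."""
    nets = {vid: transit_subnet(*ends) for vid, ends in links.items()}
    vids = sorted(nets)
    return [(a, b) for i, a in enumerate(vids) for b in vids[i + 1:]
            if nets[a].overlaps(nets[b])]

for vid, ends in TRANSIT_LINKS.items():
    print(vid, transit_subnet(*ends))
print("overlaps:", overlapping_links(TRANSIT_LINKS))
```

For the two example links this reports 10.255.254.0/29 and 10.255.254.8/29 with no overlaps, so the transit addressing itself looks consistent.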
I'm aware that using VLAN 1 is bad practice, and that reusing the same VLAN IDs at each building is also bad practice. (That's the way I inherited this network... I intend to change each LAN and VOIP VLAN to a unique VID.) It's important to consider, though, that this was not an issue prior to the migration.
After completing the migration, we started to notice clients were receiving DHCP leases from the wrong building. After looking into it, I found that one of the technicians working on the remote sites had accidentally untagged VLAN 1 on the WAN trunks for some of those sites (on the WAN core side as well). That explains why clients were getting DHCP from other buildings: VLAN 1 had become a single connected VLAN across those trunks.
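One way to spot the wrong-building leases is to compare each client's leased address against the data subnet of the building it physically sits in. The sketch below is only an illustration: `SITE_SUBNETS` holds the two example subnets from above, and the sample leases are made up, not real lease data.

```python
import ipaddress

# Data subnets from the examples above; a real table would list all 16 sites.
SITE_SUBNETS = {
    "City Hall": ipaddress.ip_network("10.20.1.0/24"),
    "DPW": ipaddress.ip_network("10.20.2.0/24"),
}

def site_for_address(addr):
    """Return the site whose data subnet contains addr, or None if no match."""
    ip = ipaddress.ip_address(addr)
    for site, net in SITE_SUBNETS.items():
        if ip in net:
            return site
    return None

def misplaced_leases(leases):
    """leases: iterable of (physical_site, leased_ip).

    Return (physical_site, leased_ip, owning_site) for every lease whose
    address belongs to a different building's subnet.
    """
    return [(site, ip, site_for_address(ip))
            for site, ip in leases
            if site_for_address(ip) != site]

# Made-up sample: the second client sits at DPW but holds a City Hall address.
SAMPLE_LEASES = [("City Hall", "10.20.1.57"), ("DPW", "10.20.1.112")]
print(misplaced_leases(SAMPLE_LEASES))
```

With the made-up sample, only the DPW client holding 10.20.1.112 is flagged, since that address belongs to City Hall's subnet.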
So that got ugly, but it seemed like a straightforward fix - I just had to go in and ensure that VLAN 1 and VLAN 50 were neither untagged nor tagged on any of the WAN trunks.
So, I started to do that - I found the first problematic site and made the config changes accordingly. When I did that, though, all of a sudden chaos ensued. VLAN 1 at one site was not able to route to VLAN 1 at another site. It's almost like the switches were still trying to route traffic over the connected VLAN 1 instead of routing it over BGP. I tried reloading the remote switches and clearing the ARP tables, with no luck.
The routing tables all looked fine after I made that change - the local subnets were no longer showing as connected and were instead showing up in the BGP routing table. Despite that, VLAN 1 at one building was still not able to reach VLAN 1 at the other site. It's inconsistent and seems to affect only certain IP addresses within the subnet.
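The check I have in mind for each WAN trunk can be sketched roughly like this. The `TRUNKS` dictionary is a hypothetical snapshot that I would fill in by hand from the switch configs (it is not output from any real command), and the stray untagged VLAN 1 on trk2 is a made-up example of the mistake described above.

```python
# Hypothetical snapshot of VLAN membership on the WAN core trunks.
# "transit" is the only VLAN that is supposed to be present on each trunk.
TRUNKS = {
    "trk1": {"transit": 201, "tagged": {201}, "untagged": set()},
    "trk2": {"transit": 202, "tagged": {202}, "untagged": {1}},  # made-up mistake
}

def stray_vlans(trunk):
    """VLANs on the trunk (tagged or untagged) other than the transit VLAN."""
    return (trunk["tagged"] | trunk["untagged"]) - {trunk["transit"]}

def audit(trunks):
    """Return {trunk name: sorted stray VLAN IDs} for trunks that need fixing."""
    return {name: sorted(stray_vlans(t))
            for name, t in trunks.items()
            if stray_vlans(t)}

print(audit(TRUNKS))
```

Here only trk2 is reported, with VLAN 1 as the stray. The same check would have to be run against the remote side of each trunk too, since the mistake was made on both ends.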
I feel like if I could shut down the entire city, fix up every config, and then turn the city back on, all these issues would clear up. But that's not an option, as we have police, fire and 911 working 24/7.
I'm really just looking to get some insight here. What am I missing? Why would these issues persist after essentially isolating the local VLANs as they should be?
If anyone cares, I'm happy to send over the actual configs and documentation with all the relevant info.
Thanks in advance...