Friday, February 22, 2019

Aruba controller trunk port changing to access causing outage

Hi,

We have Aruba wireless with controllers split across two core routers in data centres with L2 connection between both data centres. Our IP addressing is handled on the core routers rather than the controllers and we use VRRP between the two DCs. We recently needed to create a new VLAN for wireless clients and a new subnet on the core network for these users. We created the new VLAN on our core network and the IP interfaces and VRRP setup, then tagged the controller uplinks with the new VLAN. Basically followed current design we have not trying to do anything different.

All was well, then we created the VLAN on the controllers and then once we added the new VLAN to the list of allowed trunk VLANs on the uplink port channel on one of our controllers everything died for clients connected via APs terminating on that controller (SSIDs were still broadcasting etc). We tried to troubleshoot as best we could to find the cause but ultimately ended up reverting back by deleting the VLAN off the controllers and disabling the VLAN on the core routers and then it came back.

We noticed two main things when the outage occurred, one of our VRRP instances failed over when we made the change but we are not sure why. The other is that when the outage was occurring we could still connect to WiFi and get access over a particular VLAN. The strange thing is when we look at our config on the controllers in the CLI, this VLAN is listed as the access port default VLAN for the uplink port channel (1005 in the example below).

interface port-channel 0 add gigabitethernet 0/0/2 add gigabitethernet 0/0/3 add gigabitethernet 0/0/4 add gigabitethernet 0/0/5 trusted trusted vlan 1-4094 switchport mode trunk switchport access vlan 1005 switchport trunk allowed vlan xxx, xxx-xxx, xxx

The switchport mode is trunk but it is as if when we make the change it flips to being access but still shows as being trunk in the GUI and CLI. The VLAN in question is one that I added a while back in a similar change in which we also had a very brief outage but by the time it was reported it had resolved itself without any intervention so was not able to diagnose anything. Checking our other controllers it appears as if the last VLAN created on each is populating this line in the config but no one would have intentionally configured it that way.

There were a load of logs and errors at the time on the controllers and core but so far our support partner has been basically useless and wants to close our ticket because the outage is over rather than find the root cause. As this has occurred twice now we are very reluctant to make any changes to the controllers before we know what is up.

Any help appreciated. Thanks.



No comments:

Post a Comment