Thursday, March 28, 2019

Phones randomly dropping VLAN tags via LLDP-MED

I've resisted making a post about this, but I think it's time to ask the giant brains of /r/networking for help because Google has failed me and, support-wise, everyone is pointing at everyone. I apologize for the length; I'm trying to preempt questions where possible.

tl;dr Our Shoretel phone system - or our switches - has randomly started dropping VLAN tags from the phones. It's either a config issue on the phones, or a bug in the switch.

Question: Has anyone seen this behavior before?

Switches: Netgear M4300 Prosafe switches (12.0.7.10)

Phones: various Shoretel models (mostly 230g and 480g)

I use a standard config to separate our phone/LAN VLANs.

    #show running-config interface 1/0/1

    !Current Configuration:
    !
    interface 1/0/1
    voice vlan 10
    switchport mode access
    switchport access vlan 20

Assume:

VLAN 10 = Voice = 10.10.10.0/24

VLAN 20 = LAN = 192.168.20.0/24

The config works well and does what you'd expect:

  1. Phones receive the voice VLAN tag for VLAN 10 when connected to the switch
  2. They request a DHCP lease on VLAN 10 via IP helper and pick up option 156, which supplies the FTP server info and tells them to tag themselves for VLAN 10
  3. Download config
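For anyone who wants to sanity-check what the phones are being handed in step 2, here is a minimal Python sketch that splits an option 156 string into its fields. It assumes the usual comma-separated key=value layout (ftpservers, layer2tagging, vlanid); the example value and the FTP server address 10.10.10.5 are hypothetical, not taken from my DHCP scope.

```python
def parse_option_156(value: str) -> dict:
    """Split a comma-separated key=value option 156 string into a dict."""
    settings = {}
    for field in value.split(","):
        field = field.strip()
        if not field:
            continue  # tolerate stray or trailing commas
        key, _, val = field.partition("=")
        settings[key.strip().lower()] = val.strip()
    return settings


# Hypothetical option string; the FTP server address is made up.
example = "ftpservers=10.10.10.5, layer2tagging=1, vlanid=10"
opts = parse_option_156(example)

# The phone should only tag itself when layer2tagging is 1.
tagging_on = opts.get("layer2tagging") == "1"
voice_vlan = int(opts["vlanid"])
```

If tagging_on is False or voice_vlan is not 10 for the string your DHCP server actually sends, the problem would be on the DHCP side rather than in LLDP-MED.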

Everything works fine...

    #show lldp remote-device 1/0/1

    LLDP Remote Device Summary

    Local
    Interface  RemID    Chassis ID            Port ID             System Name
    ---------  -------  --------------------  ------------------  ------------------
    1/0/1      282      10.10.10.10 <-- Good! 00:10:49:xx:xx:xx   Serial Number: ...

    #show lldp med remote-device detail 1/0/1

    Local Interface: 1/0/1
    Remote Identifier: 282

    Capabilities
    MED Capabilities Supported: capabilities, networkpolicy, location, extended-pd
    MED Capabilities Enabled: capabilities, networkpolicy, extended-pd
    Device Class: Endpoint Class III

    Network Policies
    Media Policy Application Type: Voice
    VLAN ID: 10 <---- Good!
    Priority: 5
    DSCP: 46
    Unknown: False
    Tagged: TRUE <---- Good!

...until it doesn't. At random, a phone just drops the VLAN tag entirely. I cannot find a root cause for this behavior.

During boot-up, the LLDP-MED output remains the same as above, and then it just... drops the VLAN tag altogether.

    #show lldp remote-device 1/0/1

    LLDP Remote Device Summary

    Local
    Interface  RemID    Chassis ID            Port ID             System Name
    ---------  -------  --------------------  ------------------  ------------------
    1/0/1      282      192.168.20.20 <-Bad!  00:10:49:xx:xx:xx   Serial Number: ...

    #show lldp med remote-device detail 1/0/1

    Local Interface: 1/0/1
    Remote Identifier: 282

    Capabilities
    MED Capabilities Supported: capabilities, networkpolicy, location, extended-pd
    MED Capabilities Enabled: capabilities, networkpolicy, extended-pd
    Device Class: Endpoint Class III

    Network Policies
    Media Policy Application Type: Voice
    VLAN ID: 0 <-------- Bad!
    Priority: 5
    DSCP: 46
    Unknown: False
    Tagged: False <------ Bad!
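Since the only symptom is in that detail output, one way to catch a drop early is to scrape the text and flag any port whose voice policy no longer shows VLAN 10 and Tagged: TRUE. The Python below is a rough sketch that only parses text already captured from the switch; how you collect it (SSH, a scheduled script, copy/paste) is up to you, and the function names are my own.

```python
import re


def voice_policy(detail_output: str) -> dict:
    """Pull the VLAN ID and Tagged fields out of the LLDP-MED detail text."""
    vlan = re.search(r"VLAN ID:\s*(\d+)", detail_output)
    tagged = re.search(r"Tagged:\s*(\w+)", detail_output)
    return {
        "vlan_id": int(vlan.group(1)) if vlan else None,
        "tagged": tagged.group(1).lower() == "true" if tagged else None,
    }


def tag_dropped(detail_output: str, expected_vlan: int = 10) -> bool:
    """True when the phone is not advertising a tagged voice VLAN.

    Missing fields are also reported as a drop, so an empty capture
    is not silently treated as healthy.
    """
    policy = voice_policy(detail_output)
    return policy["vlan_id"] != expected_vlan or policy["tagged"] is not True


# Trimmed copies of the two outputs shown in this post.
good = "Application Type: Voice\nVLAN ID: 10\nPriority: 5\nUnknown: False\nTagged: TRUE\n"
bad = "Application Type: Voice\nVLAN ID: 0\nPriority: 5\nUnknown: False\nTagged: False\n"

good_dropped = tag_dropped(good)
bad_dropped = tag_dropped(bad)
```

Here good_dropped comes out False and bad_dropped comes out True. Run against every port on a schedule, this would at least timestamp when a phone flips, which might help correlate the drop with something else on the switch.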

I've tried the following Very Basic troubleshooting steps:

  • Factory resetting the phone
  • Bouncing the switch port
  • Disabling LLDP-MED on the switch and re-enabling it

However, if the switch is rebooted, the issue resolves itself. Everything comes back up fine. This is obviously not a preferable fix, made worse by the fact that we have some server clusters that crash if the switch they're connected to reboots, which involves MORE work to power off VMs and then re-enable them. And I can only stay up past midnight rebooting random switches to fix a single stray phone for a certain period of time before I have a mental breakdown.

The chatbot I spoke to said updating to 12.0.7.12 "may" fix the issue, but they have no idea. I've updated a couple of affected switches as a test bed. So far, no issues, but it's only been a couple of days - this doesn't seem to occur until at least a few days after a reboot.

I'm leaning towards it being a bug in the switch but my debugging options are limited on the switch side with relation to LLDP. I'm going to see about enabling syslog on the affected phone(s) so I can see what happens on that side with the VLAN tag, assuming a pertinent log entry even exists.
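For the phone-side syslog, a throwaway UDP listener is probably enough to start with. This is a bare-bones Python sketch using the standard socketserver module; the keyword list is purely a guess at what a relevant message might contain, since I don't yet know what the phones actually log, and binding to port 514 normally needs elevated privileges.

```python
import socketserver

# Guessed keywords; adjust once real phone log messages are known.
KEYWORDS = ("vlan", "lldp", "802.1q", "dhcp")


def is_relevant(message: str) -> bool:
    """Case-insensitive check for any keyword of interest."""
    lowered = message.lower()
    return any(word in lowered for word in KEYWORDS)


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        message = data.decode("utf-8", errors="replace").strip()
        if is_relevant(message):
            print(f"{self.client_address[0]}: {message}")


def listen(host: str = "0.0.0.0", port: int = 514) -> None:
    """Block forever, printing matching syslog messages as they arrive."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling listen() on a box the phones can reach, and pointing the phone's syslog target at it, should show whether the phone logs anything at the moment the tag disappears.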

I am aware of other implementation methods if this issue cannot be resolved, so I don't need alternative methods to implement this (thank you, though!). I prefer this method as it's the cleanest config-wise (when it works, which 98% of the time it does), and I will migrate if I have to - this isn't a call center or anything, so at worst it's a minor annoyance. Most importantly, the phone system is still functional and these incidents are incredibly isolated - maybe three or four every couple of weeks.

What I would like to know is: why is this happening? Has anyone run into this kind of issue?

You folks are great and I enjoy reading and being part of this community. I hope someone has an idea because I am out of them at the moment.


