Tuesday, August 4, 2020

Dropping packets on specific VLAN only

I'll preface this post with the fact that I am a sysadmin, not a network engineer. I'm pretty comfortable when it comes to switching, but I rarely deal with configuring or troubleshooting routing.

I have an issue where only my server VLAN, VLAN 100, drops traffic to the internet and to our VPN clients. If I run a continuous ping, it drops 30% or more of the packets over the course of 5 minutes. It will ping 8.8.8.8 successfully 4-5 times, then fail, sometimes twice, then succeed once, then fail again. This only happens from devices in VLAN 100. If I take a test laptop, plug it into one of our rack switches (Cisco 3650), and put it on VLAN 100, I get the packet loss described above. If I switch that same laptop to VLAN 42 (our client VLAN) on the same switch port, there is no packet loss to the internet.
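For anyone who wants to reproduce the measurement, here is a minimal Python sketch of the 5-minute ping test. The function names (`ping_once`, `summarize`, `run_test`) are made up for illustration, and it assumes the system `ping` command exits with 0 on a reply, which is not guaranteed on every platform (Windows in particular can report success for some "unreachable" replies).

```python
import platform
import subprocess
import time
from itertools import groupby


def ping_once(host, timeout=2.0):
    """Send one echo request with the system ping; True if it exits with 0."""
    # Windows uses -n for the packet count, Linux and macOS use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        done = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # No answer within the timeout counts as a lost packet.
        return False
    return done.returncode == 0


def summarize(results):
    """Return (loss_percent, longest_failure_run) for a list of booleans."""
    if not results:
        return 0.0, 0
    lost = results.count(False)
    # groupby collapses consecutive equal values, so each failed run is one group.
    longest = max(
        (len(list(run)) for ok, run in groupby(results) if not ok),
        default=0,
    )
    return 100.0 * lost / len(results), longest


def run_test(host, seconds=300, interval=1.0):
    """Ping host roughly once per interval for the given duration."""
    results = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        results.append(ping_once(host))
        time.sleep(interval)
    return summarize(results)
```

Running `run_test("8.8.8.8")` once from a port in VLAN 100 and once from the same port in VLAN 42 gives two loss percentages that can be compared directly. The longest failure run helps show whether the loss comes in short bursts, like the one or two failed pings described above, or in longer outages.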

Does anyone know what would cause traffic from a specific VLAN to drop this much out to the internet, but not internally?

Internally, I can ping from VLAN 42 to any of our servers in VLAN 100 with no packet loss. The same goes for pinging from VLAN 100 to VLAN 42, VLAN 20, or any of our other internal VLANs: no packet loss. The only internal VLAN where I get packet loss is VLAN 9, our VLAN for VPN-connected users. Once they connect successfully, the internal IP address they get is on VLAN 9. If I ping from VLAN 100 to clients in VLAN 9 on the VPN, I see packet loss (30% in 5 minutes). However, I can ping VPN users in VLAN 9 from VLAN 42 with 0% packet loss. Same thing in the other direction.

I did not configure any of the routing being done on our core switch. This was done by my parent company. They are in another country though, and have been completely unresponsive going on a month now due to COVID.

The issue first presented itself as slowness and locking up of email while on VPN. I noticed it got worse for users that I migrated from Exchange 2010 to Exchange 2016. I believe this is because Exchange 2010 uses MAPI for communication and Exchange 2016 uses HTTPS for its communication, which appears to be much more sensitive to dropped packets and broken communication.

I narrowed it down to VLAN 100 by pinging the mail server from the VPN clients (VLAN 9). I noticed packet loss (30% in 5 minutes). Then I pinged other servers in VLAN 100 and saw the same packet loss.
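The narrowing-down can be written out as a small table of which paths lose packets. This is just a sketch of the reasoning in Python, using only the results described in this post; the helper names are invented for illustration.

```python
# Ping results from this post: (source, destination) -> packet loss seen?
observations = {
    ("VLAN 100", "internet"): True,
    ("VLAN 100", "VLAN 9"): True,
    ("VLAN 9", "VLAN 100"): True,
    ("VLAN 42", "internet"): False,
    ("VLAN 42", "VLAN 100"): False,
    ("VLAN 100", "VLAN 42"): False,
    ("VLAN 100", "VLAN 20"): False,
    ("VLAN 42", "VLAN 9"): False,
    ("VLAN 9", "VLAN 42"): False,
}


def endpoints_in_every_lossy_path(obs):
    """Endpoints that appear in every path where loss was observed."""
    lossy = [set(pair) for pair, is_lossy in obs.items() if is_lossy]
    return set.intersection(*lossy) if lossy else set()


def peers(obs, endpoint, lossy):
    """Endpoints paired with `endpoint` on paths with the given loss state."""
    found = set()
    for pair, is_lossy in obs.items():
        if endpoint in pair and is_lossy == lossy:
            found.update(set(pair) - {endpoint})
    return found


print(endpoints_in_every_lossy_path(observations))
print(peers(observations, "VLAN 100", lossy=True))
print(peers(observations, "VLAN 100", lossy=False))
```

VLAN 100 is the only endpoint common to every lossy path, but it also shows up on clean paths (to VLAN 42 and VLAN 20). So the loss is tied to VLAN 100 combined with two particular destinations, the internet and VLAN 9, not to VLAN 100 as a whole.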

Things I have tried so far with my limited skills and abilities:

  • Tested our AT&T ISP connection. No dropped packets when plugged directly into the AT&T router. No dropped packets when pinging from my workstation on VLAN 42, our normal client VLAN. VPN clients don't ever get disconnected from the VPN itself.
  • Packet captures were done on our firewall with our 3rd party vendor. Packets from the server VLAN out to the internet are being discarded before they even reach the firewall.
  • This led me to believe it's the core switch. I sent "show Technical-info" outputs to my 3rd party support for our Cisco gear, but got no response on what the problem could be. It's been a full week since I have heard from anyone.

It's my firm belief that the core switch is discarding these packets from VLAN 100 to the open internet for some reason or another. It shouldn't be link saturation. We have dual Cisco 6880 switches in a VSS cluster to handle the traffic. We have about 80 VMs, most of them single-app servers not doing much. Plus our monitoring isn't throwing any flags on throughput. Plus it ONLY affects VLAN 100. That's what blows my mind.

What would cause a switch to discard packets on a specific VLAN but only out to the internet?

Thanks for reading. I had to get this out there. I have felt pretty alone in dealing with this issue.


