Friday, December 27, 2019

Weird ESXi Networking Issue

TL;DR

A VM on ESXi can't ping VMs in the same VLAN on another ESXi host, but works fine if vMotioned to a third host. Replacing the virtual NIC (new MAC address) fixed it, but the same issue later reappeared with different source and destination hosts. After replacing the virtual NIC a second time (the VM's third NIC), it now can't connect to VMs on any host in a different VLAN, though it can still reach non-virtualized network devices there, and other VMs connect as expected.

Long version:

I have 4 ESXi 6.5 hosts, each with 2 10Gb uplinks, one to each of 2 Cumulus core switches in MLAG. The core switches are the default gateways for the VLANs. Each VLAN is in its own VRF, and inter-VLAN traffic is routed up to the firewall. We have several VLANs/VRFs; the important ones here are VLANs 1 and 4, which are trunked to all hosts.

Round 1:

I noticed our monitoring system (VM1, SolarWinds on Windows Server 2016), hosted on host 1 in VLAN 4, was unable to reach any VM on host 3 in VLAN 4. It could, however, reach any other VM in VLAN 1 or 4 on hosts 1, 2, and 4, and any VM on host 3 in VLAN 1. Meanwhile, other VMs on host 1 in VLAN 4 could reach everything, including VMs on host 3 in VLAN 4. Watching in Wireshark on VM1, I could see it sending out ARP requests for the IP of VM2 (a VM in VLAN 4 on host 3), and I could see those ARP requests arriving at the core switches (watching via a SPAN port). However, watching Wireshark on several of the target VMs, I never saw any of VM1's ARP requests at all. If I vMotioned VM1 to any other host, everything worked perfectly as expected.
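(Aside: if you want to run the same probe without babysitting Wireshark, here's roughly what that test looks like scripted with Scapy. The target IP and interface name below are placeholders, not my actual addresses.)

```python
# A scripted version of the ARP test above, using Scapy. Run with admin
# rights; on Windows, Scapy needs Npcap installed to send/capture frames.
from scapy.all import ARP, Ether, srp

TARGET_IP = "10.0.4.20"   # stand-in for VM2's address
IFACE = "ens192"          # stand-in for VM1's NIC

# Broadcast a who-has for the target and wait up to 2 seconds for replies.
answered, unanswered = srp(
    Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=TARGET_IP),
    iface=IFACE, timeout=2, verbose=False,
)

for sent, received in answered:
    print(f"{received.psrc} is-at {received.hwsrc}")
if not answered:
    print(f"No ARP reply from {TARGET_IP}, matching the symptom above")
```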

So... weird. After much experimenting and head scratching, I tried removing VM1's virtual NIC and adding a new one (new MAC address, identical IP config). Immediately everything worked as expected.

Round 2:

Weeks later, VM1 got moved to host 4 and immediately lost access to all VMs on host 2 in VLAN 4. Again, access to all other hosts and VLANs worked as expected, no other VMs seemed to experience the issue, and everything worked perfectly if I moved VM1 off host 4. Same behaviour with the ARP requests: I could see them leave VM1 and cross the switch, but never saw them arrive at any target VM. I again deleted its virtual NIC and re-added it, this time as the VMXNET 3 adapter type rather than E1000. Access to VLAN 4 on host 2 started working again, as did VLAN 4 on all other hosts.
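(For the record, the remove-and-re-add dance can be scripted too. Here's a hedged sketch using pyVmomi; the vCenter host, credentials, VM name, and port-group name are all placeholders, and it assumes the VM has exactly one E1000 NIC to swap out.)

```python
# Sketch of swapping a VM's E1000 NIC for a VMXNET3 one via pyVmomi.
# All names and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips cert validation
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()
vm = content.searchIndex.FindByDnsName(datacenter=None,
                                       dnsName="vm1.example.com",
                                       vmSearch=True)

# Queue removal of the existing E1000 NIC.
old_nic = next(d for d in vm.config.hardware.device
               if isinstance(d, vim.vm.device.VirtualE1000))
remove = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.remove,
    device=old_nic)

# Queue a VMXNET3 replacement on the same port group; vCenter
# auto-generates a fresh MAC address for the new device.
new_nic = vim.vm.device.VirtualVmxnet3()
new_nic.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
    deviceName="VLAN4-portgroup")  # placeholder port-group name
add = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
    device=new_nic)

vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[remove, add]))
Disconnect(si)
```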

Round 3:

Five minutes later, I get alerts for everything in VLAN 1. VM1 can't ping any VM on any host in VLAN 1, but can reach everything else (anything not virtualized) in VLAN 1. In Wireshark I can see the packets coming into the switches, going up to the firewall, and coming back into the switches on the correct VLANs. However, watching Wireshark on several of the VMs in VLAN 1, I never see the echo requests arrive. Physical servers and other devices in VLAN 1 have no issue at all.
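(The receive-side check can be scripted as well; here's roughly the capture I was running on the target VMs, again with a placeholder IP for VM1.)

```python
# Receive-side companion to the earlier ARP probe: sniff on a target VM
# for ICMP echo requests from VM1 to confirm whether the packets ever
# make it into the guest at all. Needs admin rights to capture.
from scapy.all import sniff, IP, ICMP

VM1_IP = "10.0.4.10"  # placeholder for VM1's address

pkts = sniff(
    lfilter=lambda p: IP in p and p[IP].src == VM1_IP and ICMP in p,
    timeout=30,
)
print(f"Saw {len(pkts)} ICMP packets from {VM1_IP} in 30 seconds")
```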

What the hell??? I guess the fact that things change when I change the virtual NIC made me think it was some kind of weird layer 2 connectivity issue with ESXi, possibly related to the MLAG load-balancing config somehow? I've gone over and triple-checked both ends of the MLAG uplinks, and don't really see anything that looks unexpected. The ESXi end has one virtual switch with 2 uplinks, and the load-balancing mode is "route based on IP hash". On the switch side, each host's uplinks are in a bond with the CLAG ID matching on both switches and the bond mode set to balance-xor with the layer3+4 hash policy.
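To reason about where a given flow should land, here's a simplified model of the two hash policies in play. Caveat: these approximate the documented behaviour of ESXi's IP-hash teaming and Linux bonding's layer3+4 policy; they are not the exact vendor implementations, and all addresses and ports are placeholders. The point it illustrates: the VM's MAC address appears in neither formula, so a MAC change alone shouldn't move a flow to a different uplink, which is part of what makes this behaviour so strange.

```python
# Simplified, approximate models of the two hash policies in play.
import ipaddress

def esxi_ip_hash(src_ip: str, dst_ip: str, n_uplinks: int = 2) -> int:
    """ESXi "route based on IP hash" (approximation): XOR the 32-bit
    source and destination addresses, modulo the active uplink count."""
    s = int(ipaddress.IPv4Address(src_ip))
    d = int(ipaddress.IPv4Address(dst_ip))
    return (s ^ d) % n_uplinks

def bond_l34_hash(src_ip: str, dst_ip: str,
                  src_port: int, dst_port: int, n_links: int = 2) -> int:
    """balance-xor with layer3+4 (rough model): XOR of the IPs and the
    L4 ports, modulo the number of bond members."""
    s = int(ipaddress.IPv4Address(src_ip))
    d = int(ipaddress.IPv4Address(dst_ip))
    return ((s ^ d) ^ (src_port ^ dst_port)) % n_links

# The VM's MAC appears in neither formula, so replacing the NIC should
# not change which uplink a flow hashes to. That is exactly why the
# NIC swap "fixing" things is so confusing.
print(esxi_ip_hash("10.0.4.10", "10.0.4.20"))               # vSwitch uplink pick
print(bond_l34_hash("10.0.4.10", "10.0.4.20", 49152, 161))  # switch-side pick
```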

Any thoughts?


