Friday, April 19, 2019

Fail Back of redundant Host = 10-20 second outage on Nexus 9K environment. Pointers for my network team?

Working on setting up a new HCI environment and decided to test failover scenarios before going into prod.

Each node has 4 x 10Gb NICs.
2 are for the HyperConverged Storage.
2 are for VMware Mgmt and all the VM traffic and such.
These are all plugged into a pair of Nexus 9K switches.

During the failover (we disconnected the LC fiber on one port from each vSwitch), we get maybe one dropped ping, which is pretty much what you'd expect, as the MAC cache is dumped when the link goes down.

However when we plugged the NICs back in, we saw anywhere from 5 seconds (best) to 15+ seconds (the HCI lost its mind) of downtime.
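To put firmer numbers on those outages when we repeat the test, something like the following could turn a timestamped ping log into outage windows. This is only a minimal sketch: the function name `outage_windows` and the `(timestamp, reply_received)` sample format are my own assumptions for illustration, not the output of any particular ping tool, so the log would need to be parsed into that shape first.

```python
from typing import List, Tuple


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start_time, duration) for each run of failed probes.

    samples: (timestamp_in_seconds, reply_received) pairs in time order.
    An outage starts at the first failed probe and ends at the next
    successful one, so the resolution is limited by the probe interval.
    """
    windows = []
    start = None  # timestamp of the first failed probe in the current run
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts - start))
            start = None
    if start is not None:
        # Still down when the capture ended: measure to the last sample.
        windows.append((start, samples[-1][0] - start))
    return windows


if __name__ == "__main__":
    # Hypothetical one-probe-per-second log with a two-second gap.
    log = [(0.0, True), (1.0, False), (2.0, False), (3.0, True), (4.0, True)]
    for start, duration in outage_windows(log):
        print(f"outage at t={start}s lasting {duration}s")
```

Lining up the start times with the switch logs for the same window should at least show what the Nexus side was doing while the pings were dropping. A shorter probe interval gives a tighter measurement.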

I suspect that it MIGHT be related to MAC Cache (do they still call it a CAM Table?) on the switches, but we're really not sure.

My ask, as we have network guys both in house and on contract, is this: what should we be watching/monitoring on the switches when we repeat the process, so our network guys can see if there is any clue as to why the fail back is taking so long?

PS: VMware 6.5 U2, Standard vSwitch (no vDS), default teaming mode of "Route based on originating virtual port" currently in use, and Notify Switches and Fail Back at the vSwitch teaming level are both enabled (my understanding is that Notify Switches makes the host send a gratuitous frame, RARP rather than GARP on ESXi as far as I know, to expedite the upstream MAC cache updates).

Thanks


