We have a few Windows 2016 hosts (FX2s) clustered, running SQL 2016, and using Windows NIC Teaming to a pair of Nexus 9Ks.
The servers have 16G fibre back ends to the SAN on dedicated OOB 9K fibre switches.
When we run heavy SQL IO operations on the nodes, they get disconnects on the network IO according to the SQL logs, and this causes a cluster shuffle.
We tried with both CSV and non-CSV disks and experienced the same issue.
Windows Setup:
Team 10G NIC 1 and 10G NIC 2
LACP Timer: Fast (Windows default)
Method: MAC ADDRESS
MTU: 9000
One virtual team per vLAN:
100 - LAN
200 - Private Heartbeat
210 - Private Heartbeat
220 - Private CSV Communication
Nexus vPC Setup example:
int po 531
  desc SQL Node 1
  switchport mode trunk
  spanning-tree port type edge trunk
  mtu 912
  vpc 531
int e1/53/1
  desc SQL Node 1 NIC 1
  lacp rate fast
  switchport mode trunk
  spanning-tree port type edge trunk
  mtu 9126
  channel-group 531 mode active
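The state of this port-channel on each vPC peer can be captured with standard NX-OS show commands, ideally while a reindexing job is running. The following is only a sketch of which commands apply, using the port-channel and interface numbers from the config above; output is omitted, and which counters are relevant here is an assumption rather than a known diagnosis.

```
! Run on both vPC peers; numbers match the example config above
show vpc brief
show vpc consistency-parameters interface port-channel 531
show port-channel summary
show lacp counters interface port-channel 531
show interface ethernet 1/53/1
show interface ethernet 1/53/1 counters errors
show queuing interface ethernet 1/53/1
```

Comparing the MTU, LACP and error counters from both peers would show whether the two sides of the vPC see the member links the same way.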
This setup works well when testing as follows:
Allows full 20Gb/s throughput, both inbound and outbound.
Unplugging a link drops to 10Gb/s with no dropped traffic/pings in application.
Plugging a link back in to the port returns to 20Gb/s in/out within 10 to 15 seconds.
However, when the SQL servers run reindexing jobs, which are a fairly heavy load, we get errors that the operation failed due to a TCP/IP error and that the network name isn't reachable, and the nodes move to a new host.
We have also run this test after removing one NIC on each system to eliminate the teaming, and it still fails.
When we tried removing the vPC config and changing Windows to use switch-independent teaming, we did not get this failure.
What could be the cause of the issue, and is it something we might be able to address in the vPC setup?
We would prefer to use the vPC to have the full aggregate bandwidth inbound and outbound if possible.
Thanks for any help on this!