Thursday, January 17, 2019

Win16 LACP team to vPC on 9k dropping under SQL load

We have a few Windows 2016 hosts (FX2s) clustered, running SQL 2016, and using Windows NIC Teaming to a pair of Nexus 9ks.

The servers have 16G fibre back ends to the SAN on dedicated oob 9k fibre switches.

When we run heavy SQL IO operations on the nodes, they are getting disconnects on the network IO according to the SQL logs, and this is causing cluster shuffle.

We tried with both CSV and non-CSV disks and experienced the same issue.


Windows Setup:

Team 10G NIC 1 and 10G NIC 2

LACP Timer: Fast (windows default)

Method: MAC ADDRESS

MTU: 9000

One virtual team per vLAN:

100 - LAN

200 - Private Heartbeat

210 - Private Heartbeat

220 - Private CSV Communication
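For context on the Fast LACP timer listed above: under IEEE 802.1AX, the fast rate means an LACPDU every 1 second with the partner expired after 3 seconds of silence, versus 30 and 90 seconds on the slow rate. A minimal Python sketch of that arithmetic follows; the function and constant names are illustrative only and are not part of Windows or NX-OS.

```python
# LACP periodic and timeout intervals, in seconds, as defined by IEEE 802.1AX.
LACP_TIMERS = {
    "fast": {"periodic": 1, "timeout": 3},
    "slow": {"periodic": 30, "timeout": 90},
}


def lacp_timeout_seconds(rate: str) -> int:
    """Seconds a port waits without receiving LACPDUs before expiring its partner."""
    return LACP_TIMERS[rate]["timeout"]


def pdu_intervals_before_timeout(rate: str) -> int:
    """Number of consecutive periodic intervals that fit inside the timeout."""
    timers = LACP_TIMERS[rate]
    return timers["timeout"] // timers["periodic"]


if __name__ == "__main__":
    for rate in ("fast", "slow"):
        print(rate, lacp_timeout_seconds(rate), pdu_intervals_before_timeout(rate))
```

With the fast rate, a link that misses LACPDUs for about three seconds is taken out of the bundle, so the margin is much smaller than with the slow rate.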


Nexus vPC Setup example:

int po 531

desc SQL Node 1

switchport mode trunk

spanning-tree port type edge trunk

mtu 912

vpc 531

int e1/53/1

desc SQL Node 1 NIC 1

lacp rate fast

switchport mode trunk

spanning-tree port type edge trunk

mtu 9126

channel-group 531 mode active
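Since the port channel and the member interface each carry their own mtu line, one sanity check is to compare the two. The Python sketch below parses the config exactly as pasted above and lists any member whose MTU differs from its port channel. The parser is a hypothetical helper written for this post, not a Cisco tool, and it only understands `int`/`interface`, `mtu`, and `channel-group` lines.

```python
import re

# Config as pasted in this post, blank lines removed.
CONFIG = """\
int po 531
desc SQL Node 1
switchport mode trunk
spanning-tree port type edge trunk
mtu 912
vpc 531
int e1/53/1
desc SQL Node 1 NIC 1
lacp rate fast
switchport mode trunk
spanning-tree port type edge trunk
mtu 9126
channel-group 531 mode active
"""


def parse_interfaces(text):
    """Map each interface name to the mtu / channel_group values found under it."""
    interfaces = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = re.match(r"^int(?:erface)?\s+(.+)$", line)
        if match:
            current = match.group(1).strip()
            interfaces[current] = {}
            continue
        if current is None:
            continue
        match = re.match(r"^mtu\s+(\d+)$", line)
        if match:
            interfaces[current]["mtu"] = int(match.group(1))
        match = re.match(r"^channel-group\s+(\d+)", line)
        if match:
            interfaces[current]["channel_group"] = int(match.group(1))
    return interfaces


def mtu_mismatches(interfaces):
    """Return (member, member_mtu, port_channel, pc_mtu) tuples where the MTUs differ."""
    port_channels = {}
    for name, cfg in interfaces.items():
        match = re.match(r"^(?:po|port-channel)\s*(\d+)$", name, re.IGNORECASE)
        if match:
            port_channels[int(match.group(1))] = (name, cfg.get("mtu"))
    mismatches = []
    for name, cfg in interfaces.items():
        group = cfg.get("channel_group")
        if group in port_channels:
            pc_name, pc_mtu = port_channels[group]
            if cfg.get("mtu") != pc_mtu:
                mismatches.append((name, cfg.get("mtu"), pc_name, pc_mtu))
    return mismatches


if __name__ == "__main__":
    for member, member_mtu, pc, pc_mtu in mtu_mismatches(parse_interfaces(CONFIG)):
        print(f"{member} mtu {member_mtu} != {pc} mtu {pc_mtu}")
```

With the values as posted, the sketch flags e1/53/1 (9126) against po 531 (912). Whether that difference is a transcription slip in this post or is present in the running config would need to be confirmed on the switch.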


This setup works well when testing as follows:

Allows full 20Gb/s throughput, both inbound and outbound.

Unplugging a link drops to 10Gb/s with no dropped traffic/pings in application.

Plugging a link back in to the port returns to 20Gb/s in/out within 10 to 15 seconds.


However, when the SQL servers run reindexing jobs, which are a fairly heavy load, we get errors that the operation failed due to a TCP/IP error and that the network name isn't reachable, and the nodes move to a new host.

We have also run this test after removing one NIC on each system to eliminate the teaming, and it still fails.

When we tried removing the vPC config and changing Windows to use switch-independent teaming, we did not get this failure.

What could be the cause of the issue, and is it something we might be able to address in the vPC setup?

We would prefer to use the vPC to have the full aggregate bandwidth inbound and outbound if possible.

Thanks for any help on this!


