Friday, January 5, 2018

Questions about tuning IOS for iSCSI

I inherited this situation; I know it's wrong :)

Setup:

  • 2x ToR switches, both Catalyst WS-C3560G-48TS-S, running IOS 12.2(35)SE5 fc1 and 12.2(55)SE9 fc1 respectively
  • 4x ESXi 6.0 hosts, which split their NICs roughly 50/50 between ToRs
  • All ESXi hosts allocate a minimum of 2x 1GbE ports to iSCSI
  • 1x Nimble storage array, all NICs split 50/50 between ToRs
  • All iSCSI ports are on a private VLAN and directly connected to ToR switchports. No iSCSI traffic traverses trunks, at least as far as I can tell (see the check just after this list)
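
To sanity-check that last claim, a couple of IOS commands will show which ports carry the iSCSI VLAN and whether any trunk is forwarding it. VLAN 100 below is a placeholder for the actual iSCSI VLAN ID:

    ! list every port assigned to the iSCSI VLAN (should be access ports only)
    show vlan id 100
    ! no trunk's "Vlans allowed and active" list should include the iSCSI VLAN
    show interfaces trunk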

Problems/Symptoms:

  • Symptom reported: performance intermittently craters on certain high-I/O VMs.
  • Network-level issue observed: very high output drops (>600k total, accruing at >3k/hr) on the iSCSI switchports. All other interface counters were zero or normal, and the NMS does not show these interfaces as heavily utilized or saturated. (The snippet after this list shows where I'm reading these counters.)
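
For reference, these counters come straight out of the standard interface output. Gi0/1 is an example port, and the drop count shown is illustrative:

    show interfaces GigabitEthernet0/1 | include output drops
      Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 623412
    ! per-port error counters (CRC, alignment, collisions, etc.), all zero/normal here
    show interfaces GigabitEthernet0/1 counters errors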

I suspected microbursts, per the many excellent posts from /u/VA_Network_Nerd that I found while searching. I also read up on flow control, QoS, and a few other things.

What I did:

  1. Forklift-upgraded the whole stack to 10Gb+... haha, yeah right. This is K-12, and E-Rate isn't what it used to be
  2. Manually balanced high-I/O VMs across ESXi hosts to spread storage traffic across as many NICs as possible
  3. Gave the two "main" ESX hosts two more NICs each in their iSCSI vSwitch (4x1GbE total for each host)
  4. Set DRS to manual so VMs don't end up lopsided again
  5. Enabled Rx flow control on all iSCSI switchports and verified the vmnics now show it enabled (commands shown just after this list)
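
For anyone repeating step 5: on the 3560 it is one line per interface (this platform can only honor pause frames it receives; it cannot send them), and I believe the vmnic side on ESXi 6.0 can be checked with esxcli. Gi0/1 and vmnic2 are example names:

    interface GigabitEthernet0/1
     flowcontrol receive on
    !
    ! verify the negotiated flow control state
    show interfaces GigabitEthernet0/1 flowcontrol

    # on the ESXi host: look for the Pause RX / Pause TX fields in the output
    esxcli network nic get -n vmnic2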

Current situation:
After all of the above, I cleared counters and am now seeing far fewer output drops on the iSCSI interfaces: either zero, or under 100/hr on average. Still waiting on end users for VM performance feedback.
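
To put an hourly rate on the drops rather than eyeballing totals, I've been clearing counters and re-checking at known intervals, roughly like this (interface name is an example):

    clear counters GigabitEthernet0/1
    ! ...wait a while, then:
    show interfaces GigabitEthernet0/1 | include output drops|Last clearing
    ! drops/hr = Total output drops divided by the elapsed time reported on the
    ! "Last clearing of 'show interface' counters" line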

Questions:

  1. I understand output drops usually result from congestion. Is there a way to confirm they definitely aren't coming from some other source, like a hardware fault or a misconfiguration?
  2. Is there a generally acceptable rate of output drops for iSCSI traffic?
  3. As I understand it, QoS only helps when multiple classes of traffic share a link, by deciding which traffic gets dropped first under congestion. In my case, where these ports carry only iSCSI, would QoS provide any benefit?
  4. Based on VMware's defaults and Nimble's deployment considerations, I enabled Rx flow control on all iSCSI interfaces. Is this still advisable?
  5. Might tuning buffers help? (A sketch of what that would look like follows this list.)
  6. Any other suggestions for tuning iSCSI here, other than adding bandwidth?
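
On question 5 specifically, here is the general shape of buffer tuning on a 3560, as a sketch rather than a recommendation; the queue-set values below are illustrative and not tested on this hardware. Note that buffer tuning requires "mls qos" to be enabled globally, which by itself changes the default queueing behavior and can make drops worse until the queue-sets are tuned:

    ! first, find which egress queue is actually taking the drops
    show mls qos interface gigabitethernet0/1 statistics
    !
    configure terminal
     mls qos
     ! illustrative: re-split the egress buffer pool across the 4 queues (values must sum to 100)
     mls qos queue-set output 1 buffers 15 25 40 20
     ! illustrative: raise the drop thresholds and the maximum for queue 3
     mls qos queue-set output 1 threshold 3 400 400 100 400
    end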

