I inherited this situation, I know it's wrong :)
Setup:
- 2x ToRs, both Catalyst WS-C3560G-48TS-S, running IOS 12.2(35)SE5 fc1 and 12.2(55)SE9 fc1 respectively
- 4x ESXi 6.0 hosts, which split their NICs roughly 50/50 between ToRs
- All ESXi hosts allocate at least 2x1GbE ports to iSCSI
- 1x Nimble storage array, all NICs split 50/50 between ToRs
- All iSCSI ports are on a private VLAN and directly connected to ToR switchports. No iSCSI traffic traverses trunks (at least as far as I can tell)
Problems/Symptoms:
- Symptom reported: performance craters intermittently on certain high-I/O VMs.
- Network-level issues observed: very high output drops (>600k total, >3k/hr) on iSCSI switchports. All other interface counters were zero or normal. These interfaces do not show high traffic or saturation in the NMS.
I suspected microbursts per /u/VA_Network_Nerd's many excellent posts I found while searching. I also read about flow control, QoS, and a few other things.
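For anyone wondering how a port can drop packets while the NMS graph looks idle, here is a rough back-of-the-envelope sketch of the microburst idea. Every number in it is an assumption picked for illustration (burst size, a 2 Gb/s fan-in toward a single 1GbE port, a 500 KB egress buffer); none of it is measured from these switches, and the function names are hypothetical.

```python
LINK_BPS = 1_000_000_000  # 1GbE port speed in bits per second


def average_utilization(burst_bytes, bursts, interval_s, link_bps=LINK_BPS):
    """Fraction of link capacity used when averaged over one polling interval."""
    total_bits = burst_bytes * 8 * bursts
    return total_bits / (link_bps * interval_s)


def overflow_bytes(burst_bytes, ingress_bps, egress_bps, buffer_bytes):
    """Bytes lost from one burst that arrives faster than the egress port drains.

    Simplified model: the burst arrives at a constant ingress rate, the port
    drains at a constant egress rate, and the buffer starts empty.
    """
    if ingress_bps <= egress_bps:
        return 0  # the port keeps up, nothing queues
    # Bytes the port manages to send while the burst is still arriving.
    drained = burst_bytes * egress_bps // ingress_bps
    backlog = burst_bytes - drained
    return max(0, backlog - buffer_bytes)


if __name__ == "__main__":
    # Assumed workload: one 2 MB burst per second over a 5-minute poll.
    util = average_utilization(2_000_000, bursts=300, interval_s=300)
    print(f"average utilization over the poll: {util:.1%}")

    # Assumed fan-in: two 1GbE senders converge on one 1GbE port,
    # with a hypothetical 500 KB of egress buffer.
    lost = overflow_bytes(2_000_000, 2_000_000_000, 1_000_000_000, 500_000)
    print(f"bytes dropped per burst: {lost}")  # prints 500000
```

Under those made-up numbers the 5-minute average stays under 2% while every burst still overflows the buffer, which would match "drops but no saturation in NMS".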
What I did:
- Forklift upgraded the whole stack to 10Gb (haha, yeah right: this is K12 and e-rate isn't what it used to be)
- Manually balanced high-I/O VMs across ESXi hosts to spread storage traffic across as many NICs as possible
- Gave the two "main" ESXi hosts two more NICs each in their iSCSI vSwitch (4x1GbE total for each host)
- Set DRS to manual so VMs don't end up lopsided again
- Enabled Rx flowcontrol on all iSCSI switchports and verified vmnics now show it enabled
Current situation:
After all the above, I cleared counters and now see far fewer output drops on the iSCSI interfaces: either zero or < 100/hr on average. Still waiting on end-users for VM performance feedback.
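For reference, the per-hour figures come from simple counter deltas. A minimal sketch of that arithmetic, assuming two manual readings of an interface's output-drop counter taken a known number of seconds apart (the function names and sample values are hypothetical, not pulled from these switches):

```python
def drops_per_hour(first_count, second_count, seconds_between):
    """Average output drops per hour between two readings of the same counter."""
    if seconds_between <= 0:
        raise ValueError("readings must be taken at different times")
    if second_count < first_count:
        # A lower second reading usually means the counters were cleared
        # (or wrapped) in between, so the delta is meaningless.
        raise ValueError("counter went backwards; was it cleared?")
    return (second_count - first_count) * 3600 / seconds_between


def drop_ratio(drops, packets_output):
    """Drops as a fraction of everything the port was asked to send.

    Assumes the 'packets output' counter excludes dropped packets, so the
    offered load is packets_output + drops.
    """
    offered = drops + packets_output
    return drops / offered if offered else 0.0


if __name__ == "__main__":
    # Hypothetical readings: 150 new drops over 90 minutes.
    print(drops_per_hour(0, 150, 90 * 60))
    # Hypothetical: 100 drops against 999,900 packets actually sent.
    print(drop_ratio(100, 999_900))
```

Looking at the ratio alongside the raw rate helps compare a busy port to a quiet one, since 100 drops/hr means something different at very different packet rates.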
Questions:
- I understand output drops usually result from congestion. Is there a way to confirm they are not coming from other sources, such as a hardware fault or misconfiguration?
- Is there a generally acceptable rate of output drops for iSCSI traffic?
- As I understand it, QoS only helps when multiple types of traffic traverse a link, and provides a way to decide which of that traffic gets dropped when congested. In my case, would QoS provide any benefit?
- Based on VMware's defaults, and Nimble's deployment considerations, I enabled Rx flow control on all iSCSI interfaces. Is this still advisable?
- Might tuning buffers help?
- Any other suggestions for tuning iSCSI here, other than adding bandwidth?