Thursday, May 21, 2020

How would you troubleshoot this reset/bandwidth issue?

Hey folks,

Looking for a few perspectives to make sure I'm not falling into confirmation bias here. TL;DR: the server team has trouble replicating the SAN across the WAN and gets resets when they try to use too much bandwidth (they have configured the replication rate within what is physically available). My question to the community: has anyone had Nexus flowcontrol or Brocade flow-control cause resets on traffic? Otherwise, what would you do to isolate the issue?

More context:

Topology generally looks like this:

SITE 1 SAN - SW - SW - FW - SW - Long Haul RTR - Long Haul RTR - SW - FW - SW - SITE 2 SAN

I've looked at interface speeds, and most are 1 Gb/s or faster (10 Gb/s, 40 Gb/s, etc.). However, the WAN connection at Site 2's long-haul router is a 200 Mb/s provisioned circuit, which is the lowest bandwidth on the whole path. Obviously there's more traffic than just this SAN replication, and it's a lot of data to organize logically and reason about.

Generally speaking, the switch/firewall/router configs are relatively simple: VLANs, security zones (SRXs), BGP (long haul). No QoS/CoS, no shaping or prioritization.

The tricky part is that no interface is ever at 100% when they receive these resets. Specifically, the 200 Mb/s bottleneck averages between 40% and 80% utilization, but they still get resets when they dial the replication throughput up too high. I did notice (this network is newish to me) that flow control is enabled on all of the switches. That makes me wonder whether, between that and TCP's native windowing, there will always be some headroom, so I'll never see bandwidth hit 100% but resets will still occur because those mechanisms are trying to control utilization (or would I see a rollercoaster of peaks and resets?).
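
One thing worth ruling out is the averaging itself: a polled average can sit in the 40-80% range while shorter intervals inside the polling window run at line rate. Here is a rough Python sketch of that arithmetic. The per-second samples are made up for illustration, not measurements from this network, and `utilization_summary` is a hypothetical helper; only the 200 Mb/s circuit speed comes from the description above.

```python
LINK_MBPS = 200  # provisioned WAN circuit speed from the post


def utilization_summary(samples_mbps, link_mbps=LINK_MBPS):
    """Return (average %, peak %, number of samples at or above line rate)."""
    avg = sum(samples_mbps) / len(samples_mbps)
    peak = max(samples_mbps)
    saturated = sum(1 for s in samples_mbps if s >= link_mbps)
    return 100 * avg / link_mbps, 100 * peak / link_mbps, saturated


# Hypothetical one-second throughput samples across a 60 s polling interval:
# 50 quiet seconds at 100 Mb/s plus 10 burst seconds at line rate (200 Mb/s).
samples = [100] * 50 + [200] * 10

avg_pct, peak_pct, saturated = utilization_summary(samples)
print(f"average {avg_pct:.1f}%  peak {peak_pct:.1f}%  saturated seconds {saturated}")
# prints: average 58.3%  peak 100.0%  saturated seconds 10
```

If the real counters behave anything like this, the graph would show roughly 58% while the circuit was actually full for ten of those sixty seconds. Polling at a shorter interval, or checking output drop/discard counters on the 200 Mb/s interface, would show whether that is happening here.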

Thanks for any thoughts.


