Friday, April 19, 2019

I think we found a TCP-related problem on our network, but we don't know how to fix it.

So we've been investigating an application performance issue. When the application is used locally, it works 'fine.' When it is used from a WAN location, users experience wait times of two-plus full minutes.

We troubleshot and verified layers 1-3 pretty quickly. It did not appear to be any kind of routing problem: no interface errors or discards, no MTU mismatch, etc.

So we dove into packet captures and found something of interest. The client (a Windows endpoint) sends an initial TCP window size of 8192. That seems... small. Both the client and the server have window scaling enabled, with a scale factor of 8, and the client completes the handshake by changing its window to 65536 (that is because of the scaling factor of 8, I'm assuming?). So the handshake looks like this:

CLIENT: TCP SYN     - WIN 8192  - WS 256 (8)
SERVER: TCP SYN+ACK - WIN 8192  - WS 256 (8)
CLIENT: TCP ACK     - WIN 65536
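If I have the arithmetic right, a window scale option of 8 means the advertised 16-bit window field gets multiplied by 2^8 = 256, so the 65536 Wireshark shows after the handshake would come from a raw field of 256. Here's a quick sketch of that math (the raw field values are my assumption based on what Wireshark displays as the calculated window):

```python
# Sketch of how the TCP window scale option works (RFC 1323 / RFC 7323).
# The raw field values below are my assumption from the Wireshark
# "calculated window" numbers in the handshake above.

def effective_window(raw_window_field: int, scale_option: int) -> int:
    """Receive window in bytes = 16-bit window field shifted left by the scale option."""
    return raw_window_field << scale_option

# SYN: window field 8192. Scaling never applies to the SYN itself,
# so 8192 here really does mean 8192 bytes.
print(effective_window(8192, 0))    # 8192

# Post-handshake ACK: Wireshark shows 65536, implying a raw field of 256.
print(effective_window(256, 8))     # 65536

# What the client could advertise with a full 16-bit field and the same scale.
print(effective_window(65535, 8))   # 16776960 (~16 MB)
```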

So we took the numbers, including bandwidth, latency, MTU, and bytes transferred, along with a window size of 65536, and plugged them into a TCP throughput calculator we found online. The calculator agreed perfectly with what the users were experiencing: about 2 minutes to complete the transfer!
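The math behind those calculators is basically the bandwidth-delay product in reverse: when the receive window is smaller than the BDP, the window (not the link) caps throughput at roughly window / RTT. A minimal sketch of that, with made-up stand-in numbers (a 300 Mbit/s link, 100 ms WAN round trip, ~75 MB transfer) since I'm not posting our actual figures:

```python
# Window-limited TCP throughput: if the receive window is smaller than the
# bandwidth-delay product, throughput tops out at roughly window / RTT.
# The link speed, RTT, and transfer size here are hypothetical stand-ins.

window_bytes   = 65536              # what the client ends up advertising
rtt_seconds    = 0.100              # assume a 100 ms WAN round trip
link_bytes_sec = 300_000_000 / 8    # assume a 300 Mbit/s path
transfer_bytes = 75_000_000         # assume a ~75 MB transfer

throughput = min(link_bytes_sec, window_bytes / rtt_seconds)   # window wins here
print(f"max throughput ~ {throughput * 8 / 1e6:.1f} Mbit/s")   # ~5.2 Mbit/s
print(f"transfer time  ~ {transfer_bytes / throughput / 60:.1f} minutes")  # ~1.9 min
```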

Wow...

So I read up, and the initial TCP receive window of 8192 that the client is advertising does appear to be the issue. That value should be able to start at 65535 and then scale up from there using the scale factor of 8.

When we entered those numbers into the TCP throughput calculator, changing nothing else but the window size, the transfer time dropped to about 2 SECONDS.
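Same back-of-the-envelope math as before, with the same made-up link speed, RTT, and transfer size, just swapping in the bigger window:

```python
# Same hypothetical numbers as above (300 Mbit/s link, 100 ms RTT, ~75 MB),
# but with a 65535-byte window scaled by 2^8 instead of 65536 flat.

rtt_seconds    = 0.100
link_bytes_sec = 300_000_000 / 8
transfer_bytes = 75_000_000

window_bytes = 65535 << 8                                       # ~16 MB window
throughput   = min(link_bytes_sec, window_bytes / rtt_seconds)  # link-limited now
print(f"transfer time ~ {transfer_bytes / throughput:.1f} seconds")   # ~2.0 s
```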

I'd always read about bandwidth-delay product (BDP) and windowing issues, but this was kind of a wake-up call about how big a deal this can really be.

Interestingly enough, we looked around further and noticed that pretty much all of our client endpoints, including my own, are sending this "8192" value in their initial SYN...

At this point we started getting excited, because how often do you find a breakthrough like this in packet captures... something that may have been causing other complaints all along that no one noticed?

So we brought the findings to our Systems guys, and they seemed... unimpressed. They just pointed out that the application works fine locally but is slow when used over the WAN, so the problem must be the network.

So now we're kind of having to prove our case to them, but as I search Google for Microsoft Windows networking material (painful), I'm having a really hard time figuring out exactly HOW to change this value.

A few very old posts talk about 2-3 different registry keys to change, but most of the more recent posts complain that changing the registry keys doesn't actually do anything. It seems that since Windows Vista and all subsequent versions, changing those values has no real effect, because it's all "automatic" now. As partial proof of that, my own registry key has a 65535 value, but my machine is indeed sending the 8192 window size (confirmed in Wireshark).

I did find some chatter about Windows "TCP Scaling heuristics" and how it can cause issues and should be disabled (it is indeed enabled on our endpoints). There is some other chatter about the Windows TCP "auto-tuning level," which has different settings like "Normal," "Restricted," "Experimental," etc.
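For reference while we argue our case, here's how those settings can at least be inspected from a client. This is just a quick sketch that shells out to the netsh interface tcp commands the posts reference; the "set" lines need an elevated prompt, and I haven't yet verified whether changing them actually alters the initial SYN window.

```python
# Sketch: inspect (and optionally change) the Windows TCP settings mentioned
# above by shelling out to "netsh interface tcp". Run from an elevated prompt
# if you uncomment the "set" commands.
import subprocess

def netsh(args: str) -> None:
    result = subprocess.run(f"netsh interface tcp {args}",
                            shell=True, capture_output=True, text=True)
    print(result.stdout or result.stderr)

# Current receive-window auto-tuning level, scaling heuristics state, etc.
netsh("show global")
netsh("show heuristics")

# The changes people suggest trying -- left commented out on purpose:
# netsh("set heuristics disabled")
# netsh("set global autotuninglevel=normal")
```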

I don't know why, but Microsoft has seemingly dumbed this all down and basically has the stance of "TCP Window Scaling is something the Windows Operating System handles automatically now, it is not something the admin can adjust! Trust us, we know best. Signed, Microsoft."

Ugh. It's a bit frustrating, to say the least. Anyone know the Microsoft networking stack reasonably well?


