Friday, September 3, 2021

TCP Retransmits and weird ACKing bottlenecking w/o packet loss

Hi!

Relevant image from the Wireshark capture at the client: https://zerobin.no/?659ba3fb227ee99d#GHWgarZnReicdZWGb75R9CumYD5GbtQAbv2mog1wChzn
(3 segments received at the same time; the 1st segment is retransmitted 0.02s later, just after the client ACKs the first three segments)

We're struggling with a machine "here and there" in our ~1000-machine network where connections to servers suddenly drop from ~850Mbps down to 2.5Mbps. This happens -within a session-, and it can be an application that relies on SQL, SQL performance testing, SMB or iPerf - anything, really.

If we have two computers simultaneously transferring data from a server, both located at the same place in the network, one can struggle while the other is fine. The next day it's the opposite. This happens at any of our ~100 sites, which are directly fiber-connected to our DC. The DC has 4 ESX hosts and different switches, none of which seem to have any problems, and the issue can arise on any server. I'm also sure we've managed to get, for instance, 2.5Mbps in iPerf while at the same time getting 850Mbps in the SQL performance tester - same client<->server, at the same time!

We seem to have narrowed it down to the image linked above. Everything works well until TCP ACKs from the client are suddenly delayed by 20ms as opposed to the normal ~0.1ms (as seen in the client capture), by which time the server has already started re-sending segments (see the TCP duplicate packets). Once this starts happening, it happens a lot that day for that client, but it may be fine again the next day, while another machine gets the problem.

10.82.66.16 is the client in this case, and 10.82.24.115 is the server. A full capture of the stream as seen by the client can be downloaded here: https://dropmefiles.com/QJ1ZA (never used that service before, but it seems legit). The streams captured at the FW and the server look the same, but I no longer have those files :|
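For anyone who wants to look for the same pattern in their own capture, here is a minimal sketch of how the delayed ACKs could be flagged. It assumes the packets have already been exported to simple records; the `Packet` type, the `find_delayed_acks` helper, the 5ms threshold and the sample data are all made up for illustration and are not taken from the actual capture. It also ignores sequence numbers and simply assumes every pure ACK from the client acknowledges all outstanding data.

```python
from dataclasses import dataclass

CLIENT = "10.82.66.16"
SERVER = "10.82.24.115"


@dataclass
class Packet:
    time: float       # seconds since the start of the capture
    src: str          # source IP address
    payload_len: int  # TCP payload bytes; 0 means a pure ACK


def find_delayed_acks(packets, threshold=0.005):
    """Return (ack_time, delay) for every pure client ACK that arrives more
    than `threshold` seconds after the oldest not-yet-ACKed server data."""
    delayed = []
    first_unacked_time = None
    for p in packets:
        if p.src == SERVER and p.payload_len > 0:
            # Only remember the oldest outstanding segment, so a
            # retransmission does not reset the clock.
            if first_unacked_time is None:
                first_unacked_time = p.time
        elif p.src == CLIENT and p.payload_len == 0:
            if first_unacked_time is not None:
                delay = p.time - first_unacked_time
                if delay > threshold:
                    delayed.append((p.time, delay))
                # Simplification: treat the ACK as covering everything.
                first_unacked_time = None
    return delayed


# Hypothetical sample shaped like the pattern in the screenshot:
# three segments arrive together, one is retransmitted ~20ms later,
# and the client's ACK shows up at about the same moment.
sample = [
    Packet(0.0000, SERVER, 1460),
    Packet(0.0001, CLIENT, 0),     # normal ACK, ~0.1ms
    Packet(0.0010, SERVER, 1460),
    Packet(0.0010, SERVER, 1460),
    Packet(0.0010, SERVER, 1460),
    Packet(0.0210, SERVER, 1460),  # retransmission
    Packet(0.0211, CLIENT, 0),     # delayed ACK, ~20ms
]

for ack_time, delay in find_delayed_acks(sample):
    print(f"ACK at {ack_time:.4f}s was delayed by {delay * 1000:.1f} ms")
```

The clock is kept on the oldest unacknowledged segment on purpose: otherwise the retransmission that arrives just before the ACK would hide the 20ms gap.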

We don't really experience any other problems; we have low jitter and practically no packet loss with ping flood/UDP iPerf. We did try setting TcpAckFrequency to 1, which for some reason did actually help temporarily, although we also see the problem with UDP. It works when the client is on WiFi, with the APs connected to the same switches. There are no dropped packets on the switches, firewall or router.

We've also tried not offloading the sessions in the firewall, but it really doesn't seem to make any difference, and the captures taken simultaneously at the server, FW and client are practically identical. On all three, we see the problem arise when the client waits those magic 0.02s before ACKing and the server starts retransmitting frames.
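As a rough sanity check (not a diagnosis), the numbers can be put side by side: if only a handful of segments get through per 0.02s stall, the throughput ceiling lands in the same low-Mbps range we are seeing. The sketch below assumes a 1460-byte MSS, which is a typical Ethernet value and not something confirmed from the capture; the function name is made up for illustration.

```python
def stalled_throughput_mbps(segments_per_stall, mss_bytes, stall_seconds):
    """Throughput ceiling in Mbit/s if only `segments_per_stall` segments
    of `mss_bytes` each are delivered per stall of `stall_seconds`."""
    bits_per_stall = segments_per_stall * mss_bytes * 8
    return bits_per_stall / stall_seconds / 1_000_000


# 1460 bytes is an assumed MSS; 0.02s is the ACK delay seen in the capture.
for segments in (3, 4, 5):
    mbps = stalled_throughput_mbps(segments, 1460, 0.02)
    print(f"{segments} segments per 20ms stall: about {mbps:.2f} Mbit/s")
```

With 3 segments per stall, as in the screenshot, this comes to about 1.75 Mbit/s, which is the same order of magnitude as the 2.5Mbps we measure.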

Hopefully someone can help, this is a true headache...


