Monday, December 7, 2020

Intermittent but frequent TLS handshake errors, possibly due to packet loss. Help!

Since 11/17/2020 I have been experiencing intermittent but frequent TLS handshake errors (either a timeout or a "Bad record MAC"). The problem appears to occur only between some AT&T broadband clients and websites hosted by HostGator.com. HostGator tech support asserts the cause is packet loss outside their data center, so not their problem. However, I'm having a difficult time isolating the source and/or cause and could really use some help before I get carried off in a straitjacket.

I have reduced the test case to using the "curl" command line utility to retrieve a plain text file from a website, with a UTC timestamp as the query string: "curl -v https://HOST/test.txt?TIMESTAMP". I repeat the test 25 times, sleeping for 1s between attempts. This eliminates the web browser and/or caching as a cause.
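For anyone who wants to reproduce the test, here is a minimal Python sketch of the same loop. It is not the exact script I ran: the timestamp format, the 30-second --max-time limit, and the function names are assumptions for illustration, and HOST remains a placeholder. It relies on curl's documented exit codes (28 for an operation timeout, 35 for an SSL connect error); I am assuming the "Bad record MAC" failures surface as 35, which is worth confirming against the verbose output.

```python
import os
import subprocess
import time
from collections import Counter
from datetime import datetime, timezone

# curl exit codes from the curl man page: 28 = operation timeout,
# 35 = SSL connect error. Anything else is reported by number.
CURL_OUTCOMES = {0: "ok", 28: "timeout", 35: "tls_error"}


def build_url(host, now):
    """Build the test URL with a UTC timestamp as a cache-busting query string.

    The timestamp format here is an assumption; any unique string would do.
    """
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"https://{host}/test.txt?{stamp}"


def classify(exit_code):
    """Map a curl exit code to a short outcome label."""
    return CURL_OUTCOMES.get(exit_code, f"other_{exit_code}")


def run_trials(host, count=25, delay=1.0, run=subprocess.run):
    """Fetch the test file `count` times, sleeping `delay` seconds between tries.

    Returns a Counter of outcome labels. `run` is injectable so the loop
    can be exercised without touching the network.
    """
    tally = Counter()
    for attempt in range(count):
        url = build_url(host, datetime.now(timezone.utc))
        result = run(
            ["curl", "-sS", "-o", os.devnull, "--max-time", "30", url],
            capture_output=True,
            text=True,
        )
        tally[classify(result.returncode)] += 1
        if attempt < count - 1:
            time.sleep(delay)
    return tally
```

Calling run_trials("HOST") with a real hostname returns a tally of outcomes for one run of 25 attempts, which makes it easier to compare failure rates between clients and servers.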

I can replicate the problem from two different houses in adjacent cities with AT&T broadband (one U-verse, the other with Sonic.net, which uses the AT&T infrastructure), each with a different model of Arris router. It occurs from both Linux (Fedora & Raspberry Pi--one house only) and Windows 10, to two different HostGator servers (one a Virtual Private Server, the other a shared hosting package with different IP addresses). I am seeing the problem with other HostGator servers as well, but focused on these two for the attempted diagnosis. The problem does *not* occur with a website hosted at GoDaddy from the same client systems, between the two AT&T client systems, or between the two HostGator webservers. Nor does it occur to the same websites from a Raspberry Pi via unWired Broadband or Frontier Communications DSL from a location near Fresno, CA.

The problem is symmetric in that I can run the same curl test on one of the HostGator servers back to both AT&T client systems running Apache and get TLS handshake errors. I have no trouble with the popular websites like Google or YouTube. I'm now seeing occasional timeouts with regular FTP trying to perform mput/mget. Transferring files individually seems to work reliably after I get connected, but the transfer of a set of files can hang up. The Fedora Linux and the Windows 10 curl clients are negotiating TLSv1.3 while the Raspberry Pi negotiates TLSv1.2.

What's common about the failing scenarios is that they traverse three networks: AS7018 (AT&T), AS2914 (NTT) and AS46606 (UnifiedLayer, which is a subsidiary of Endurance International which in turn also owns HostGator.com). However, the path from the clients to the websites and the reverse path are usually different with only the endpoints being shared. I can't even guarantee that the path I am recording using the mtr utility is exactly the same path the curl requests are traversing.

mtr often reports packet loss at one node on the NTT network (sometimes at two adjacent hops in that network), so I opened a ticket with their NOC. Their response was that their network is healthy and they are not seeing any packet loss, either from their node on to the website or back from their node to the AT&T client. NTT also says the reported packet loss may be due to the low priority given to ICMP packets. On the way from the client to the HostGator servers I can see packet loss at both NTT & UnifiedLayer nodes. On the way back from the HostGator servers to the AT&T clients mtr shows packet loss in the NTT and AT&T networks. So packet loss does appear on the NTT network in both directions.
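A common rule of thumb for reading mtr output, consistent with NTT's remark about ICMP priority, is that loss reported at an intermediate hop but not at the hops after it usually means the router is deprioritizing its own ICMP replies, not dropping transit traffic. Loss that carries through to the final hop is more likely to be real. The sketch below applies that heuristic to a list of per-hop loss percentages. The function name and the verdict labels are my own, and the heuristic is a rough guide, not proof either way.

```python
def classify_hops(hops):
    """Label each hop's reported loss as clean, isolated, or persistent.

    `hops` is a list of (label, loss_percent) tuples ordered from the
    client toward the destination, as read off an mtr report.

    - "clean":    no loss reported at this hop
    - "isolated": loss here, but at least one later hop reports none
                  (often just ICMP rate limiting on that router)
    - "persists": loss here and at every later hop through the destination
                  (more likely to affect real traffic)
    """
    results = []
    for index, (label, loss) in enumerate(hops):
        if loss <= 0:
            verdict = "clean"
        elif all(later_loss > 0 for _, later_loss in hops[index + 1:]):
            verdict = "persists"
        else:
            verdict = "isolated"
        results.append((label, verdict))
    return results
```

Under this heuristic, a report where only a middle hop shows loss would be labeled "isolated" there, while a report where loss starts at some hop and continues to the destination would be labeled "persists" from that hop onward.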

I recorded the network traffic with Wireshark for one of my test runs from the client end and found that the successful sequence between client and server is:

=> SYN
<= SYN, ACK
=> Client Hello
<= ACK
<= Server Hello, Change Cipher Spec, Encrypted Extensions
=> ACK
<= ACK
=> ACK
<= ACK
=> ACK
<= PSH, ACK
=> ACK
<= Certificate, Certificate Verify, Finished
=> GET /test.txt?Mon_07_Dec_2020_...

When it fails with a Bad Record MAC error, the sequence is a bit different:

=> SYN
<= SYN, ACK
=> ACK
=> Client Hello
<= ACK
<= Server Hello, Change Cipher Spec, Encrypted Handshake Message
=> ACK
=> Alert (Level: Fatal, Description: Bad Record MAC)

And when it fails with a timeout:

=> SYN
<= SYN, ACK
=> Client Hello
<= ACK
<= Continuation Data (TCP previous segment not captured)
=> Dup ACK
<= TCP Retransmission
=> ACK
=> FIN, ACK
<= FIN, ACK
=> RST

Specifically, the "Server Hello" message description changes from "Server Hello, Change Cipher Spec, Encrypted Extensions" to "Server Hello, Change Cipher Spec, Encrypted Handshake Message" with the Bad Record MAC error. So it looks like the client is receiving something different from the server. Unfortunately I can't capture packets from the other direction (although I suppose I could capture them on the client end, but wouldn't see what gets sent but lost). When the timeout occurs, again a different packet (Continuation Data) is received from the server.

Something changed a few weeks ago, apparently outside my environment, given that I've been running the same clients against the same websites for some years with no evidence of this problem. I'm not seeing any of the SSL libraries on the Linux systems having been touched close to 11/17/2020 and it also fails with the Windows 10 client. I'm also not sure how much trust to put in the packet loss data from mtr.

This problem is making it hard to get work done on these websites because I have to keep reloading web pages when they get stuck while establishing a secure connection. I need to find a solution.

Where should I look for help next? What have I overlooked? It doesn't seem like I have enough concrete evidence of a specific packet loss failure given packet loss can be found at different nodes on different paths to/from the website. Perhaps packet loss isn't the direct cause of the TLS handshake errors? Help!


