This might be a bit long. I'm mainly posting it in case some other poor soul runs across a similar situation, and because this is the first problem in my networking career that I almost had to walk away from.
I'm a senior somethingorother at an enterprise that's fairly distributed, and highly reliant on "the cloud". Due to our size there can often be 10-15 hops before we even reach the internet from our enterprise, which is an important detail that comes into play later. I've seen a lot of problems in my time, but this specific one had me and quite a few others confused for a while, especially due to a lot of contradictory evidence.
Users started reporting connectivity issues with a $cloud_provider API from one of our datacenters: the DC is in North America, the cloud API is in Europe. Previously this was working fine, and the users indicated no changes had been made on their side. Only this one region wasn't working; from the same machines we could hit the same API service from the same cloud provider in both the North American and APJ regions with no issues.
A packet capture from the client showed that there was likely packet loss here, with TCP retransmissions occurring until the death of the TCP flow via a FIN from the cloud provider's API server. Right away we saw that the TCP three-way handshake worked fine, but our TLS client hello was seemingly never making it to the far end. The client kept retransmitting it until the FIN came from the cloud provider, which never received any data and just saw an idle connection. It turned out this was possible to replicate 100% of the time from any CentOS 6/7/8, Ubuntu 18/20, Debian, or FreeBSD server in the datacenter, VM or bare metal.
We began troubleshooting with a packet capture at the datacenter edge, where we connect into the wider enterprise network. This capture showed the lost TLS client hello leaving our network, with the packet being well-formed and not borked in transit. We engaged the enterprise network admins, and in turn they did a packet capture at the enterprise edge on the cross-connect to the transit provider that was handling this flow; the TLS client hello was also seen leaving there. At this point we had seen the packet leave the edge of our network, so we believed we had exonerated our network and needed help from the cloud provider to determine if they were receiving the packet.
While waiting for the cloud provider to get in gear (and stop blaming our datacenter network, enterprise network, firewalls, "outdated" Linux kernels (that are as up to date as the still-supported distros ship...), "outdated" TLS libraries, "outdated" curl, missing root CA certificates on our client, our TCP/IP stacks all being configured wrong, the color of paint in the datacenter not being to their liking) we did a bunch more troubleshooting. The default TTL on these Linux distros is 64, and we don't mess with that since we have NEVER had an issue with it. One of the first things we did was a traceroute, and we consistently saw the destination IP at hop 48, which we felt was far enough away from 64 for comfort. Additionally, packet captures at our enterprise edge, datacenter edge, and host showed we were not getting an ICMP TTL exceeded back. So we moved on.
MTU/MSS was our next thought. The TLS client hello was only ~300 bytes, but it needed to be ruled out, so we pulled at this thread for a while as well. It went nowhere, since we were quickly able to rule it out based on some testing and playing with MTU + MSS clamping. For what it's worth, most engineers I talked to about this problem quickly thought of MTU or MSS being the issue, so this wasn't time wasted by any means.
We determined that macOS and Windows worked fine from the same datacenter VLANs as the broken Linux clients, which confused us a bit more. We started to rabbit-hole on the fact that it was the TLS client hello getting lost, and to consider weird possibilities like the ciphersuites, extensions, or something else in the packet tripping up a middlebox somehow, since of course the packets look very different from each OS. Honestly, we did so much more troubleshooting in here: turning off TCP sequence randomization on our firewalls, bypassing TCP state checks on our firewalls, fast-pathing traffic through any middleboxes, deploying machines right from vendor ISOs, etc. Nothing worked.
Since the API server is managed by the cloud provider, it's not like we could get a packet capture on our own, so we were stuck here: the cloud provider kept telling us that getting a packet capture wasn't possible. We argued that talking to all three transit providers between us in North America and the cloud provider in Europe, with something like 30 hops, and doing packet captures with each of them to determine where the packet was being lost would be insane. Again, we knew the lost TLS client hello was leaving our network, but we could not know if it was making it to the cloud provider, and this seemed the best thing to check first. Their network engineers did not agree and fell back to the good old "Well no one else is having problems, and we're big, so clearly your network is broken"... even though we were the only ones to ever provide packet captures.
Around this time we figured out that $cloud_provider's own Linux distro, which is based on RHEL, worked fine from our datacenter, in the same VLANs. We asked the cloud provider what they had customized in this distro, and started doing our own A/B comparisons of the proc tunables related to TCP/IP. Turns out they touched a lot, and this was going to take time.
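For anyone wanting to do a similar A/B comparison, here is a rough sketch in Python of diffing two `sysctl -a` dumps. This is not the tooling we actually used: the function names are made up, and the two sample dumps below are invented values for illustration only, not what either distro really ships.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            settings[key.strip()] = value.strip()
    return settings


def diff_sysctl(dump_a, dump_b, prefix="net.ipv4."):
    """Return {key: (value_a, value_b)} for keys under `prefix` that differ.

    A key missing from one dump shows up with None on that side.
    """
    a, b = parse_sysctl(dump_a), parse_sysctl(dump_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if key.startswith(prefix) and a.get(key) != b.get(key)
    }


# Hypothetical dumps: values are invented for the example.
HOST_A = """
net.core.somaxconn = 128
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
"""
HOST_B = """
net.core.somaxconn = 4096
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_ecn = 2
"""

for key, (ours, theirs) in diff_sysctl(HOST_A, HOST_B).items():
    print(f"{key}: ours={ours} theirs={theirs}")
```

In practice you would feed it the real output of `sysctl -a` captured on a broken host and on a working one, and widen or narrow the prefix as needed.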
We set up our own instances running an httpd in the same region with the same cloud provider, and could not replicate the problem with clients from the same datacenter VLANs. It only happened with the cloud provider's API.
After working on this for weeks, I walked away from it for a bit and decided I needed to take a fresh approach. We knew that HTTPS wasn't working (using multiple clients like curl, openssl s_client, etc.) due to the TLS client hello getting lost, but what about just telnetting to the API and sending data? Would that data at least get ACKed, even if the application/httpd didn't understand it? This turned out to be key. Even the tiny telnet packets with junk data weren't getting ACKed, and we saw the same retransmissions until the death of the flow via a FIN from the cloud provider. In fact, NO data packets made it to the far end; the very first data packet never got ACKed. This eliminated a whole whack of possibilities, and I knew it was time to focus on the lower layers. I went to the working $cloud_provider distro VM and checked the default TTL: 255. We set the default TTL on our other Linux VMs to 255, and of course things started working.
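Changing the system-wide default is the blunt way to test this. The same experiment can be done per connection, which is handy if you can't touch the host settings. Below is a minimal sketch, assuming Python is available on the client: it sets the TTL on a single socket with the standard `IP_TTL` socket option, sends some junk the way our telnet test did, and returns whatever comes back. The function name and its parameters are made up for illustration.

```python
import socket


def send_with_ttl(host, port, payload, ttl, timeout=5.0):
    """Open a TCP connection using a specific IP TTL, send `payload`,
    and return the first bytes received (b"" on timeout or if the peer
    closes without sending anything).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Per-socket override of the system default TTL (IPv4 only).
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        sock.settimeout(timeout)
        sock.connect((host, port))
        # sendall() returning only means the kernel queued the data,
        # NOT that the far end ACKed it.
        sock.sendall(payload)
        try:
            return sock.recv(4096)
        except socket.timeout:
            return b""
```

Because `sendall()` succeeds as soon as the kernel has buffered the data, you still need a packet capture on the client to see whether the payload is actually ACKed or just retransmitted. Calling it once with `ttl=64` and once with `ttl=255` against the same endpoint while capturing is the comparison that would have saved us weeks.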
There was a lot of conflicting data here: the traceroute showing the destination 48 hops away, the TCP handshake working, the data packets not, the FIN,ACK packets from our side working to acknowledge the teardown of the flow, all mixed in with a bunch of other things. As best we can tell (because $cloud_provider won't tell us any of the secret sauce), the cloud provider offloads some of the mundane TCP stuff, but the data packets to this service go further to some backend, either over load balancers or some ECMP/other load balancing setup that decrements the TTL but DOES NOT helpfully originate an ICMP TTL exceeded message. We brought this up with them and it was more or less shrugged off; had we gotten this ICMP message we wouldn't have wasted so much time on this. We also pointed out that this could totally be hit by any customer enough hops away, and that their claim of "it's just you" was not very convincing. I really doubt any transit providers are filtering these ICMP TTL exceeded messages, so it's pretty likely the cloud provider isn't originating them, because we know for sure they never even hit our enterprise edge.
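To make that theory concrete, here is a toy model in Python. It is only a sketch of our best guess, not anything the cloud provider confirmed: the 47 visible routers come from our traceroute (destination at hop 48), but the number of hidden hops behind the front end (`HIDDEN_HOPS = 20`) and all of the names are pure assumptions picked so the example behaves like what we observed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hop:
    name: str
    sends_icmp: bool = True  # does it report TTL expiry back to the sender?


def forward(ttl: int, path: List[Hop]) -> Tuple[str, Optional[str]]:
    """Walk a packet along `path`. Every hop except the last forwards the
    packet and decrements the TTL; the packet dies where the TTL hits 0."""
    for hop in path[:-1]:
        ttl -= 1
        if ttl <= 0:
            outcome = "icmp_ttl_exceeded" if hop.sends_icmp else "silent_drop"
            return outcome, hop.name
    return "delivered", path[-1].name


HIDDEN_HOPS = 20  # ASSUMPTION: unknown in reality, chosen for illustration

routers = [Hop(f"router-{i}") for i in range(1, 48)]  # 47 visible hops
hidden = [Hop(f"hidden-{i}", sends_icmp=False) for i in range(1, HIDDEN_HOPS + 1)]

# The handshake (and traceroute) terminate at the front end, hop 48.
handshake_path = routers + [Hop("frontend")]
# Data is assumed to continue past the front end to some backend.
data_path = routers + [Hop("frontend", sends_icmp=False)] + hidden + [Hop("backend")]

print(forward(64, handshake_path))  # ('delivered', 'frontend')
print(forward(64, data_path))       # ('silent_drop', 'hidden-16')
print(forward(255, data_path))      # ('delivered', 'backend')
```

With these made-up numbers, a TTL of 64 is plenty for the handshake but runs out somewhere behind the front end with no ICMP coming back, while a TTL of 255 gets through. That matches every symptom we saw, which is about as much as we can say without the provider confirming it.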
A lot of lessons learned here, and I probably even missed some of the more obscure things we tried while trying to debug this.
Anyway, hope this helps someone, or at least was an interesting read. This was genuinely the first problem where I was starting to doubt my sanity.