Monday, December 7, 2020

Issue with TCP traffic over MPLS VPN

Our client has a main site with a half dozen remote sites that route all of their traffic through it. About six months ago we replaced our L2 Metro Ethernet with an L3 MPLS VPN (both from CenturyLink). All of the gear remained the same. The only major change is that we set up BGP to handle routing between our sites.

Since that change, we've been dealing with issues tied to Active Directory from systems at remote sites (domain controllers are located at the main site). Certain apps that depend on AD will either lock up or time out. Eventually I was able to trace the lockups to LDAP traffic over TCP port 389. Whenever a significant amount of data needs to transfer (in particular, something called a Schema Cache update, which is ~3.5MB), the data won't fully transfer and the TCP connection hangs until an application-level timeout is reached.

I was able to further isolate this issue to not just LDAP traffic, but any traffic over TCP port 389. Testing with HTTP or iperf over TCP 389 produces the same results. I even spun up a fresh server, got BGP running, and plugged it directly into the CenturyLink router (an Adtran 5660) at our main site. iperf tests directly to that device reproduce the issue.
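For anyone who wants to reproduce this kind of test without iperf, here is a minimal Python sketch of the idea: a sink that discards whatever it receives, and a sender that reports a stall when a send blocks for too long. The names `run_sink` and `send_and_watch`, the 16KB chunk size, and the 5s stall timeout are my own assumptions, not part of the original tests. Note that binding to port 389 typically needs elevated privileges on Linux, and that `bytes_sent` counts bytes handed to the local TCP stack, not bytes confirmed by the far end.

```python
import socket
import threading


def run_sink(host="127.0.0.1", port=0):
    """Start a TCP sink that reads and discards data from one client.

    Returns (server_socket, bound_port). Port 0 lets the OS pick a free port;
    pass 389 to mirror the failing case (hypothetical usage).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)

    def serve():
        try:
            conn, _ = srv.accept()
        except OSError:
            return  # server socket was closed before a client connected
        with conn:
            while conn.recv(65536):
                pass  # discard everything until the client closes

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]


def send_and_watch(host, port, total_bytes, stall_timeout=5.0):
    """Send total_bytes to host:port; return (bytes_sent, stalled).

    stalled is True if a single send blocked longer than stall_timeout,
    which happens once the local send buffer fills and stops draining.
    """
    chunk = b"\0" * 16384
    sent = 0
    with socket.create_connection((host, port), timeout=stall_timeout) as s:
        try:
            while sent < total_bytes:
                sent += s.send(chunk[: total_bytes - sent])
        except socket.timeout:
            return sent, True
    return sent, False
```

Running the sink at the main site and the sender at a remote site with `total_bytes` around 3,500,000 would roughly mimic the size of the Schema Cache update mentioned earlier.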

Just recently I found that the failure is correlated with bandwidth. Our remote sites each have 50Mbit back to the main site. As the throughput of a TCP port 389 connection approaches 40Mbit, the connection becomes increasingly likely to stall in less than 1s. As you slow the traffic down, the connection survives longer and longer. For example, at 38Mbit the connection moves traffic for ~30s, at 30Mbit for ~180s, at 25Mbit for ~600s, etc. The connection still locks up eventually at these rates; it just lasts longer before it does. This has at least given us a workaround: I've throttled TCP 389 traffic down to 20Mbit on our main site router.
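As a quick sanity check on the figures above, the short Python sketch below converts each rate and time-to-stall pair into the approximate amount of data moved before the lockup. The inputs are the rough (~) numbers quoted above, and the helper name `megabytes_before_stall` is just for illustration. If the results were all similar, that would point at some fixed byte quota; they are not, which suggests (though does not prove) that the trigger is tied to rate rather than volume.

```python
# (rate in Mbit/s, approximate seconds before the connection stalls)
observations = [(38, 30), (30, 180), (25, 600)]


def megabytes_before_stall(rate_mbit, seconds):
    """Approximate data moved before the stall, in megabytes (8 bits per byte)."""
    return rate_mbit * seconds / 8


for rate, seconds in observations:
    mb = megabytes_before_stall(rate, seconds)
    print(f"{rate} Mbit/s for ~{seconds}s is roughly {mb:.1f} MB before the stall")
```

That works out to roughly 142.5 MB, 675 MB, and 1875 MB respectively, so the volume transferred before failure grows by more than an order of magnitude as the rate drops.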

Other important pieces of information: 1) The issue does not occur on traffic directly between the remote sites, only on traffic to the main site. 2) If we move traffic off the MPLS circuit and over to a backup circuit (VPN over 4G LTE), the issue goes away. 3) Traffic on every other TCP port I've tested is unaffected (80, 388, 390, 636, 443, 8080, etc.).

I feel like I've done all of my due diligence and have enough to put this on CenturyLink. I've had a ticket going with them for about a week now and have worked with three different techs, and they're all telling me they can't see any issues. One of them "made a correction to the WAN interface shaper" and rebooted their router, to no avail. Is there anything else I should be looking for on our end, or does this seem like a provider issue at this point? Appreciate any direction or advice.


