Wednesday, April 18, 2018

Single source to single Citrix Netscaler - strange traffic issues

I've had such a strange, hair-pulling issue over the past few days, and an equally strange workaround for it:

  • Multiple clients all connecting to a remote (over internet) Citrix Netscaler, hosted in an ISP datacenter. Working fine for years
  • On Monday, two clients using the same ISP started getting constant disconnections. An increase in duplicate ACKs in the captures is the only real difference I can see (one way to tally these is sketched just after this list)
  • To replicate it quickly, I just have to connect and push the bandwidth up by running a YouTube video in the Citrix session
  • Fixed one client by switching the route to their second internet link. Same ISP, but some different hops. Strange, but then focused on the next client
  • The second client has an MPLS provided by the same ISP, with a single internet exit point
  • Confirmed:
    • Second client can connect to other external Citrix farms and run in them without issue (same ISP and different)
    • The second client's Citrix pool behind the first Netscaler can be connected to and run in from elsewhere without issue (same ISP and different)
    • The second client can connect to other services (e.g. RDP) hosted in the same place as the first Netscaler
    • The second client can't connect to any additional or NATed IPs applied to the first Netscaler, while other places can
    • The second client has issues on both ports it uses on the Netscaler (443 and an alternate for a different connection type)
    • Tested during low bandwidth times
    • ISP made no changes to any nodes in the path recently
    • We've made no changes to the destination
    • Client has made no onsite changes (MPLS is servicing more than one site as well)
  • Changes that didn't fix it:
    • Got the ISP involved and had them change the route from the MPLS to the first Netscaler
    • Restarted the MPLS internet exit device (PFSense that hasn't changed config in years) and updated its firmware
    • Restarted the router in front of the Netscaler
    • Restarted the Netscaler
    • Dropped the MSS on the Netscaler from 1460 to 1380
    • Removed all QoS from nodes between the test devices in the MPLS and the first Netscaler
    • Changed the external IP of the source (static source NAT on the PFSense)
    • Cleared PFSense state tables
  • What finally ended up working around it:
    • I created a site-to-site IPSec VPN between the PFSense router and the Cisco router in front of the Netscaler
    • The IPSec tunnel only allows traffic to the same original external Netscaler IP, not a different internal one
    • The tunnel is just doing its standard thing of wrapping packets and bypassing PAT on the PFSense side (the Netscaler on the Cisco side already has a direct public IP, so there was no NAT there to begin with)
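
For reference on the duplicate ACKs mentioned above: if you'd rather script the check than use Wireshark's tcp.analysis.duplicate_ack display filter, a minimal Scapy sketch like the one below is enough to see the trend. The capture filename is a placeholder and the heuristic is deliberately crude: it just counts bare ACKs that repeat the previous ACK number for a flow.

    # Crude duplicate-ACK tally per TCP flow from a capture file.
    # "citrix_session.pcap" is a placeholder name.
    from collections import defaultdict
    from scapy.all import rdpcap, IP, TCP

    packets = rdpcap("citrix_session.pcap")
    last_ack = {}                 # (src, sport, dst, dport) -> last ACK number seen
    dup_acks = defaultdict(int)   # same key -> duplicate ACK count

    for pkt in packets:
        if not (IP in pkt and TCP in pkt):
            continue
        ip, tcp = pkt[IP], pkt[TCP]
        flow = (ip.src, tcp.sport, ip.dst, tcp.dport)
        payload_len = ip.len - ip.ihl * 4 - tcp.dataofs * 4   # TCP payload bytes
        # bare ACK repeating the previous ACK number for this flow
        # (Wireshark's duplicate-ACK heuristic is stricter, but this shows the trend)
        if "A" in tcp.flags and payload_len == 0 and last_ack.get(flow) == tcp.ack:
            dup_acks[flow] += 1
        last_ack[flow] = tcp.ack

    for (src, sport, dst, dport), count in sorted(dup_acks.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{src}:{sport} -> {dst}:{dport}  duplicate ACKs: {count}")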

Now this is super strange to me - the tunnel should be adding extra data and load to the packets and the endpoints. If it were an intermediate ISP node issue, I'd assume the packets would have to get smaller to fix it, not larger.
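
For a sense of the "extra data" part, here's a rough back-of-envelope of what the tunnel adds to each packet. These figures assume ESP in tunnel mode with AES-CBC and HMAC-SHA1-96, a common default proposal, not necessarily what our two endpoints actually negotiated:

    # Rough per-packet overhead for IPSec ESP in tunnel mode, assuming an
    # AES-CBC / HMAC-SHA1-96 proposal; exact numbers depend on what the
    # PFSense and the Cisco actually negotiate.
    OUTER_IPV4  = 20   # new outer IPv4 header
    ESP_HEADER  = 8    # SPI (4 bytes) + sequence number (4 bytes)
    AES_CBC_IV  = 16   # per-packet IV
    ESP_TRAILER = 2    # pad length (1) + next header (1), plus 0-15 pad bytes
    SHA1_ICV    = 12   # truncated HMAC-SHA1-96 integrity check value

    fixed = OUTER_IPV4 + ESP_HEADER + AES_CBC_IV + ESP_TRAILER + SHA1_ICV
    print(f"fixed overhead: {fixed} bytes per packet (plus up to 15 bytes of padding)")
    print(f"a full 1500-byte inner packet becomes ~{1500 + fixed} bytes on the wire")

Assuming a 1500-byte path MTU, full-size packets inside the tunnel either get fragmented or need their MSS clamped even lower than the 1380 I'd already tried, which only makes the workaround stranger.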

The ISP assisted thoroughly with the troubleshooting (I'm really thankful to have good support here) and is confident it's not their issue. I can see it from their point of view, and I can't imagine what could be causing such a specific source/destination issue either.

Has anyone ever come across anything like this?


