Monday, February 17, 2020

Inbound packet loss/OOO packets/dup acks to DC VM environment

Hello r/networking!

*Edit - We've checked all the switching MTUs and have no mismatches.*

We've been trying to track down an inbound TCP problem in our data center for about a week now, with no clear answer on what could be the cause... and of course we're the guys defending the network. I know there are plenty of guys on here smarter than me, so hopefully someone can help point our team in the right direction or share some experience.

The infrastructure - It's your typical legacy design

ASR 1001x's at the edge doing default route to the ISP

Palo Alto VM300 - firewall

ASR 1002 - DCI Router to DR DC

Meraki MX 450's - Single arm VPN concentrator

Nexus 3548's - as service edge switches - vPC between them

Nexus 3132q's - as core switches - vPC between them

6300 Series FIs - Connects UCS environment to the network

Routes learned via OSPF - HSRP on the VLANs - SE and Cores connected @ L2 via port channels, plus L3 links between the Cores and SE switches. Very simple, and not much outside of that supporting the VMs in the DC.

Items of Interest -

  1. No networking config changes have been made within 30 days.
  2. BFD between the SE and Cores caused an outage about 3 months ago. "no bfd" on the interfaces connecting the SE and Cores brought the DC back up. Cisco's RCA was "don't use BFD internally".
  3. The firewall is virtual (Palo Alto)
  4. No traffic shaping in place anywhere inside the DC
  5. I tested a backup job to an appliance connected via the core switches and got the expected speed.
  6. A large majority of the VMs have been updated to 2019 recently - I've tried disabling ECN with no change in behavior.
  7. VM to VM traffic is very quick.
  8. Moving a file to another cluster on a different FI produces good results inside of the DC in question.
  9. File downloads from the internet on a test server work fine.

The problem -

Recently one of our sys admins noticed a .iso file transfer from our office via the Meraki VPN moving very slowly inbound into the DC in question. I saw it transferring @ 355 KB/s, so I began to check for congestion on our inet circuit (1G). Our netflow monitor showed the circuit at only 13% utilization at that point in time, which prompted a deeper look into things.
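For a sense of scale, here is a quick back-of-the-envelope check of how little of the circuit that one transfer was using. It's only a sketch: it assumes "KB" means 1000 bytes and that the circuit is a full 1 Gbit/s, and the function name is made up for illustration.

```python
def link_share(rate_kbytes_per_s: float, link_bits_per_s: float) -> float:
    """Return the fraction of the link that a single transfer is using."""
    bits_per_s = rate_kbytes_per_s * 1000 * 8  # assumption: 1 KB = 1000 bytes
    return bits_per_s / link_bits_per_s


transfer_kbytes = 355.0      # observed transfer rate in KB/s
link_bps = 1_000_000_000     # assumption: 1G circuit = 1 Gbit/s

share = link_share(transfer_kbytes, link_bps)
print(f"{transfer_kbytes * 8 / 1000:.2f} Mbit/s, {share:.2%} of the link")
```

So the transfer on its own was well under 1% of the circuit, which fits with the circuit not being the bottleneck.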

I had him test the same transfer into another DC, and it moved at the correct speed into that location. Now I'm really starting to dig into what else could be going on to cause congestion. I pull up SolarWinds and see no high CPU and no highly utilized interfaces on anything... very odd. We then think "Perhaps it's the net circuit, so let's move the file via the DCI!" The 1G DCI really isn't used for much during the day, and the path it takes into the DC bypasses all the usual suspects - the firewall, the Meraki, the internet.

The DCI traffic produced the exact same results at the same speed, all while outbound traffic moved at expected speeds. Now I'm looking @ interfaces for errors, drops, anything to point out the culprit. No errors on the interfaces, no incriminating drops, nothing. We started doing packet captures on the server and client and found MASSIVE amounts of unseen segments, out-of-order packets, and dup acks. All the TCP retransmission and recovery that goes with that is obviously what's slowing down the inbound traffic.
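If anyone wants numbers rather than a screenshot, one rough way to quantify those symptoms is to export the packet list from Wireshark as text and tally the bracketed TCP analysis notes in the Info column. This is a minimal sketch: the function name is made up, and the sample lines are invented for illustration, not taken from our capture.

```python
import re
from collections import Counter

# Matches the bracketed TCP analysis notes Wireshark shows in the Info column,
# e.g. "[TCP Dup ACK 812#3]" or "[TCP Out-Of-Order]". The optional group drops
# the "frame#count" suffix so all dup acks land in one bucket.
FLAG_RE = re.compile(r"\[TCP ([A-Za-z\- ]+?)(?: \d+#\d+)?\]")


def tally_tcp_flags(info_lines):
    """Count TCP analysis notes per category across Info-column strings."""
    counts = Counter()
    for line in info_lines:
        for name in FLAG_RE.findall(line):
            counts[name] += 1
    return counts


# Invented sample lines for illustration only.
sample = [
    "[TCP Previous segment not captured] 50122 → 445 [ACK] Seq=2921 Ack=1 Len=1460",
    "[TCP Dup ACK 812#1] 445 → 50122 [ACK] Seq=1 Ack=1461 Len=0",
    "[TCP Dup ACK 812#2] 445 → 50122 [ACK] Seq=1 Ack=1461 Len=0",
    "[TCP Out-Of-Order] 50122 → 445 [ACK] Seq=1461 Ack=1 Len=1460",
    "50122 → 445 [ACK] Seq=4381 Ack=1 Len=1460",
]

for name, count in tally_tcp_flags(sample).most_common():
    print(f"{name}: {count}")
```

Comparing those counts against the total packet count for a slow transfer and a fast one would give a rough rate to compare between paths.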

On the way into work this morning I had the idea of testing without the VM/Windows environment involved. I would add some data to the file server at our office, then start a backup job to an appliance connected to the core switches. This would traverse the same path as SMB but not hit the FIs or end on a VM. This worked perfectly. Now I'm really at a loss - we've had TAC looking @ it, but that may take weeks to resolve. We've had several different engineers take a look too, but nothing has come out of it beyond "we need to capture at every hop". We are in the process of doing that, but it's taking quite a while because one side is managed by the data center staff.
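For what it's worth, the "capture at every hop" idea boils down to comparing the same flow at two capture points and seeing what went missing or got reordered in between. Here's a rough sketch of that comparison, assuming you've already pulled the TCP sequence numbers for one flow out of each capture. The function name, capture-point names, and numbers are all hypothetical, and it ignores retransmissions and sequence number wraparound.

```python
def compare_hops(upstream_seqs, downstream_seqs):
    """Compare TCP sequence numbers of one flow seen at two capture points.

    Returns (missing, reordered): segments seen upstream but never seen
    downstream, and segments that showed up downstream after a higher
    sequence number had already been seen.
    """
    downstream_set = set(downstream_seqs)
    missing = [s for s in upstream_seqs if s not in downstream_set]

    reordered = []
    highest = -1
    for s in downstream_seqs:
        if s < highest:
            reordered.append(s)   # arrived behind a later segment
        else:
            highest = s
    return missing, reordered


# Hypothetical sequence numbers for illustration, not from real captures.
before_fi = [1, 1461, 2921, 4381, 5841]   # e.g. captured on the core switch
on_vm = [1, 2921, 1461, 5841]             # e.g. captured on the receiving VM

missing, reordered = compare_hops(before_fi, on_vm)
print("missing downstream:", missing)
print("reordered downstream:", reordered)
```

If a hop pair shows clean results upstream and loss or reordering downstream, the problem sits between those two points.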

If anyone needs more specific data or screen caps let me know.

I'm really starting to believe it's something in the FI or VM environment, but as usual we have to defend the network first.

Here's a screen cap of Wireshark from the receiving server -

https://i.imgur.com/LdBd2Gd.png

Suggestions, comments, ideas, theory - all welcome here!

TL;DR - Inbound SMB/FTP traffic to our VM environment produces dup acks, out-of-order packets, and retransmits. Switches and circuits check out. Moving data to a backup appliance instead of a VM works as designed.


