Monday, June 7, 2021

Network slowness Microbursts

Hello all!

Two months ago we did a refresh at one of our remote sites. We upgraded to 2X C9500 as the Core layer and 2 stacks of 48-port C9200L. The first stack (6X C9200L) has 2X 10Gbps uplinks in LACP and the second stack (4X C9200L) has 2X 1Gbps uplinks in LACP to the Core. Our Core is connected to our SD-WAN firewalls. Pretty straightforward: no QoS, basic L3 inter-VLAN routing on our Core with a default route to the firewalls.

Since the upgrade, users have been complaining about the network being slow. They try to pull some reports from our DC and the reports take forever to load. I posted here 3 weeks ago and someone told me to run some captures and look for TCP errors, MTU, etc.

Luckily, when I called the user this morning to set up the ongoing capture on his PC, I saw the problem right in front of my eyes and was able to capture it. In the capture, I see a lot of TCP retransmissions. MSS is negotiated properly (1392) according to the path MTU. It goes as follows:

The client sends an HTTP GET request with a TCP segment length of 2246. The server responds with an ACK of 1393. The client then sends 6X TCP retransmissions with Seq of 1393 and NextSeq of 2247. That is 854 bytes in flight, taking roughly 12 seconds...
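To double-check the sequence numbers, here is a small Python sketch of the arithmetic. It assumes the values are Wireshark relative sequence numbers, so the first data byte has Seq 1; the variable names are just for illustration.

```python
# Figures from the capture described above.
# Assumption: these are Wireshark *relative* sequence numbers,
# so the first payload byte of the connection has Seq = 1.
MSS = 1392            # negotiated maximum segment size, in bytes
REQUEST_LEN = 2246    # TCP payload length of the HTTP GET, in bytes
FIRST_SEQ = 1         # relative sequence number of the first payload byte
SERVER_ACK = 1393     # ACK number returned by the server

# Sequence number that would follow the last byte of the request.
next_seq = FIRST_SEQ + REQUEST_LEN

# Bytes the server has confirmed, and bytes still waiting for an ACK.
acked_bytes = SERVER_ACK - FIRST_SEQ
unacked_bytes = next_seq - SERVER_ACK

print(f"NextSeq: {next_seq}")
print(f"Acknowledged: {acked_bytes} bytes (one full MSS: {acked_bytes == MSS})")
print(f"Still unacknowledged: {unacked_bytes} bytes")
```

So the server acknowledged exactly one full-size segment, and the retransmissions all cover the remaining bytes of the same request.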

It looks to me like the ACK packets from the server do not make it to the client: the client never gets the ACK for the extra 854 bytes. I read that TCP retransmissions are caused by network congestion or packet loss.

I decided to check the statistics of the C9200L port where the client is connected and I saw a lot of output drops. The TCP I/O graph in Wireshark also shows almost 900 000 bits/1ms during that period.
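For scale, the 900 000 bits per 1 ms figure can be converted to a per-second rate. This is a rough Python sketch; the 1 Gbps access port speed is my assumption, since it is not stated above.

```python
# Peak value read from the I/O graph: bits seen in a single 1 ms bucket.
BITS_PER_MS = 900_000
MS_PER_SECOND = 1_000

# Assumption: the client access port runs at 1 Gbps.
PORT_SPEED_BPS = 1_000_000_000

# Rate the port would carry if that 1 ms bucket were sustained for a second.
rate_bps = BITS_PER_MS * MS_PER_SECOND
utilization = rate_bps / PORT_SPEED_BPS

print(f"Instantaneous rate: {rate_bps / 1e6:.0f} Mbps")
print(f"Share of an assumed 1 Gbps port: {utilization:.0%}")
```

If that assumption holds, the port is close to line rate during those milliseconds even though the average load is tiny.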

Am I right to think that this slowness is likely caused by microburst traffic? They used to have 2X 1Gbps uplinks before and didn't have any issues. Could the user port be oversubscribed by the 10Gbps links and dropping the ACK packets because the buffer is full? SNMP doesn't show any signs of congestion, but I guess the polling interval is too long.
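Here is a rough Python sketch of why a microburst could fill a port buffer and still be invisible to SNMP. The buffer size, burst length, and polling interval are all hypothetical placeholders, not actual C9200L or site values.

```python
def time_to_fill_buffer(buffer_bytes, ingress_bps, egress_bps):
    """Seconds until a queue of buffer_bytes overflows during a sustained burst."""
    if ingress_bps <= egress_bps:
        return float("inf")  # the queue drains as fast as it fills
    return buffer_bytes * 8 / (ingress_bps - egress_bps)


def averaged_utilization(burst_s, poll_interval_s):
    """Utilization an interval-averaged counter reports for one line-rate burst."""
    return burst_s / poll_interval_s


# Hypothetical values, for illustration only.
BUFFER_BYTES = 200_000          # assumed per-port queue depth, NOT a C9200L spec
INGRESS_BPS = 10_000_000_000    # burst arriving over a 10 Gbps uplink
EGRESS_BPS = 1_000_000_000      # assumed 1 Gbps access port
BURST_S = 0.010                 # assumed 10 ms burst at line rate
POLL_INTERVAL_S = 300           # assumed 5-minute SNMP polling interval

fill_s = time_to_fill_buffer(BUFFER_BYTES, INGRESS_BPS, EGRESS_BPS)
avg = averaged_utilization(BURST_S, POLL_INTERVAL_S)

print(f"Buffer full after about {fill_s * 1e6:.0f} microseconds")
print(f"Utilization seen over one poll: {avg:.4%}")
```

With these made-up numbers the queue overflows in under 200 microseconds, while the averaged counter for the same interval barely moves. The real figures depend on the actual buffer allocation and traffic pattern.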

What are my options?

- Drop down the 2X 10Gbps to 2X 1Gbps?

- QoS? (Not very familiar with QoS)

- I saw in Cisco white papers that I could try the qos queue-softmax-multiplier command

If I drop to 2X 1Gbps, won't we experience the same issue in the opposite direction? Right now there are only 15-20 users, but after Covid there will be around 500.

Thank you for all your help, everyone! I've been learning a lot from this community since I started in IT last year. Very grateful!!


