Thursday, April 22, 2021

How I solved the weirdest speed bottleneck problem

I've been setting up a few routers in a router-on-a-stick configuration, and I had the weirdest speed problem crop up, one that technically never should have happened. It all started when I daisy-chained an upstream router to a downstream switch + router, which used SFP+ ports and VLANs to make a theoretically perfect 1-gigabit connection. Then I noticed a BIG problem. When connecting directly to the upstream router, I could get the full 940/940 Mbps, which is limited only by the 1000BASE-T standard. However, when connecting to the downstream router, which was connected via SFP+ to a 10-gigabit switch, I could only get 800/60 Mbps at most.

At first, I thought it was a simple CPU problem, but it couldn't have been, because the router was capable of routing the full speed of the SFP+ port. Next, I thought it was a switch offloading problem. VLANs can sometimes be tricky, so I checked that the switch chip was handling everything, and it was. Fancy QoS queues can take a toll on processors, so I disabled those as well. I still couldn't get above a 60-megabit upload speed, even though it should have been more than 10 times that, so I knew it had to be an L1 problem.

This gremlin persisted for days until it hit me: upstream, the router only had a plain old gigabit Ethernet port. I was connecting to it via a 10GBASE-T SFP+ module, which kept auto-negotiating to the full 10G speed. What was happening is that the SFP+ kept sending 10G-encoded signals to a gigabit port that couldn't understand MOST of them, though obviously a few made it through and could be understood. However, the signals sent by that gigabit port were all somehow being understood by the SFP+. This shouldn't technically happen, given the wildly different standards for 1000BASE-T and 10GBASE-T, but it did.

As a quick fix, I disabled auto-negotiation and set the SFP+ to a normal 1-gigabit speed, because the switch listed the module as working at that speed. This made the problem worse, as now NO packets could be sent at all. My other SFP+ modules did support doing this, so maybe it was just a junk model that I got. I then tried swapping in a plain-jane gigabit SFP, and I suddenly got the full 940/940 speed downstream.

What did I learn? Make sure your PHY rates make sense, and make sure your SFP+ twisted-pair modules support slower speeds.
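
For anyone wondering where the 940 figure comes from, here is a rough back-of-the-envelope sketch in Python. It is not something I ran as part of the troubleshooting. It assumes a standard 1500-byte MTU, IPv4, and TCP with the timestamp option enabled, and the function name `tcp_goodput_mbps` and its parameters are made up purely for illustration.

```python
def tcp_goodput_mbps(line_rate_mbps=1000.0, mtu=1500,
                     tcp_timestamps=True, vlan_tag=False):
    """Estimate the best-case TCP payload rate over Ethernet, in Mbps.

    Assumes IPv4, full-size packets, and no retransmissions or other losses.
    """
    # Bytes that occupy the wire around every packet:
    # preamble (7) + start-of-frame delimiter (1) + Ethernet header (14)
    # + frame check sequence (4) + minimum inter-frame gap (12).
    wire_overhead = 7 + 1 + 14 + 4 + 12
    if vlan_tag:
        wire_overhead += 4  # an 802.1Q tag adds 4 bytes to the frame

    # Bytes inside the packet that are headers rather than user data:
    # IPv4 header (20) + TCP header (20) + optional TCP timestamps (12).
    header_bytes = 20 + 20 + (12 if tcp_timestamps else 0)

    payload = mtu - header_bytes
    return line_rate_mbps * payload / (mtu + wire_overhead)


if __name__ == "__main__":
    print(f"untagged:    {tcp_goodput_mbps():.1f} Mbps")
    print(f"VLAN-tagged: {tcp_goodput_mbps(vlan_tag=True):.1f} Mbps")
```

With those assumptions, the untagged case works out to about 941 Mbps and the VLAN-tagged case to about 939 Mbps, which is why a healthy gigabit link tops out around 940 in a speed test. Anything far below that, like my 60 Mbps upload, points to a problem somewhere other than protocol overhead.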


