Wednesday, December 6, 2017

Weird issues with dark fiber.

Hi all - quick question. To preface, I know the hardware is older and not great. It was purchased before I came to this company and I can't do much about it right now. We have plans to do better, but I have to work with what I've got for now.

We have two locations connected via a dedicated two-strand dark fiber. On each end is a Cisco 6500 with a 4-port 10G blade, and the circuit in question terminates on ZR XENPAKs. Maybe 3 or 4 times a year, this circuit will randomly die somewhere between 3 and 5 AM. It's always during that window (which may or may not be relevant), and until recently it wasn't happening that often.

In the last week, it has happened 3 or 4 times, and it really messes things up. Leaving aside the details of why that's a problem, I'm trying to figure out what we can do to keep troubleshooting. The logs from the two 6500s show different things. On the main side (the building I work in), you can see iBGP flap briefly and that's it. On the remote end (the datacenter), you can see the link status go down and come back up. The outage is usually milliseconds long, and sometimes happens 2 or 3 times back to back before it stops. You can then see iBGP reconverge.
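Since the drops land in a predictable window, I'm thinking about logging the Rx light levels overnight so I can see whether the power sags right before a drop. Below is a rough Python sketch of what I have in mind; the management IPs, SNMP community, and the sensor OID index are placeholders I'd still have to fill in for our chassis (the real Rx-power sensor index would come from walking CISCO-ENTITY-SENSOR-MIB on each 6500).

#!/usr/bin/env python3
# Rough sketch: log Rx power from both 6500s once a minute so a 3-5 AM flap
# can be lined up against the light level just before the drop.
# Placeholders (not real values from our network): the management IPs, the
# SNMP community, and the sensor index on the entSensorValue OID.
import subprocess
import time
from datetime import datetime

SWITCHES = {
    "local-6500": "192.0.2.10",
    "remote-6500": "192.0.2.20",
}
COMMUNITY = "public"
RX_POWER_OID = "1.3.6.1.4.1.9.9.91.1.1.1.1.4.XXXX"  # entSensorValue.<index>, placeholder

def poll(host):
    # Uses net-snmp's snmpget; -Oqv prints just the value.
    cmd = ["snmpget", "-v", "2c", "-c", COMMUNITY, "-Oqv", host, RX_POWER_OID]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True, timeout=10)
    return result.stdout.strip() or result.stderr.strip()

if __name__ == "__main__":
    with open("rx_power.log", "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            for name, ip in SWITCHES.items():
                log.write("%s %s rx_power=%s\n" % (stamp, name, poll(ip)))
            log.flush()
            time.sleep(60)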

So far, I've replaced an excessively long SMF patch cable on the remote end with a 1 m patch. It was tightly coiled and zip-tied, which obviously isn't good for a fiber patch. That didn't fix it; the problem recurred about 4 days later. But we're starting with the easiest things to swap that take the least amount of time.

Tonight, I'm headed up to replace the XENPAK module with a spare and move it to another open port on that blade (just to rule those out). The light levels I'm seeing are within tolerance, but they vary from side to side. The receive floor on these modules is -24 dBm, and -7 dBm is considered peak/high. The local end where I am usually shows +1.0 dBm Tx power and -15 dBm Rx power, whereas the datacenter (remote) end shows +2.0 dBm Tx power and -18 dBm Rx power. These levels aren't optimal, but they're within spec. Many folks say numbers in this range suggest there may be a fault somewhere in the run, but there obviously isn't a lot I can do about that right now.
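To put numbers on those readings, here's the rough per-direction loss math (just arithmetic on the figures above, using the -24 dBm floor I mentioned as the reference):

# Per-direction loss math from the readings above (all dBm figures quoted earlier).
local_tx, local_rx = 1.0, -15.0      # my side: Tx / Rx
remote_tx, remote_rx = 2.0, -18.0    # datacenter side: Tx / Rx
rx_floor = -24.0                     # quoted receive floor for these XENPAKs

loss_to_remote = local_tx - remote_rx    # what I launch vs. what the far end sees
loss_to_local = remote_tx - local_rx     # what the far end launches vs. what I see

print("Loss toward datacenter: %.1f dB, margin above floor: %.1f dB"
      % (loss_to_remote, remote_rx - rx_floor))
print("Loss toward my side:    %.1f dB, margin above floor: %.1f dB"
      % (loss_to_local, local_rx - rx_floor))
# Prints 19.0 dB / 6.0 dB one way and 17.0 dB / 9.0 dB the other. The ~2 dB
# asymmetry between the two strands is the kind of thing people point at when
# they say there may be a bad splice or dirty connector somewhere in the run.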

We're adding a second link, and when we do, we're going to have the carrier for this dark fiber run an OTDR trace on the span, but we can't do that quite yet.

Has anyone seen issues like this before? One side clearly logs a lot more than the other, so it definitely seems like it could be a hardware problem, but I only have a couple more items I can replace before I'm at the end of my rope.

Thanks for any insight!


