Friday, March 1, 2019

Strange loopback reachability issue

I have run into a very strange problem that I've been pulling my hair out over for the past few hours. I operate an ISP with the following architecture. Every link in this diagram runs IS-IS for loopback reachability, with LDP on top of that for MPLS label distribution.

We distribute only transit (link) subnets and loopbacks into the IGP.

One specific loopback address (10.30.1.74) is unreachable from one of our core routers (Core-02). Core-01 can ping 10.30.1.74 just fine; Core-02, however, cannot reach it.

To try to figure out what is going on, I have done the following:

Traceroute from Core-01 to 10.30.1.74:

root@Core-01> traceroute 10.30.1.74
traceroute to 10.30.1.74 (10.30.1.74), 30 hops max, 52 byte packets
1 172.16.15.2 (172.16.15.2) 15.830 ms 21.660 ms 21.762 ms
2 172.16.15.14 (172.16.15.14) 14.914 ms 21.869 ms 27.982 ms
3 172.16.20.177 (172.16.20.177) 19.808 ms 21.863 ms 22.025 ms
4 172.16.22.220 (172.16.22.220) 0.943 ms 0.857 ms 0.814 ms
5 10.30.1.74 (10.30.1.74) 1.059 ms 0.858 ms 0.803 ms

Perfect, this is working fine.

Now let's try that from Core-02:

root@Core-02> traceroute 10.30.1.74
traceroute to 10.30.1.74 (10.30.1.74), 30 hops max, 52 byte packets
1 172.16.15.6 (172.16.15.6) 8.094 ms 21.316 ms 22.125 ms
2 172.16.15.22 (172.16.15.22) 40.049 ms 32.727 ms 35.296 ms
3 172.16.20.177 (172.16.20.177) 40.850 ms 42.099 ms 32.925 ms
4 172.16.22.220 (172.16.22.220) 18.017 ms 21.872 ms 21.895 ms
5 * * *
6 * * *

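OK, that's not good. The trace follows the same last two hops as the one from Core-01 (172.16.20.177, then 172.16.22.220), but then seems to be "getting stuck" between PE-02 and PE-03.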
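To make this comparison less error-prone than eyeballing it, here is a minimal Python sketch that parses the two traceroute outputs above and reports where they overlap and where the failing one stops responding. The function names (parse_hops, last_responding_hop, shared_hops) are hypothetical helpers made up for illustration, and the sketch assumes the output has simply been pasted into strings.

```python
import re

# Matches a hop line such as "4 172.16.22.220 (172.16.22.220) 0.943 ms ..."
# or "5 * * *". The prompt and the "traceroute to ..." header do not match.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)")

CORE_01 = """
traceroute to 10.30.1.74 (10.30.1.74), 30 hops max, 52 byte packets
1 172.16.15.2 (172.16.15.2) 15.830 ms 21.660 ms 21.762 ms
2 172.16.15.14 (172.16.15.14) 14.914 ms 21.869 ms 27.982 ms
3 172.16.20.177 (172.16.20.177) 19.808 ms 21.863 ms 22.025 ms
4 172.16.22.220 (172.16.22.220) 0.943 ms 0.857 ms 0.814 ms
5 10.30.1.74 (10.30.1.74) 1.059 ms 0.858 ms 0.803 ms
"""

CORE_02 = """
traceroute to 10.30.1.74 (10.30.1.74), 30 hops max, 52 byte packets
1 172.16.15.6 (172.16.15.6) 8.094 ms 21.316 ms 22.125 ms
2 172.16.15.22 (172.16.15.22) 40.049 ms 32.727 ms 35.296 ms
3 172.16.20.177 (172.16.20.177) 40.850 ms 42.099 ms 32.925 ms
4 172.16.22.220 (172.16.22.220) 18.017 ms 21.872 ms 21.895 ms
5 * * *
6 * * *
"""


def parse_hops(output):
    """Return [(hop_number, address_or_None), ...]; None means the hop timed out."""
    hops = []
    for line in output.splitlines():
        match = HOP_RE.match(line)
        if match:
            address = match.group(2)
            hops.append((int(match.group(1)), None if address == "*" else address))
    return hops


def last_responding_hop(hops):
    """Return the last (hop_number, address) that answered, or None if none did."""
    answered = [hop for hop in hops if hop[1] is not None]
    return answered[-1] if answered else None


def shared_hops(hops_a, hops_b):
    """Return addresses that appear in both traces, in the order of hops_a."""
    seen_in_b = {address for _, address in hops_b if address is not None}
    return [address for _, address in hops_a if address in seen_in_b]


if __name__ == "__main__":
    good, bad = parse_hops(CORE_01), parse_hops(CORE_02)
    print("shared hops:", shared_hops(good, bad))
    print("last reply seen from Core-02:", last_responding_hop(bad))
```

Run against the two outputs above, this only confirms what the raw traces already show: both paths share 172.16.20.177 and 172.16.22.220, and from Core-02 nothing answers after 172.16.22.220.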

So, as a sanity check, let's traceroute from PE-03 to Core-02:

root@PE-03> traceroute 10.10.0.21
traceroute to 10.10.0.21 (10.10.0.21), 30 hops max, 40 byte packets
1 172.16.32.17 (172.16.32.17) 1.051 ms 1.111 ms 0.817 ms
2 172.16.22.217 (172.16.22.217) 15.297 ms 18.195 ms 21.619 ms
3 172.16.20.180 (172.16.20.180) 16.360 ms 18.245 ms 21.903 ms
4 172.16.15.21 (172.16.15.21) 18.458 ms 20.179 ms 19.407 ms
5 10.10.0.21 (10.10.0.21) 1.007 ms 1.078 ms 0.906 ms

Weird, that seems to work fine.

Now it gets even weirder. Let's change the loopback address of PE-03 from 10.30.1.74 to 10.30.1.80.

Once I do this, there are no reachability issues between any routers. Both Core-01 and Core-02 can reach 10.30.1.80 without issue.

10.30.1.74 is not used anywhere else on my network. If I take PE-03 offline, 10.30.1.74 does not appear in the IS-IS database or the LDP database at all, so this is not caused by a duplicate address or route.

Any troubleshooting ideas on what I should try next? Sure, I can just throw away 10.30.1.74 and never use it again, but I would really like to know what's going on here, since it could be a symptom of a larger issue.

Also, please let me know if you would like me to post any additional command outputs from the routers!


