Thursday, September 23, 2021

Weirdness - give me your theories on the cause

So, I had something interesting happen at work today. I've put in a successful workaround (mentioned below), but neither I nor any of my coworkers has any idea what is going on.

A few things to get out of the way ahead of time here:

  1. There are no firewalls anywhere in this path
  2. There is no IP conflict - trust me, it's been investigated very thoroughly
  3. Everything that isn't an L3 point-to-point link is a /24 subnet
  4. There are some ACLs, but they were unbound from the SVIs for the sake of testing.

So here's the scenario:

I have two PCs on the same subnet, in the same VLAN, on the same L3 switch. Their IP addresses are 10.50.206.30 and 10.50.206.31. They both need to reach a particular IP in the data center (10.50.0.51). For the purposes of our situation, a successful ping can be considered a success.
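To make the addressing concrete, here is a minimal sketch using Python's standard `ipaddress` module. It assumes the /24 prefix length from the ground rules above applies to both the client subnet and the data center subnet; the variable names are just for illustration.

```python
import ipaddress

# Addresses taken from the post; the /24 prefix length comes from the ground rules.
hosts = [ipaddress.ip_address("10.50.206.30"), ipaddress.ip_address("10.50.206.31")]
destination = ipaddress.ip_address("10.50.0.51")

# strict=False derives the network from a host address instead of raising an error.
client_net = ipaddress.ip_network("10.50.206.30/24", strict=False)
dest_net = ipaddress.ip_network("10.50.0.51/24", strict=False)

# Both PCs should sit in the same subnet, and the destination should be off-subnet,
# which means traffic from either PC has to be routed to get there.
same_subnet = all(h in client_net for h in hosts)
routed = destination not in client_net

print(client_net, dest_net, same_subnet, routed)
```

So both PCs share 10.50.206.0/24, the destination lives in 10.50.0.0/24, and both PCs should be taking the same routed path to reach it.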

The topology is this: an L3 edge switch (4510R+E), dual-uplinked with one link to each of two campus core switches (N77-7710). The data center destination switches are a pair of C9300s. Each C9300 is uplinked to both campus core switches, and there is also a trunk between the two C9300s. The VLAN of interest on these two switches contains the IP we're trying to reach, each switch has an SVI in that subnet, and HSRP between them provides the gateway for that subnet.

OSPF is used for routing everything.

The problem itself is pretty simple: 10.50.206.30 can ping our destination of 10.50.0.51, while 10.50.206.31 cannot.

Now, you're going to see some pretty stupid steps taken here, but it's because anything that made sense as a possible solution did nothing for us. While trying to determine what was going on here, the following steps were taken:

  1. Disabled one of the uplinks between the edge switch and campus core. Didn't help.
  2. Moved HSRP active to a specific C9300, downed both the uplinks from the second C9300 to the campus core switches. Didn't help.
  3. Brought those links back up because nobody wants to be running non-redundant.
  4. Did pings from the edge switch, sourced from the SVI for the 10.50.206.0/24 subnet, against the C9300#1 SVI, C9300#2 SVI, HSRP address, and 10.50.0.51. All succeeded.
  5. Moved the 10.50.206.31 machine to another port (hell, why not?) and the pings continued to fail (note, 10.50.206.30 was in production, couldn't just swap IPs between them to see what happened).
  6. Disconnected the device at 10.50.206.31, connected a laptop to that port with an address of 10.50.206.31. Pings failed. (I was surprised, I was sure it was something wrong with the original machine)
  7. Changed the laptop to a new address at 10.50.206.33. Pings against 10.50.0.51 succeeded.
  8. Disconnected the laptop, changed the device that was originally 10.50.206.31 to 10.50.206.33. Pings succeeded.
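As a sanity check on the reasoning, the test results above can be written down as data, asking which variable actually predicts the outcome. This is only a sketch: the device and port labels (`pc-a`, `port-1`, and so on) are made up for illustration, and it assumes the device in step 8 went back onto the same port the laptop was using.

```python
from collections import defaultdict

# (device, port, source IP, ping succeeded?). Device and port labels are hypothetical.
observations = [
    ("pc-a", "port-0", "10.50.206.30", True),     # the working production PC
    ("pc-b", "port-1", "10.50.206.31", False),    # the original failure
    ("pc-b", "port-2", "10.50.206.31", False),    # step 5: moved to another port
    ("laptop", "port-2", "10.50.206.31", False),  # step 6: laptop with the same IP
    ("laptop", "port-2", "10.50.206.33", True),   # step 7: laptop with a new IP
    ("pc-b", "port-2", "10.50.206.33", True),     # step 8: original device, new IP
]
FIELDS = ("device", "port", "src_ip")


def consistent_fields(rows):
    """Return the fields where every value always maps to a single outcome."""
    result = []
    for idx, name in enumerate(FIELDS):
        outcomes = defaultdict(set)
        for row in rows:
            outcomes[row[idx]].add(row[3])
        # A field explains the results only if no value saw both success and failure.
        if all(len(seen) == 1 for seen in outcomes.values()):
            result.append(name)
    return result


print(consistent_fields(observations))  # ['src_ip']
```

The same device and the same port each show both a success and a failure, so the source IP is the only variable that lines up with the results.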

And after that, we had to think really hard about what could be going on here, because "bad IP address" seems like the only possibility, and that isn't actually a thing.

Anyone have any good theories on what could be a possible cause here? If there's a good enough theory that won't impact production, I'll go onto the campus, static myself a 10.50.206.31, and see if I can ping 10.50.0.51.


