Thursday, February 28, 2019

Weird traffic cutouts after connection added to a site

Hey guys. Got a confusing mystery here and just wanted to bounce it off you experienced people for some ideas or insight.

The Layout

So there's a small school district that consists of a few schools. The schools connect back to one "main" school via a gigabit ELAN provided by an ISP. The main school has an L3 switch which handles all the routing. For Internet, a connection goes through the ELAN to a county office. It's not ideal, but things work fine. Or at least they did...

One day a charter school is added to the mix. They work out an arrangement to connect into the district network via Ubiquiti wireless units and piggyback back to the county for Internet access. They get their own VLAN from their site all the way back to county, where they get Internet access as well as VPN access to their other charter schools.

Here's a bad diagram.

Ever since that happened, some subtle but nonetheless weird things occur on the district network. In particular, monitoring software that periodically pings district equipment will intermittently report switches, cameras, and other monitored network equipment as offline. This seems to happen only during peak working hours, when the networks are in use. It does not happen on weekends or during weeks when school is not in session. The charter school network's equipment doesn't appear to be affected at all.

One day a storm knocked out the charter wireless link for a couple of days, and the issue went away. When that link was repaired, the issue came back.

Summary of Troubleshooting

  1. The monitoring server is at School1, which is where the routing switch is. I packet traced the ELAN-facing port here and the ELAN-facing port at School3. I could see the pings get sent from School1 and arrive at School3. I could then see the ping replies leave School3 on their way back. But I never saw those replies arrive at School1.

  2. Networking equipment within School1 doesn't exhibit this behavior (pings here never have to traverse the ELAN).

  3. This behavior tends to clear up within 10 minutes, but then different switches or cameras, often at a different school, will exhibit the same behavior.

  4. Manually pinging a "down" switch during one of these episodes is met with no reply. Telnet, etc., doesn't work.

  5. Surprisingly, I haven't heard any complaints about phone calls or web videos mysteriously stopping.

  6. I thought maybe STP might be in conflict somewhere, so I filtered BPDUs on the links between the different networks (district/charter, district/county). Didn't change anything.

  7. I also monitored the port utilization on the ELAN port at School1 and School3, thinking maybe they were congested, but School3 is rarely above 100 Mb/s and School1 is rarely above 300 Mb/s.
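
For anyone wanting to repeat the comparison from step 1 across more than one episode, here is a rough Python sketch of how it could be automated. It is not what I actually ran. It assumes the ICMP packets from each capture point have already been exported into simple records; the IcmpRecord layout, the function names, and the IP addresses are all made up for illustration. The idea is just to match echo replies by source, destination, ICMP identifier, and sequence number, and list the replies that were seen leaving the far site but never seen arriving at School1.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class IcmpRecord(NamedTuple):
    """One ICMP packet exported from a capture (hypothetical layout)."""
    src: str        # source IP address
    dst: str        # destination IP address
    icmp_type: int  # 8 = echo request, 0 = echo reply
    ident: int      # ICMP identifier
    seq: int        # ICMP sequence number


ECHO_REPLY = 0
ReplyKey = Tuple[str, str, int, int]


def reply_keys(records: Iterable[IcmpRecord]) -> Set[ReplyKey]:
    """Collect a unique key for every echo reply in a capture."""
    return {
        (r.src, r.dst, r.ident, r.seq)
        for r in records
        if r.icmp_type == ECHO_REPLY
    }


def replies_lost_in_transit(
    far_side: Iterable[IcmpRecord], near_side: Iterable[IcmpRecord]
) -> List[ReplyKey]:
    """Replies seen at the far site but missing at the monitoring site."""
    return sorted(reply_keys(far_side) - reply_keys(near_side))


# Made-up addresses: 10.1.0.10 is the monitoring server at School1,
# 10.3.0.20 is a switch at School3.
school3_capture = [
    IcmpRecord("10.1.0.10", "10.3.0.20", 8, 1, 1),
    IcmpRecord("10.3.0.20", "10.1.0.10", 0, 1, 1),
    IcmpRecord("10.1.0.10", "10.3.0.20", 8, 1, 2),
    IcmpRecord("10.3.0.20", "10.1.0.10", 0, 1, 2),
]
school1_capture = [
    IcmpRecord("10.1.0.10", "10.3.0.20", 8, 1, 1),
    IcmpRecord("10.3.0.20", "10.1.0.10", 0, 1, 1),
    IcmpRecord("10.1.0.10", "10.3.0.20", 8, 1, 2),
    # the reply for seq 2 never shows up at School1
]

lost = replies_lost_in_transit(school3_capture, school1_capture)
print(lost)  # [('10.3.0.20', '10.1.0.10', 1, 2)]
```

Matching on identifier and sequence number rather than timestamps avoids depending on the clocks at the two capture points agreeing with each other.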

Any ideas about what could be causing this weird behavior? Part of me wants to just say it's the ISP's (ELAN) fault because I can see the ping replies disappearing in it. But I really don't get why it only happens when the charter connection is up.


