Thursday, January 17, 2019

Could I get some advice on this intermittent drop? I'm banging my head on my desk and need a fresh opinion, please.

Here's the scenario:

I am periodically losing WAN connection to a publicly routed device at random times, and it's always down for about 20 minutes.

We have a FirstComm Juniper ISP gateway that plugs into a 3750G switch. Off that switch, we have several devices (all in the same public IP block and on the same VLAN):

  • Our main ASA firewall, with our main LAN behind it
  • A third-party ASA that connects to a server
  • A Juniper Oracle VPN device, which people on the LAN connect to in order to send work to a third-party site.

Facts:

  • I work in a different location and monitor these devices remotely.
  • I never lose connection to the main ASA in either direction.
  • Periodically, we lose WAN connections to both the Oracle device and the third-party secondary ASA, but never at the same time, and always in blocks of about 20 minutes. Sometimes closer to 15, sometimes 25, but always in that window.
  • When I lose connection to these devices from the WAN, the connection from behind the main LAN ASA never drops, even though the LAN hosts are pinging the same public IP that I am pinging from the WAN. I have a constant ICMP test running to these devices from both the WAN and the main LAN, and it drops from the WAN, but never from the LAN.
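To pin down whether the drops really fall in that 15 to 25 minute window, the ping results could be reduced to outage windows. The following is a minimal sketch, assuming the ICMP test results have already been exported as a time-sorted list of (timestamp, reachable) pairs; the function name `outage_windows` and the sample data are hypothetical, not something from my monitoring tool.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Collapse time-sorted (timestamp, ok) probe results into outage windows.

    Returns a list of (start, end, duration). If the log ends while the
    target is still down, end and duration are None.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # first failed probe opens a window
        elif ok and start is not None:
            windows.append((start, ts, ts - start))  # first good probe closes it
            start = None
    if start is not None:
        windows.append((start, None, None))  # still down at end of log
    return windows


if __name__ == "__main__":
    # Made-up data: one probe per minute, unreachable from minute 5 to minute 24.
    t0 = datetime(2019, 1, 17, 9, 0)
    samples = [(t0 + timedelta(minutes=m), not (5 <= m < 25)) for m in range(40)]
    for start, end, duration in outage_windows(samples):
        print(start, end, duration)
```

Running the same function over the WAN log and the LAN log side by side would show whether the LAN really never misses a probe during a WAN outage window.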

I have Wireshark capturing on two switch ports: the ISP port to FirstComm and the Oracle device's port. When it drops, I see the ICMP requests entering the ISP port on the 3750G, and I see them leaving the Oracle port. I see the Oracle device reply, but the reply never shows up on the ISP port. The ICMP reply gets lost after it enters the 3750 from the Oracle device.
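The comparison between the two capture points can be made mechanical. This is a rough sketch, assuming the echo replies from each capture have been exported as (ICMP identifier, sequence number) pairs; the function name `missing_replies` and the sample values are hypothetical.

```python
def missing_replies(oracle_port_replies, isp_port_replies):
    """Return echo replies seen on the Oracle port but never on the ISP port.

    Each argument is an iterable of (icmp_id, icmp_seq) tuples taken from
    the echo replies captured at that switch port.
    """
    seen_at_isp = set(isp_port_replies)
    return sorted(key for key in set(oracle_port_replies) if key not in seen_at_isp)


if __name__ == "__main__":
    # Made-up data: three replies leave the Oracle device, one reaches the ISP port.
    oracle_port = [(1, 1), (1, 2), (1, 3)]
    isp_port = [(1, 1)]
    print(missing_replies(oracle_port, isp_port))
```

Any pair in the result is a reply that entered the switch from the Oracle device and was not forwarded out the ISP port, which is exactly the symptom described above.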

I have verified that the MAC address isn't changing when I lose connection, and I know that the ISP gateway's ARP entry isn't getting hijacked because I never lose connection to the main LAN. And to state it again, when the Oracle device becomes unreachable from the WAN, the LAN can still ping the same public IP, and that ping never drops.
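The MAC check could also be automated rather than eyeballed in Wireshark. Here is a minimal sketch, assuming the capture has been reduced to (IP, MAC) pairs taken from ARP or Ethernet headers; the function name `ips_with_multiple_macs` is hypothetical, and the addresses are documentation placeholders rather than my real ones.

```python
from collections import defaultdict


def ips_with_multiple_macs(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    observations is an iterable of (ip, mac) string pairs. MACs are
    lowercased so that formatting differences do not count as a change.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Placeholder addresses only; 203.0.113.0/24 is reserved for documentation.
    observations = [
        ("203.0.113.1", "00:00:5E:00:53:01"),
        ("203.0.113.1", "00:00:5e:00:53:01"),
        ("203.0.113.10", "00:00:5e:00:53:0a"),
    ]
    print(ips_with_multiple_macs(observations))
```

An empty result across a capture that spans an outage would support the claim that nothing is taking over the addresses.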

My first inclination was to replace the switch, which I did. It's the same model, but a completely different version of the firmware (went from IOS 15 to stable 12), and it made no difference. My second thought was that something was taking over the ARP, but the MAC addresses aren't changing in Wireshark, and some devices can always reach them anyway.

I don't think it's the ISP, as the traffic is coming in from their gateway, and the Oracle device is replying. I'm not seeing anything that correlates with the drops, like increased CPU, increased traffic, etc.

I know there is a logical explanation, but maybe I've been staring at this for too long to see it. :(


