Wednesday, November 24, 2021

ASA inside interface stopped working, now DR site is down

Hello - I'm currently troubleshooting an ongoing outage on about 4 hours of sleep, so I apologize for the brevity and/or ranting, and for the several things I'm leaving out. I can provide some config snippets, but I need to know what to grab, as my options are limited at the moment.
To make matters worse, neither our reseller nor I realized that our SmartNet contracts are all sorts of jumbled up, so I can't get Cisco help until that's resolved, and my rep is on vacation all week. So I'm hoping someone here has some ideas of what to check; otherwise, I'm a bit up a creek atm.

  • I have two ASAs, both 5508 running single context, active/standby failover on 9.15(1)1.
  • Yesterday, while connected to the site via IPSec site-to-site VPN, I lost connection to some servers I had been working on (nothing network related). I checked my logs and confirmed with others that have access to the site that I was the only one connected to it at the time.
  • We utilize 3 interfaces (technically a 4th for the failover connection) - outside, inside (10.91.x.x subnet), and development (10.31.x.x subnet). These subnets are segregated from one another for contractual reasons.
  • Inside is blocking nearly all(? from what I can tell) traffic, even traffic from 10.91.x.x to 10.91.x.x IPs. EXCEPT on one server, I was able to ping its DNS server consistently(???). Couldn't ping the gateway IP (I do have ICMP allowed, and 10.31.x.x can ping its gateway interface).
  • ACLs look good (I have inside_in and inside_out any/any set up with tcp, ip, udp, icmp, and gre permitted, outside_in set up with inside-network/any), FW rules look good, IPs look good. Even though it looks good to me, I have a feeling this is where the issue lies.
  • I can access the URL for my RA VPN, but login fails because the authentication server is inaccessible.
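
One sanity check on the 10.91.x.x to 10.91.x.x case, as a minimal sketch: it assumes the servers use the same 255.255.0.0 mask as the inside interface (10.91.0.1, per the interface summary further down), and the two host addresses are hypothetical placeholders. If both hosts land in the same connected network, that traffic would normally be switched at layer 2 and would not be evaluated by the ASA's ACLs at all.

```python
import ipaddress

# Inside interface address and mask, as shown in the interface summary in this post.
inside = ipaddress.ip_interface("10.91.0.1/255.255.0.0")

# Hypothetical host addresses for illustration only; substitute two real servers.
host_a = ipaddress.ip_address("10.91.4.20")
host_b = ipaddress.ip_address("10.91.7.35")


def same_segment(a, b, iface):
    """Return True if both hosts fall inside the interface's connected network."""
    return a in iface.network and b in iface.network


print("Connected inside network:", inside.network)
print("Same segment:", same_segment(host_a, host_b, inside))
```

If this prints True for a pair of servers that cannot reach each other, the switching path between them may be a better place to look than the firewall rules.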

Troubleshooting I've done up to this point:

  • Failed over to the secondary unit
  • Rebooted both units
  • Swapped cables and ports on the ASAs (including reconfiguring temporarily to test)
  • Bypassed switches to rule out failure, also rebooted them just in case
  • Tried to check through the logs, didn't really find anything* though I'm not 100% sure what I'm looking for
  • Ran packet-tracer on 10.91.x.x to various IPs, internal and external, and I did see drops from the implicit rule, but I'm not sure why this would occur since I have explicit rules to allow the traffic
  • Went to restore a previously known-good config - except that server has crashed on me (this is my utility server I was working on replacing, my fault for not having multiple backups, but it's considered sensitive data and I have a limited budget/resources to work with)
  • Compared config to my production setup, and aside from differing IPs and associated rules, I can't find where the discrepancy is.
  • Did a lot of 'toggling' of rules and settings to make sure they were actually applying appropriately (like same security and intra-interface)
  • Scoured the hell out of Google, Spiceworks, reddit, etc. Hard to find this specific issue since many of the terms lead to things like 'can't ping firewall interface' or 'can't ping from subnet A to subnet B across interfaces' etc.
  • I know there's more, brain is mush right now, again, apologies.

*I did notice I'm getting recurring syslogs -

  • 105005 (Primary) Lost Failover communications with mate on interface inside
  • 105008 (Primary) Testing Interface inside
  • 105009 (Primary) Testing on interface inside Passed

A few things on this: failover is working as intended (I did it several times while troubleshooting/replacing cables). The fact that this is showing on the inside interface is a bit of a head-scratcher to me, as I have a dedicated interface for failover (1/8). Config is synchronizing between the two units as intended.

I have a jank remote setup right now, and it is painfully slow (sometimes up to 30 second input delay/latency). I mention this because my project manager noticed slowness/instability while trying to copy some files to development. Not sure if it's a symptom or result, but figured it best to mention it.

Here's the inside interface summary. I forced full-duplex just to test; it was on auto. I'm seeing a very large number/percentage of packets dropped here, and there are also 36k frame input errors:

Interface GigabitEthernet1/2 "inside", is up, line protocol is up
  Hardware is Accelerator rev01, BW 1000 Mbps, DLY 10 usec
    Full-Duplex(Full-duplex), Auto-Speed(1000 Mbps)
    Input flow control is unsupported, output flow control is off
    MAC address [redacted], MTU 1500
    IP address 10.91.0.1, subnet mask 255.255.0.0
    17298289513 packets input, 1018497271172 bytes, 0 no buffer
    Received 10078447114 broadcasts, 0 runts, 0 giants
    36646 input errors, 0 CRC, 36646 frame, 0 overrun, 0 ignored, 0 abort
    0 pause input, 0 resume input
    0 L2 decode drops
    28498570 packets output, 1863301746 bytes, 0 underruns
    0 pause output, 0 resume output
    0 output errors, 0 collisions, 0 interface resets
    0 late collisions, 0 deferred
    0 input reset drops, 137 output reset drops
    input queue (blocks free curr/low): hardware (1936/1819)
    output queue (blocks free curr/low): hardware (2047/1880)
  Traffic Statistics for "inside":
    10068774740 packets input, 565640511567 bytes
    28498570 packets output, 841532575 bytes
    9042963696 packets dropped
    1 minute input rate 155031 pkts/sec, 8612806 bytes/sec
    1 minute output rate 309 pkts/sec, 8925 bytes/sec
    1 minute drop rate, 138571 pkts/sec
    5 minute input rate 170719 pkts/sec, 9507231 bytes/sec
    5 minute output rate 387 pkts/sec, 11070 bytes/sec
    5 minute drop rate, 154315 pkts/sec
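
To put numbers on "very large", here is a rough sketch that turns the counters above into ratios. The values are copied from the interface output; the variable and key names are mine and purely illustrative.

```python
# Counters copied from the "inside" interface output above.
packets_input = 17_298_289_513        # hardware-level "packets input"
broadcasts = 10_078_447_114           # "Received ... broadcasts"
stat_packets_input = 10_068_774_740   # Traffic Statistics "packets input"
packets_dropped = 9_042_963_696       # Traffic Statistics "packets dropped"
rate_in_pkts = 155_031                # 1 minute input rate, pkts/sec
rate_in_bytes = 8_612_806             # 1 minute input rate, bytes/sec
rate_drop_pkts = 138_571              # 1 minute drop rate, pkts/sec


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


summary = {
    "broadcast_pct_of_input": pct(broadcasts, packets_input),
    "dropped_pct_of_input": pct(packets_dropped, stat_packets_input),
    "drop_pct_1min": pct(rate_drop_pkts, rate_in_pkts),
    "avg_bytes_per_packet_1min": round(rate_in_bytes / rate_in_pkts, 1),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

On these counters, roughly 58% of the packets received are broadcasts, close to 90% of inbound packets are being dropped, and the average inbound packet over the last minute is only about 56 bytes.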

Any help would be incredibly appreciated. Even if you don't have any suggestions, thank you for taking the time to read.

Side note - what are you guys doing for logging? My logs always seem to fall short of my expectations. I've tried to look into best practices for this, but I couldn't find much. I also had issues trying to send these logs to our SIEM, but that's a different can of worms.


