Saturday, November 28, 2020

Juniper SRX4200 18.4R3

This is probably a long shot, but the last 2 days have been rough. We've been running 18.4R3 on this cluster for close to 2 months, and in the last 2 days the cluster has locked up twice with 0 core dumps. First, node0 stopped routing traffic early Friday morning. Only 1 gateway interface on it was still answering pings. I drove into work and plugged a console cable in directly, but got no response. The activity lights were live and there were no alarms.

I connected to node1 and it was set as disabled and ineligible to take over the resources. Looking into it, the HA was down but the FAB was still receiving heartbeats, so node1 disabled itself to prevent split brain. I had to hard boot node0, and it came back up fine. I then had to initiate a reboot on node1 to clear the disabled state. Everything came back up with no issues. I thought this might be a hardware issue on node0, so I decided to turn it off and run on node1 through the weekend.

Sure enough, 11 hours later node1 did the same thing. This time I couldn't ping any interfaces. I drove into work this morning and found it in the same state: the console port didn't respond at all, but there were still activity lights. I had to reboot it to get in. I pulled an RSI and a log package and contacted JTAC. While waiting for them to process the logs I decided to look through them myself, and sure enough there was nothing useful in them.

When the SRX locks up it stops logging, so there is a gap from the time it stops to the time it reboots.
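If you need to pin down exactly when logging stopped and resumed, a small script can scan an exported messages file for that gap. This is only a rough sketch, not something JTAC or Juniper provides: it assumes each line starts with a classic syslog timestamp such as "Nov 27 03:14:15" (no year, English month names), and the function name, the `year` argument, and the 10-minute threshold are all my own placeholders.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_before, first_after) timestamp pairs more than min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # Assumed line format: "Nov 27 03:14:15 host process: message".
            # The first 15 characters hold the timestamp; syslog omits the
            # year, so the caller has to supply it.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous > min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Feed it the lines of the log file (for example `find_log_gaps(open(path), 2020)`, where `path` is wherever you saved the file) and each returned pair brackets a stretch where nothing was logged. It only tells you when the box went quiet, not why.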

JTAC was useless because there is nothing in the logs and they need a core dump, but to get a core dump I'd have to be able to get into the device before I reboot it, which so far hasn't been possible.

Needless to say, after working 8 hours and driving to work twice over what was supposed to be my long weekend, I ended up reinstalling 15.1 on these, which is what they've been running for the last 3 years with only one hiccup, where it stopped routing to its next hop. My assumption is that this is a software bug triggered by a conf change made before Thanksgiving: adding a few hosts to a firewall rule.

Has anyone seen this issue on 18.4? I have a number of routers (SRX300 series) running 18.4 but this is the only 4200 running it.


