Firewall goes down and failover does not occur

Etheldra_Freder · ‎2019-01-21

Good Day All

For about a month we have had issues with a firewall pair where FW1 goes down and the secondary firewall does not take over the active role; it remains in "standby" mode. Once you disconnect the cable (Mgmt), it fails over, but you cannot SSH to it. You have to SSH to FW1 via FW2, then bring it up from there. We have also shut down the switch port for FW1; it failed over as well, but SSH was still not accessible. After getting to FW1 via SSH and using ifconfig to bring the sync port up, we were able to SSH to it. Both firewalls have been replaced, along with all of the cables.

I believe this may be an issue with the process that handles failover/heartbeat between the firewalls.

They are asking for another replacement of the firewalls, but I am not sure that this is a hardware issue.

Has anyone ever dealt with this and what was your solution?

Thanks

Vladimir · ‎2019-01-21

Is your Sync interface on a dedicated port or is it a part of a cluster/sync setup?

Etheldra_Freder · ‎2019-01-21

It is part of a cluster/sync setup.

Vladimir · ‎2019-01-21

Can you post the cluster topology, (obviously obfuscating public IPs)?

Etheldra_Freder · ‎2019-01-21

Is this what you needed?

Vladimir · ‎2019-01-21

Take a look at Monitoring of VLAN interfaces in ClusterXL and see if it is applicable, as by default only lowest and highest VLANs are being monitored.

Additionally, I'd suggest changing CCP mode to broadcast.

Also, your management interface for this cluster is configured as "External".

While it is not necessarily an issue, your connectivity to the standby member may be affected, if you are connecting to it from the network not directly attached to it:

https://community.checkpoint.com/message/13560-problem-accessing-standby-cluster-member-from-non-loc...

Vladimir · ‎2019-01-21

One more thing of note, and I am not sure if it is or is not an issue, is the name of the cluster interface mapped to eth3.30. You are using POS(021), which is using special characters. I'd like for someone else to chime in and clarify if this is an acceptable name for the interface. Dameon Welch-Abernathy‌, Valeri Loukine‌, Kaspars Zibarts‌, Timothy Hall‌, or anyone else who can speak with authority, I'd appreciate your input.

Alessandro_Marr · ‎2019-01-21

You could try running "show routed cluster-state detailed" to see more information about the event. Are you running any dynamic routing protocol?

Etheldra_Freder · ‎2019-01-21

Is there a way to remove the previous post? I did not remove the IPs.

Attached is the more private one.

Etheldra_Freder · ‎2019-01-21

I was able to delete it.

Timothy_Hall · ‎2019-01-21

Are you *sure* that ClusterXL is not logging anything about the cluster right around when the outage starts? Use a log filter of "type:Control" to zero in on ClusterXL-related messages; also try looking in /var/log/messages*. Any CUL (Cluster Under Load) notifications? It is not common for a Check Point firewall with the latest GA Jumbo HFA to partially fail in such a way that it cannot pass traffic without a failover occurring, as ClusterXL has its monitoring talons sunk pretty deep into the firewall code. VRRP on the other hand is another matter entirely (split-brain and routing black hole cocktails anyone?), but I digress...

If ClusterXL/messages file is not uttering a peep until you cause a manual failover, that usually suggests some kind of Layer 2 or Layer 3 issue is occurring such as Proxy ARP not working any more, duplicate IP/MAC address, a routing flap, or maybe even a switch STP issue. You can sort of confirm this by setting up monitoring of upstream and downstream IP address in ClusterXL here to cause a failover to occur: sk35780: How to configure $FWDIR/bin/clusterXL_monitor_ips script to run automatically on Gaia / Sec...
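To illustrate the idea behind that kind of IP monitoring, here is a minimal Python sketch. It is not the clusterXL_monitor_ips script from the SK and it does not trigger a failover; it only shows the "probe a list of upstream and downstream addresses and report which ones stay silent" logic. The function names, the retry count, and the example addresses are all assumptions made for illustration.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(ip: str) -> bool:
    """Send one ICMP echo with a 1 second timeout (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_targets(
    targets: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
    attempts: int = 3,
) -> List[str]:
    """Return the targets that failed every one of `attempts` probes.

    A target counts as down only if no attempt succeeds, so a single
    dropped packet does not raise a false alarm.
    """
    down = []
    for ip in targets:
        # any() stops probing as soon as one attempt succeeds.
        if not any(probe(ip) for _ in range(attempts)):
            down.append(ip)
    return down
```

Called as unreachable_targets(["192.0.2.1", "198.51.100.1"]) with your own upstream and downstream addresses, a non-empty result while ClusterXL still reports the member as healthy would point toward the Layer 2 / Layer 3 explanation rather than the firewall itself.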

It could also be some kind of transient resource issue on the firewall, but that is unlikely; you can check out this possibility by running cpview in historical mode (-t) and stepping back to the point in time right before the outage started. Unfortunately the only way to really figure this out is to get on the console of the active firewall when the outage is happening, and look at traffic trying to enter and leave the firewall interfaces with tcpdump (assuming there is any).
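When digging through /var/log/messages* for anything logged right around the outage, a small filter can save some scrolling. The following is a rough Python sketch, assuming the traditional syslog timestamp prefix ("Jan 21 10:14:58 ..."), which carries no year, so the year is taken from the outage time you pass in. The function name and the keywords you search for are your own choice; nothing here is a Check Point tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def lines_near_outage(
    lines: Iterable[str],
    outage: datetime,
    minutes: int,
    keywords: Iterable[str],
) -> List[str]:
    """Keep lines within +/- `minutes` of `outage` that contain a keyword."""
    window = timedelta(minutes=minutes)
    wanted = [k.lower() for k in keywords]
    hits = []
    for line in lines:
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS"; prepend the year.
            stamp = datetime.strptime(
                f"{outage.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if abs(stamp - outage) <= window and any(
            k in line.lower() for k in wanted
        ):
            hits.append(line)
    return hits
```

For example, reading the file with open("/var/log/messages") and passing keywords such as ["link", "cluster"] with a five minute window narrows the output to entries worth comparing against the cpview history for the same period.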


Etheldra_Freder · ‎2019-01-30

Good afternoon, I wanted to give you all an update on this issue.

My co-worker looked at the switch and saw lots of errors (up/down). When he researched the error, it turned out to be a known bug. I forgot the name of it, but it had something to do with the switch not allowing ARP requests. There was a fix for the bug; however, it had not been applied to this particular switch. Once that was done, the symptoms on the firewalls changed. When there is a change in the sync port (up/down), the cluster now fails over as it should.
We are still experiencing flapping on eth5, but we think it is due to a bad cable.

Thank you everyone for your input.
