Interface Instability Causing Cluster Failover

Daniel_Zenczak · ‎2019-02-04

First time caller.

We are running a clustered HA pair of 13000 gateways on R77.30, managed by an R80.10 server. Since about March 2018, we have been seeing the gateways fail over during policy pushes. We could actually force the interfaces to fail by doing a policy push. The push caused the CPU associated with the worker to hit >100%. That CPU had the same affinity as an interface, and the interface would crash. Sometimes this happened on the standby member, sometimes on the active member.

To mitigate the issue in the meantime, we did policy pushes during off-work hours, with no load on the firewall. We still saw failures.

Around October 2018, the failures became more frequent and we started working more closely with Check Point technicians. They suggested a series of fixes, and we have implemented a few of them: dynamic dispatcher, edit freeze state, and the CPU stability hotfix (can be found here https://community.checkpoint.com/message/28542-clusterxl-improved-stability-hotfix). None of them has fully resolved the issue. After installing the stability hotfix, we stopped seeing failovers during policy pushes, but now the cluster fails over randomly. At this point, even our sales engineer is saying "Post on Checkmates" to see if anyone else is having these issues.

I am open to suggestions, questions, queries and answers. Here is a high-level list of the technicians' suggestions and where each one stands.

CPU stability hotfix: implemented Saturday, January 26, 2019
Dynamic Dispatcher: implemented November 29, 2018
Edit freeze state: implemented Thursday, January 31, 2019
Increase CCP timers: implementation TBD
Keep all connections during policy push
Increase Rx-ringsize: implementation TBD
Rulebase optimization: implementation TBD
IPS protections optimization: implementation TBD
Further optimizations via SK92348: implementation TBD
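
The post above describes a worker CPU that shares its affinity with an interface. As a rough way to see which CPUs are servicing each interface's interrupts, here is a minimal Python sketch. It assumes the gateway exposes the standard Linux /proc/interrupts file (Gaia is Linux-based, but this has not been verified on R77.30). The function name `irq_cpus_by_device` and the sample text are made up for illustration; this is not a Check Point tool.

```python
from typing import Dict, List, Set

# Made-up sample in the layout of /proc/interrupts (not real gateway output).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 24:       1500          0          0          0   PCI-MSI-edge      eth0
 25:          0          0       9000          0   PCI-MSI-edge      eth1
NMI:          0          0          0          0   Non-maskable interrupts
"""


def irq_cpus_by_device(interrupts_text: str) -> Dict[str, List[int]]:
    """Map each device label (last column) to the CPU indexes that have
    serviced at least one of its interrupts."""
    lines = [ln for ln in interrupts_text.splitlines() if ln.strip()]
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    seen: Dict[str, Set[int]] = {}
    for line in lines[1:]:
        fields = line.split()
        # Keep only numbered IRQ rows; skip summary rows such as NMI or ERR.
        if not fields[0].rstrip(":").isdigit():
            continue
        # A usable row has the IRQ id, one counter per CPU, and a label.
        if len(fields) < n_cpus + 2:
            continue
        counts = fields[1:1 + n_cpus]
        if not all(c.isdigit() for c in counts):
            continue
        busy = {i for i, c in enumerate(counts) if int(c) > 0}
        seen.setdefault(fields[-1], set()).update(busy)
    return {dev: sorted(cpus) for dev, cpus in seen.items()}


print(irq_cpus_by_device(SAMPLE))
```

On a live system the input would be the contents of /proc/interrupts rather than `SAMPLE`. If the CPU listed for a busy interface is also the one assigned to a firewall worker, that would be consistent with the overlap described in the post, though it would not prove it is the root cause.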

HristoGrigorov · ‎2019-02-04

These "fixes" sound more like workarounds. The proper way would be to find the root cause and fix it. Are these security gateways virtual machines or servers? Also, how are they connected to the rest of the network? Is it a multicast cluster?

Corporacion_Ame · ‎2019-02-06

Dear all,

Thanks for sharing your experience.

I would like to share mine, because we are experiencing the same problems in a very similar scenario and TAC has not found the root cause.

We have an open server firewall cluster (HP ProLiant DL380 G8) running ClusterXL on R77.30, managed by a virtual R80.20 management server with ample resources.

Here is a brief summary of the problem history:

October 2018

We started having problems installing policies. The error message was "Operation incomplete due to timeout". The active node froze completely (we verified this through iLO; the server would not allow a login) and never failed over to the other node.

We had to manually force the standby node to become active. For the frozen node, the only fix was a forced reboot followed by a policy reinstall.

These firewalls were running R77.30 with Take 317 and had Dynamic Dispatcher enabled. CPU load was normally between 40% and 60%.

The problem had occurred a few times before, but toward the end of the year it became more frequent. One thing we noticed: when we installed policy while pinging a host through this firewall, latency was much higher and a percentage of the packets were lost.

TAC suggested that we install Take 344 (Ongoing) because it includes the ClusterXL stability feature.

January 2019

For a week after installing Take 344 we had no problems and noticed a slight improvement in performance when installing policies, but then the same freeze problem occurred AGAIN.

The only difference this time was that failover worked correctly, so there was no loss of service.

Finally, TAC suggested that we follow SK31511, because they did not see any "core dumps" or logs related to the "freeze" problem.

The TAC case has been open for 4 months. It has gone through several engineers and been escalated to senior levels, and nobody has told us what the root cause is. The suggestion is always to update to the latest jumbo or to version R80.10, with a PC connected 24 hours a day through a serial cable and the kdb option enabled until the next freeze.

Please, if anyone has a solution, we would be very grateful.

Regards
