First time caller.
We are running a clustered HA pair of 13000 gateways on R77.30, managed by an R80.10 server. Since about March 2018 we have been seeing the gateways fail over due to policy pushes. We could actually force the interfaces to fail by doing a policy push. The push caused the CPU associated with the worker to hit >100%. That CPU would have the same affinity as an interface, and would crash the interface. Sometimes this happened on the standby member, sometimes on the active member. To mitigate the issue in the meantime, we did policy pushes outside working hours, with no load on the firewall. We would still see failures.

Around October 2018 we started to see this more frequently and began working more closely with Check Point technicians. They suggested a series of fixes. We have implemented a few of their suggestions: dynamic dispatcher, edit freeze state, and the CPU stability hotfix (can be found here https://community.checkpoint.com/message/28542-clusterxl-improved-stability-hotfix). None of them seems to have addressed the issue. After installing the stability hotfix, we stopped seeing failovers during policy pushes, but now the cluster fails over randomly. At this point, even our sales engineer is saying "Post on CheckMates" to see if anyone else is having these issues.
I am open to suggestions, questions and answers. Here is a high-level list of the suggestions by the technician.
These "fixes" sound more like workarounds. The proper way would be to find the root cause and fix it. Are these security gateways virtual machines or servers? Also, how are they connected to the rest of the network? Is it a multicast cluster?
Dear all,
Thanks for sharing your experience.
I would like to share mine, because we are experiencing the same problems in a very similar scenario and TAC has not found the root cause.
We have an open server firewall cluster (HP ProLiant DL380 G8) running ClusterXL on R77.30, managed by a virtual R80.20 management server with plenty of resources.
Here is a brief summary of the problem history:
October 2018
We started having problems installing policies. The error message was "Operation incomplete due to timeout". The active node was completely frozen (we verified this through iLO; the server would not accept a login) and never failed over to the other node.
We had to manually force the standby node to become active. On the frozen node, the only solution was a forced reboot followed by a policy reinstall.
These firewalls were running R77.30 with Take 317 installed and Dynamic Dispatcher enabled, and the CPUs were normally at between 40 and 60% load.
The problem had occurred a few times before, but towards the end of the year it became more frequent. One thing we noticed: if we pinged a host through this firewall while a policy was being applied, latency was very high and a percentage of packets was lost.
TAC suggested that we install Take 344 (Ongoing) because it included the ClusterXL stability feature.
January 2019
After a week with Take 344 installed and no problems, we noticed a slight improvement in performance when installing policies, but then the same freeze problem happened AGAIN.
The only difference this time was that the failover worked correctly, so there was no loss of service.
Finally, TAC suggested that we follow SK 31511, because they did not see any core dumps or logs related to the freeze.
The TAC case has been open for 4 months. It has gone through several engineers and been escalated to senior levels, and nobody has told us what the root cause is. The suggestion is always to update to the latest Jumbo or to version 80.10, with a PC connected 24 hours a day through a serial cable and the kdb option enabled until the next freeze.
Please, if anyone has a solution, we would be very grateful.
Regards