Hi, we had two incidents over the past 3 months where the primary cluster member hit high CPU (98%+) on all cores, dropping connections and causing general network chaos for several hours. The problem was that it stayed 'alive enough' to respond to CoreXL heartbeats and remained active for 3 hours until it finally failed over. The load was so bad we couldn't even establish an SSH session to it.
Is there any way to have this type of resource exhaustion raise a pnote and trigger a failover?
They are R81.20, will be patching to latest jumbo this week.
I know this may sound silly, but I have seen cases where the CPU is at 99% and failover does NOT happen. Honestly, I have no clue whether there is an official "threshold" for things like this; I have never seen one. Not sure if updating to jumbo 89 (the recommended take) would fix your issue, but it's worth a try.
Andy
That's not silly at all, it's exactly what we saw as well: some cores, particularly core 0, were at 99%.
It could be done via a bash script but I don't really want to go down that path.
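For anyone who does want to explore the script route, here is a minimal sketch in Python rather than bash. It only covers detection: it samples per-core counters from /proc/stat and reports a problem when every core stays above a threshold for several consecutive samples. The names `SustainedLoadDetector`, `watch`, and `ACTION`, the 95% threshold, and the sample count are all assumptions made for illustration; nothing here is a Check Point API, and the action is left as a harmless placeholder.

```python
import subprocess
import time
from collections import deque
from typing import Dict, Tuple

# Placeholder action (assumption): replace only after agreeing on a response
# with whoever supports the cluster.
ACTION = ["echo", "sustained CPU exhaustion detected"]


def parse_proc_stat(text: str) -> Dict[str, Tuple[int, int]]:
    """Return {core: (busy_jiffies, total_jiffies)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values[:8])  # guest time is already included in user time
        cores[fields[0]] = (total - idle, total)
    return cores


def busy_percent(before, after) -> Dict[str, float]:
    """Per-core busy percentage between two parse_proc_stat() snapshots."""
    result = {}
    for name, (busy1, total1) in after.items():
        if name not in before:
            continue
        busy0, total0 = before[name]
        if total1 > total0:
            result[name] = 100.0 * (busy1 - busy0) / (total1 - total0)
    return result


class SustainedLoadDetector:
    """True only once all cores exceed the threshold for N samples in a row."""

    def __init__(self, threshold: float = 95.0, required_samples: int = 6):
        self.threshold = threshold
        self.window = deque(maxlen=required_samples)

    def update(self, per_core: Dict[str, float]) -> bool:
        overloaded = bool(per_core) and all(
            p >= self.threshold for p in per_core.values()
        )
        self.window.append(overloaded)
        return len(self.window) == self.window.maxlen and all(self.window)


def watch(interval: float = 10.0, detector=None, action=ACTION) -> None:
    """Poll /proc/stat until the detector fires, then run the action once."""
    detector = detector or SustainedLoadDetector()
    with open("/proc/stat") as f:
        previous = parse_proc_stat(f.read())
    while True:
        time.sleep(interval)
        with open("/proc/stat") as f:
            current = parse_proc_stat(f.read())
        if detector.update(busy_percent(previous, current)):
            subprocess.run(action, check=False)
            return
        previous = current
```

Requiring several consecutive samples is meant to avoid reacting to short spikes. Keep in mind that a box too loaded to accept SSH may also starve a watchdog running on it, so a check like this may be more reliable when driven from outside the gateway.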
For what it's worth, I have seen cases where, if this happened on 15000 series appliances, failover did NOT happen, but on 6000 series it did, so clearly it has to do with how powerful the appliances are or how many cores they have.
Andy
It depends on why the CPU is at 99%. If it's at 99% because of a load issue, the load simply moves to the other cluster member upon failover and nothing is resolved. In fact, it's probably made worse by the extra overhead of the failover itself, causing a bigger outage. So we don't necessarily want to code in load-related failovers.
If you have out-of-band access to the gateway (LOM/console) set up, you may have an easier time getting to the CLI to check things out, as you don't have to try to wrestle an SSH connection in. Worst case, if you have LOM access, you can power cycle the gateway to force the failover.
In R82, setting up an ElasticXL cluster could also help, as you will want to size a 2-gateway EXL cluster such that neither gateway is utilised more than 40%, to maintain HA. This way, a load-related utilisation spike is absorbed by the extra headroom. You may also find that if at least one member is behaving nicely, you can set the other one down from there, depending on the circumstances.
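The 40% figure can be checked with a little arithmetic. The sketch below assumes load is shared evenly across members and moves linearly to the survivors when one member is lost, which is a simplification; `survivor_utilisation` is a hypothetical helper, not part of any Check Point tooling.

```python
def survivor_utilisation(per_member_percent: float, members: int) -> float:
    """Estimated utilisation of each remaining member after losing one.

    Assumes even load sharing and linear scaling (a simplification).
    """
    if members < 2:
        raise ValueError("need at least two members to survive a failure")
    total_load = per_member_percent * members
    return total_load / (members - 1)


# Two members at 40% each: the survivor carries about 80%.
print(survivor_utilisation(40, 2))
# Two members at 60% each: the survivor would need about 120%, i.e. no HA.
print(survivor_utilisation(60, 2))
```

Under these assumptions, sizing each member at 40% leaves the survivor at roughly 80%, with some margin left for spikes.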
Yes, that is a fair call, the load may just get shifted around. In this case it wasn't traffic-generated load, but I do understand the chaos of flipping back and forth if it was.
As these are VMs, we only have console access via vSphere, which we also could not get into due to the network outage (working on ways to get around that for next time).
Maybe shut a switch port that is connected to this firewall? If the active unit has fewer interfaces up, it will fail over.
I have never seen failover based on high load and would not recommend it.
Maybe this is something:
Management Data Plane Separation (MDPS)
https://support.checkpoint.com/results/sk/sk138672
Hi @Ryan_Ryan
Check Point CUL (Cluster Under Load) mode
As far as I know, there is a CPU threshold at 80% where CUL mode is enabled. While in this mode, the cluster state freezes.
Check this SK: https://support.checkpoint.com/results/sk/sk92723
To summarize, I don't think this kind of situation triggers a failover.
@the_rock What is your opinion?
Akos
Well, makes sense, though I don't know if there is an official article or statement anywhere that says what the thresholds are for clustering processes to trigger failover (i.e. the processes listed by the command cphaprob -l list).
Because, let's be realistic and logical... IF CPU reaches, say, 80%, to me personally that's a good enough reason for the firewall to fail over. Let's be honest, I'm sure an IT admin at a big bank would not feel overly comfortable having a firewall at 80% CPU load keep processing traffic for a very long time...
But again, just my thinking.
Andy