Ryan_Ryan
Advisor

High CPU but no failover

Hi, we had two incidents over the past 3 months where the primary cluster member hit high CPU (98%+) on all cores, dropping connections and causing general network chaos for several hours. The problem was that it stayed 'alive enough' to respond to CoreXL heartbeats and remained active for 3 hours until it finally failed over. The load was so bad we couldn't even establish an SSH session to it.

Is there any way we can have this type of resource exhaustion cause a pnote and failover?

 

They are on R81.20; we will be patching to the latest jumbo this week.

0 Kudos
8 Replies
the_rock
Legend

I know this may sound silly, but I have seen cases where CPU is at 99% and failover does NOT happen. Honestly, I have no clue whether there is an official "threshold" for things like this; I have never seen one. Not sure if updating to jumbo 89 (recommended take) would fix your issue, but it's worth a try.

Andy

0 Kudos
Ryan_Ryan
Advisor

That's not silly at all, it's exactly what we saw as well. Some cores, particularly core 0, were at 99%.

It could be done via a bash script but I don't really want to go down that path. 
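For reference, a minimal sketch of what such a watchdog might look like, written in Python rather than bash so the decision logic is easy to test. It samples `/proc/stat`, and only after CPU has stayed above a threshold for a sustained window does it report a user-defined pnote as `problem`. The device name `cpu_guard`, the 95% threshold and the 30-sample window are hypothetical, and the `cphaprob ... register` / `report` syntax should be checked against the ClusterXL documentation for your version. Bear in mind that on a box at 98%+ on all cores the watchdog itself may not get scheduled, and that a load-driven failover may simply move the problem to the other member.

```python
import subprocess
import time

DEVICE = "cpu_guard"  # hypothetical pnote name, not a built-in device


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    values = [int(v) for v in line.split()[1:9]]  # user .. steal
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    return idle, sum(values)


def busy_percent(prev, curr):
    """CPU busy percentage between two (idle, total) samples."""
    idle_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def sustained_high(samples, threshold=95.0, required=30):
    """True only if the last `required` samples are all at or above threshold."""
    recent = samples[-required:]
    return len(recent) == required and all(s >= threshold for s in recent)


def pnote_command(state):
    """Build the command that reports the pnote state ('ok' or 'problem')."""
    return ["cphaprob", "-d", DEVICE, "-s", state, "report"]


def monitor(interval=10, threshold=95.0, required=30):
    """Sample CPU every `interval` seconds and report the pnote state on change."""
    # Register the pnote once; -t 0 means it never times out on its own.
    subprocess.run(["cphaprob", "-d", DEVICE, "-t", "0", "-s", "ok", "register"])
    samples, reported = [], "ok"
    with open("/proc/stat") as f:
        prev = parse_cpu_line(f.readline())
    while True:
        time.sleep(interval)
        with open("/proc/stat") as f:
            curr = parse_cpu_line(f.readline())
        samples = (samples + [busy_percent(prev, curr)])[-required:]
        prev = curr
        state = "problem" if sustained_high(samples, threshold, required) else "ok"
        if state != reported:
            subprocess.run(pnote_command(state))
            reported = state
```

With the defaults above, `monitor()` would only flag the member after roughly five minutes of continuous load, which is meant to avoid flapping on short spikes. It is deliberately not called at import time; it would need to be started by whatever service wrapper you trust on the gateway.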

0 Kudos
the_rock
Legend

For what it's worth, I have seen cases where this happened on 15000 series appliances and failover did NOT happen, whereas on a 6000 series it would have. So it clearly has to do with how powerful the appliances are or how many cores they have.

Andy

0 Kudos
emmap
Employee

It depends on why the CPU is at 99%. If it's at 99% because there's a load issue, the load simply moves to the other cluster member upon failover and nothing is resolved - in fact it's probably made worse due to the extra overhead of the failover occurring, causing a bigger/worse outage. So we don't necessarily want to code in load related failovers. 

I would suggest that if you have out-of-band access to the gateway (LOM/console) set up, you may have an easier time getting into the CLI to check things out, as you don't have to try to wrestle an SSH connection in. Worst case, if you have LOM access, you can power cycle the gateway to force the failover.

In R82, setting up an ElasticXL cluster could also help, as you will want to size a 2-gateway EXL cluster such that neither gateway is utilised more than 40%, to maintain HA. This way, a load related resource utilisation spike is absorbed by having extra overhead there. You also may find that if you have at least one member behaving nicely you can set the other one down from there, depending on the circumstance. 

0 Kudos
Ryan_Ryan
Advisor

Yes, that is a fair call: the load may just get shifted around. In this case it wasn't traffic-generated load, but I do understand the chaos of flipping back and forth if it were load-generated.

 

As these are VMs, we only have console access via vSphere, which we also could not get into due to the network outage (we are working on ways to get around that for next time).

 

 

0 Kudos
Lesley
Leader

Maybe shut a switch port that is connected to this firewall? If the active unit has fewer interfaces up, it will fail over.

I have never seen failover based on high load and would not recommend it.

Maybe this is something:

Management Data Plane Separation (MDPS)

https://support.checkpoint.com/results/sk/sk138672

0 Kudos
AkosBakos
Advisor

Hi @Ryan_Ryan 

Check Point CUL mode (Cluster Under Load)

As far as I know, there is a threshold at 80% CPU above which CUL mode is enabled. While in this mode, the cluster state is frozen.

Check this SK: https://support.checkpoint.com/results/sk/sk92723

To summarize, I don't think this kind of situation triggers a failover.

@the_rock What is your opinion?

Akos

0 Kudos
the_rock
Legend

Well, that makes sense, though I don't know if there is an official article or statement somewhere that says what thresholds exist for clustering processes to trigger failover (i.e. the processes listed by the command cphaprob -l list).

Because, let's be realistic and logical... if CPU reaches, say, 80%, to me personally that's a good enough reason for the firewall to fail over. Let's be honest, I'm sure an IT admin for a big bank would not feel overly comfortable having a firewall at 80% CPU load keep processing traffic for a very long time...

But again, just my thinking.

Andy

0 Kudos
