High CPU but no failover

Ryan_Ryan · ‎2024-10-28

Hi, over the past 3 months we had two incidents where the primary cluster member ran at high CPU (98%+) on all cores, dropping connections and causing general network chaos for several hours. The problem was that it stayed 'alive enough' to respond to ClusterXL heartbeats and remained active for 3 hours before it finally failed over. The load was so bad we couldn't even establish an SSH session to it.

Is there any way we can have this type of resource exhaustion raise a pnote and cause a failover?

They are on R81.20, and we will be patching to the latest jumbo this week.

the_rock · ‎2024-10-28

I know this may sound silly, but I have seen cases where CPU is at 99% and failover does NOT happen. Honestly, I have no clue if there is an official "threshold" for things like this; I have never seen one. Not sure if updating to jumbo 89 (recommended take) would fix your issue, but it's worth a try.

Andy

Ryan_Ryan · ‎2024-10-28

That's not silly at all, it's exactly what we saw as well: some cores, particularly core 0, were at 99%.

It could be done via a bash script but I don't really want to go down that path.
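For anyone who does want to try the script route, here is a minimal sketch of the idea, written in Python rather than bash so the decision logic is easy to test. It assumes a Linux-style /proc/stat; the threshold, sample count, interval, and the `action` callback are all hypothetical placeholders, not Check Point defaults. Note that under the kind of load described above, the watchdog itself may be starved of CPU, so treat this as an illustration only.

```python
import time
from typing import Callable, Iterable, Tuple


def read_cpu_times(path: str = "/proc/stat") -> Tuple[int, int]:
    """Return (busy, total) jiffies from the aggregate 'cpu' line."""
    with open(path) as f:
        fields = f.readline().split()
    # user nice system idle iowait irq softirq steal (guest fields skipped,
    # since they are already included in user/nice)
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    total = sum(values)
    return total - idle, total


def busy_percent(prev: Tuple[int, int], curr: Tuple[int, int]) -> float:
    """CPU busy percentage between two (busy, total) readings."""
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (curr[0] - prev[0]) / total_delta


def should_trip(samples: Iterable[float], threshold: float, required: int) -> bool:
    """True once `required` consecutive samples are at or above `threshold`."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if sample >= threshold else 0
        if streak >= required:
            return True
    return False


def watch(action: Callable[[], None], threshold: float = 95.0,
          required: int = 30, interval: float = 10.0) -> None:
    """Call `action` once after `required` consecutive busy intervals."""
    prev = read_cpu_times()
    streak = 0
    while True:
        time.sleep(interval)
        curr = read_cpu_times()
        streak = streak + 1 if busy_percent(prev, curr) >= threshold else 0
        prev = curr
        if streak >= required:
            action()
            return
```

With the placeholder defaults, `watch(action=...)` would fire only after roughly five minutes of sustained load (30 samples, 10 seconds apart), which avoids reacting to short spikes. On a ClusterXL member the action might shell out to something like `clusterXL_admin down`, but that wiring is an assumption to verify in a lab first, and the caveat about load simply moving to the other member still applies.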

the_rock · ‎2024-10-29

For what it's worth, I have seen cases where this happened on 15000 series appliances and failover did NOT happen, whereas on 6000 series it did, so it seems to depend on how powerful the appliances are or how many cores they have.

Andy

emmap · ‎2024-10-28

It depends on why the CPU is at 99%. If it's at 99% because there's a load issue, the load simply moves to the other cluster member upon failover and nothing is resolved - in fact it's probably made worse due to the extra overhead of the failover occurring, causing a bigger/worse outage. So we don't necessarily want to code in load related failovers.

I would suggest that if you have out-of-band access to the gateway (LOM/console) set up, you may have an easier time getting into the CLI to check things out, as you don't have to try to wrestle an SSH connection in. Worst case, if you have LOM access you can power cycle the gateway to force the failover.

In R82, setting up an ElasticXL cluster could also help, as you will want to size a 2-gateway EXL cluster such that neither gateway is utilised more than 40%, to maintain HA. This way, a load-related utilisation spike is absorbed by the spare headroom. You may also find that if you have at least one member behaving nicely you can set the other one down from there, depending on the circumstance.
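To make the 40% figure concrete, here is a rough back-of-the-envelope sketch. It assumes load is spread evenly across members and scales linearly when one member drops out, which real traffic will only approximate; the function name is just for illustration.

```python
def survivor_utilisation(members: int, per_member_pct: float) -> float:
    """Utilisation of each remaining member after one member fails.

    Assumes even load distribution and linear scaling (a simplification).
    """
    if members < 2:
        raise ValueError("need at least two members to survive a failure")
    return per_member_pct * members / (members - 1)


# Two gateways at 40% each: the survivor carries 80% and still has headroom.
print(survivor_utilisation(2, 40.0))
```

Under the same assumptions, two gateways running at 60% each would leave the survivor at 120%, meaning it could not carry the full load alone.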

Ryan_Ryan · ‎2024-10-28

Yes, that is a fair call, the load may just get shifted around. In this case it wasn't traffic-generated load, but I do understand the chaos of flipping back and forth if it was.

As these are VMs, we only have console access via vSphere, which we also could not get into due to the network outage (working on ways to get around that for next time).

Lesley · ‎2024-10-29

Maybe shut a switch port that is connected to this firewall? If the active unit has fewer interfaces up, it will fail over.

I have never seen failover based on high load and would not recommend it.

Maybe this is something:

Management Data Plane Separation (MDPS)

https://support.checkpoint.com/results/sk/sk138672


AkosBakos · ‎2024-10-29

Hi @Ryan_Ryan

Check Point CUL (Cluster Under Load) mode

As far as I know, there is a threshold at 80% where CUL mode is enabled. During this mode, the cluster state freezes.

Check this SK: https://support.checkpoint.com/results/sk/sk92723

To summarize, I don't think this kind of situation triggers a failover.

@the_rock What is your opinion?

Akos

----------------
\m/_(>_<)_\m/

the_rock · ‎2024-10-29

Well, that makes sense, though I don't know if there is an official article or statement somewhere that says what thresholds there are for clustering processes to trigger failover (i.e. the processes listed by the command cphaprob -l list).

Because, let's be realistic and logical... IF CPU reaches, say, 80%, to me personally that's a good enough reason for the fw to fail over. I'm sure an IT admin for a big bank would not feel overly comfortable having a fw under 80% CPU load keep processing traffic for a very long time...

But again, just my thinking.

Andy
