Cluster member at 100% CPU

Hello Experts, 

I have upgraded a Check Point 4600 cluster from R80.10 to R80.30 (as these gateway appliances are undersized, we didn't go for R80.40).

After a successful upgrade of both cluster members, the secondary member went up to 100% CPU utilization and has stayed there since. Check Point TAC suggested applying a hotfix to see if that resolves the issue, but since the member is at 100% CPU, it's not allowing us to download or install the hotfix.

The top command shows a high number of clish jobs running. We tried to kill some of these jobs, but they keep respawning.

This node is currently standby and is not interrupting traffic, but this is really concerning as this is a production cluster.

Any help will be appreciated.

I've seen this type of behavior before; it sounds like a Gaia configuration database that is corrupt or has grown too large. The Gaia database is separate from the /config/active file, which contains a text version of your Gaia OS config. Older versions had issues with the Gaia database size growing out of control.

See Scenario 2 here which details how to rebuild the Gaia database (you won't lose your Gaia config):

If that doesn't help, there are a variety of other conditions that can cause this. Search for the string "Timeout waiting for response from database server" in SecureKnowledge and you'll get plenty of hits.


I know you said there is a high number of clish jobs running, but is any one process in particular driving the CPU, judging from ps -auxw or top? Also, what happens if you reboot the box? Not sure if you have tried that. I do get that the 4600 might be undersized to properly run even R80.30, though it is a bit peculiar that only one member has this issue, since both are the same hardware.
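If it helps to see at a glance which process name is eating the CPU, here is a rough sketch that groups ps aux-style output by command name and adds up the %CPU per name. The function name summarize_cpu and the SAMPLE text are my own invention for illustration; the sample rows are made-up numbers, not output from your gateway, and the parsing assumes the usual 11-column ps aux layout.

```python
from collections import defaultdict


def summarize_cpu(ps_output, top_n=5):
    """Group `ps aux`-style lines by command name.

    Returns a list of (name, process_count, total_cpu_percent) tuples,
    sorted by total CPU, highest first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # name -> [count, cpu sum]
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        # Assumed layout: 10 fixed columns, then COMMAND (which may contain spaces)
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # ignore lines that do not match the assumed layout
        cpu = float(fields[2])
        # First word of COMMAND, with any leading directory stripped
        name = fields[10].split()[0].rsplit("/", 1)[-1]
        totals[name][0] += 1
        totals[name][1] += cpu
    ranked = sorted(
        ((name, count, round(cpu, 1)) for name, (count, cpu) in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )
    return ranked[:top_n]


# Invented sample for illustration only; NOT real output from the affected member.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
admin     4101 24.5  0.4  51234  8120 ?        R    10:01   3:12 clish -c show uptime
admin     4102 23.9  0.4  51230  8116 ?        R    10:01   3:10 clish
admin     4107 25.1  0.4  51240  8124 ?        R    10:02   3:05 /bin/clish
admin     2210 12.0  1.1  90112 22400 ?        S    09:40   1:47 confd
admin     3300  1.5  2.3 180224 46000 ?        S    09:41   0:20 fwd
"""

for name, count, cpu in summarize_cpu(SAMPLE):
    print(f"{name:10s} procs={count:3d} cpu={cpu:6.1f}%")
```

To use it on real data, you would save the output of ps -auxw from the affected member to a file and pass its contents to summarize_cpu in place of SAMPLE. Running it off the box is probably the safer choice given the member is already pegged.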

