Ryan_Ryan
Advisor

High CPU on Java thread

I know this has been asked to death, but we are running R81.10 T45 and most of the KBs I found are marked as fixed.

 

We have a manager running on a VM with 6 cores and 16 GB of RAM. CPU usage shows as 100% in SmartConsole all the time, and the Check Point is firing SNMP traps about CPU usage exceeding thresholds. I rebooted, and the CPU was high again within just a few minutes.

 

 
 

cpu.PNG

9 Replies
the_rock
Champion

Yes, you are correct, it has been asked to death, and you are also correct that most SKs about it show as resolved. I can only tell you from personal experience that I helped one customer fix this issue by increasing the RAM to 32 GB, and they never had this problem again.

Would that work for you? I have no clue, but it is worth trying if you can.

Ryan_Ryan
Advisor

Yep, I can do that. Will give it a try and report back!

Chris_Atkinson
Employee

Agreed, the output shows that swap is being used, so for a VM this is a relatively quick win/test.

Beyond that, we likely need to better understand the context of the issue: is this a recent upgrade, how many gateways is it supporting, etc.

Timothy_Hall
Champion

That is way too many postgres instances for 6 cores and 16 GB of RAM. The high CPU is not caused by log indexing, as the busy processes are not renice'd to minimum priority. The hypervisor is not stealing CPU cycles (0% st) and there is no hard drive contention (0% wa). Looks like some kind of database issue to me; try purging some old revisions: sk172473: Multiple Postgres Processes Consume 100% CPU on a Management Server.
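For anyone who wants to run the same st/wa check without eyeballing a screenshot, here is a minimal Python sketch. It assumes the usual CPU summary line from top (either the "96.1 us" or the older "96.1%us" layout). The sample line is invented apart from the 0% st and 0% wa values mentioned above, and the 5% threshold and the function names are my own assumptions, not anything from Check Point.

```python
import re

# Hypothetical sample of a top CPU summary line. Only the 0.0 wa and 0.0 st
# values come from this thread; the other numbers are made up for illustration.
SAMPLE = "%Cpu(s): 96.1 us,  3.2 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"


def parse_top_cpu_line(line):
    """Return a dict such as {'us': 96.1, 'wa': 0.0, ...} from a top CPU line."""
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s*%?\s*([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


def cpu_hints(fields, threshold=5.0):
    """List likely external causes; the threshold is an arbitrary assumption."""
    hints = []
    if fields.get("st", 0.0) >= threshold:
        hints.append("hypervisor steal")
    if fields.get("wa", 0.0) >= threshold:
        hints.append("I/O wait")
    return hints


fields = parse_top_cpu_line(SAMPLE)
print(fields["st"], fields["wa"], cpu_hints(fields))
```

An empty hint list only rules out steal and disk wait; it says nothing about which process is actually burning the CPU.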

New 2021 IPS/AV/ABOT Immersion Self-Guided Video Series
now available at http://www.maxpowerfirewalls.com
Ryan_Ryan
Advisor

You might be on to something there. We had a large number of revisions (around 500, at a guess). I have cleared them out now and the number of postgres instances has definitely dropped. I can't say for sure yet whether it's fixed; I will need to wait 24 hours to verify.

Thanks!

Ryan_Ryan
Advisor

Hi Tim,

 

After clearing the revisions, it worked well again, but the issue returned after only a few hours. Now, a few days later, I already have 32000 revisions again, all of them coming from the web_api user (scanning the cloud environments, I suspect).

Limiting the number of revisions would fix it, but the concern is that we would lose genuine revisions if we ever need to roll back. Do you know of any way to remove revisions only for a specific user?

 

Timothy_Hall
Champion

The behavior of the code accessing the web_api needs to be corrected; there is no way it should legitimately be creating that many revisions. It is almost as if it is constantly publishing as a matter of course, even when it is not necessary. If changing the code is not feasible, you could remove write access from the account using the web_api. Assuming it doesn't actually need write access to perform its monitoring tasks, that will keep it from creating so many revisions.

You can configure automatic cleanup of revisions on the SMS via the API in the latest releases, but you are probably going to be fighting a losing battle against the needless revisions constantly being created via the web_api user, and there doesn't seem to be a way to clean up revisions only from a certain user:

sk170059: Automatic Revisions Purge: overview and how-to

https://sc1.checkpoint.com/documents/latest/APIs/index.html#cli/set-automatic-purge~v1.7%20

 

Ryan_Ryan
Advisor

Some further digging reveals this is not a custom script. It deployed itself once we added an NSX data centre, and I believe it runs every 30 seconds and scans all the CloudGuard gateways for identity information. However, I do not know exactly which script is responsible (plus I don't really want to modify a prebuilt script), and the user running the script is "cme", which we can't control.

 

According to the api.elg log, something calls these two APIs in this order for each of the gateway IPs (we have 30 NSX-T gateways):

http://127.0.0.1:62708/web_api/v1.2/set-generic-object
http://127.0.0.1:62708/web_api/v1.2/publish

I'm seeing about 8 publish events per minute.
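To put that rate in perspective, here is a rough Python back-of-envelope calculation. It assumes every publish call creates exactly one revision and that the rate stays steady at 8 per minute; both are assumptions on my part, and the function names are just for illustration.

```python
PUBLISHES_PER_MINUTE = 8      # observed in api.elg
OBSERVED_REVISIONS = 32_000   # revision count a few days after the purge
GATEWAYS = 30                 # NSX-T gateways being scanned


def revisions_per_day(per_minute):
    """Revisions created per day, assuming one revision per publish."""
    return per_minute * 60 * 24


def days_to_reach(target, per_minute):
    """Days needed to accumulate `target` revisions at a steady rate."""
    return target / revisions_per_day(per_minute)


def minutes_per_full_pass(gateways, per_minute):
    """Minutes for one pass over all gateways, assuming one publish each."""
    return gateways / per_minute


print(revisions_per_day(PUBLISHES_PER_MINUTE))                           # 11520
print(round(days_to_reach(OBSERVED_REVISIONS, PUBLISHES_PER_MINUTE), 1)) # 2.8
print(minutes_per_full_pass(GATEWAYS, PUBLISHES_PER_MINUTE))             # 3.75
```

So roughly 11,500 revisions a day, which lines up with reaching 32000 in under three days.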

 

Timothy_Hall
Champion

Looks like it is time for a TAC case, as that behavior by the script is unacceptable.
