Re: High CPU on Java thread

Ryan_Ryan · ‎2022-07-18

I know this has been asked to death, but we are running R81.10 T45 and most of the KBs I found are marked as fixed.

We have a manager running on a VM with 6 cores and 16 GB of RAM. CPU usage shows as 100% in SmartConsole all the time, and the Check Point is firing SNMP traps about CPU usage exceeding thresholds. I did a reboot and the CPU was high again within just a few minutes.

the_rock · ‎2022-07-18

Yes, you are correct, it has been asked to death, and you are also correct that most SKs about it show as resolved. I can only tell you from personal experience that I helped one customer fix this issue by upping the RAM to 32 GB, and they never had this problem again.

Would that work for you? I have no clue, but worth trying if you can.

Best,
Andy

Ryan_Ryan · ‎2022-07-18

Yep I can do that, will give it a try and report back!

Chris_Atkinson · ‎2022-07-18

Agreed, the output shows that swap is being used so for a VM this is a relatively quick win/test.

Beyond that, we likely need to better understand the context of the issue: is this a recent upgrade, how many gateways is it supporting, etc.?

CCSM R77/R80/ELITE

Timothy_Hall · ‎2022-07-19

That is way too many postgres instances for 6 cores and 16GB of RAM. The high CPU is not caused by log indexing as the busy processes are not renice'd to minimum priority. Hypervisor is not stealing CPU cycles (0% st) and there is no hard drive contention (0% wa). Looks like some kind of database issue to me, try purging some old revisions: sk172473: Multiple Postgres Processes Consume 100% CPU on a Management Server.
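The checks above can be sketched in a few lines of Python. This is only an illustration of the reasoning: the function names, the 5% thresholds and the "more postgres processes than cores" rule are my own assumptions, not Check Point guidance, and the sample CPU line in the usage note is made up.

```python
import re

def parse_cpu_line(line):
    """Parse a top(1) CPU summary line into {field: percent}."""
    # Matches pairs such as "0.0 wa" or "0.0%st" (older top builds omit the space).
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s*%?\s*([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}

def diagnose(cpu, postgres_count, cores, threshold=5.0):
    """Return a list of suspects. The thresholds are illustrative assumptions."""
    notes = []
    if cpu.get("st", 0.0) > threshold:
        notes.append("hypervisor steal")
    if cpu.get("wa", 0.0) > threshold:
        notes.append("disk wait")
    if postgres_count > cores:
        notes.append("more postgres processes than cores")
    return notes
```

For example, feeding it a line such as "%Cpu(s): 97.1 us, 2.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st" together with 20 postgres processes on 6 cores rules out steal and disk wait and leaves only the postgres count as a suspect.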

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Ryan_Ryan · ‎2022-07-19

You might be on to something there. We had a large number of revisions (around 500 at a guess); I have cleared them out now and the number of postgres instances has definitely dropped. I can't say for sure yet if it's fixed; I will need to wait 24 hours to verify.

Thanks!

Ryan_Ryan · ‎2022-07-24

Hi Tim,

After clearing the revisions it worked well again; however, after a few hours the issue returned. Now, a few days later, I already have 32000 revisions again, all of them coming from the web_api user (scanning the cloud environments, I suspect).

Limiting the number of revisions would fix it, but the concern is that we will lose genuine revisions if we actually need to roll back. Do you know of any way to remove revisions only for a specific user?

Timothy_Hall · ‎2022-07-25

The behavior of the code accessing the web_api needs to be corrected; there is no way it should legitimately be creating that many revisions. It is almost as if it is constantly publishing as a matter of course, even when it is not necessary. If changing the code is not feasible, you could remove write access from the account utilizing the web_api; assuming it doesn't actually need write access to perform its monitoring tasks, that will keep it from creating so many revisions.

You can configure automatic cleanup of revisions on the SMS via the API in the latest releases, but you are probably going to be fighting a losing battle against the needless revisions being constantly created via the web_api user, and there doesn't seem to be a way to clean up revisions only from a certain user:

sk170059: Automatic Revisions Purge: overview and how-to

https://sc1.checkpoint.com/documents/latest/APIs/index.html#cli/set-automatic-purge~v1.7
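As a rough sketch of what a set-automatic-purge request body might look like, the Python below builds the JSON payload without sending it. The parameter names are as I recall them from the API reference linked above and should be verified there before use; the reference may also expect a scheduling object, which is left out here. The helper name build_purge_request is hypothetical.

```python
import json

def build_purge_request(keep_count=None, keep_days=None):
    """Build a set-automatic-purge body (parameter names are assumptions to verify)."""
    if keep_count is None and keep_days is None:
        raise ValueError("specify keep_count, keep_days, or both")
    payload = {
        "enabled": True,
        "keep-sessions-by-count": keep_count is not None,
        "keep-sessions-by-days": keep_days is not None,
    }
    if keep_count is not None:
        payload["number-of-sessions-to-keep"] = keep_count
    if keep_days is not None:
        payload["number-of-days-to-keep"] = keep_days
    return payload

def to_json(payload):
    """Serialize the body for use in an authenticated API call."""
    return json.dumps(payload, sort_keys=True)
```

Keeping by days rather than by count, for instance build_purge_request(keep_days=30), at least ties retention to age, though with revisions arriving this quickly it still does not separate genuine revisions from the automated ones.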

Ryan_Ryan · ‎2022-07-25

So some further digging reveals this is not a custom script; it deployed itself once we added an NSX data centre. I believe the script runs every 30 seconds and scans all the CloudGuard gateways for identity information. However, I do not know exactly which script is responsible (plus I don't really want to modify a prebuilt script), and the user running the script is "cme", which we can't control.

According to the api.elg log, something calls these two APIs in this order for each of the gateway IPs (we have 30 NSX-T gateways):

http://127.0.0.1:62708/web_api/v1.2/set-generic-object
http://127.0.0.1:62708/web_api/v1.2/publish

I'm seeing about 8 publish events per minute.
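For anyone wanting to tally the calls the same way, here is a minimal Python sketch that counts web_api commands by matching the URL in each log line. It deliberately ignores timestamps, since I am not assuming anything about the api.elg line format beyond the URLs shown above; the function name is hypothetical.

```python
import re
from collections import Counter

# Captures the command name after an optional version segment,
# e.g. "publish" from ".../web_api/v1.2/publish".
ENDPOINT = re.compile(r"/web_api/(?:v[\d.]+/)?([a-z][a-z-]*)")

def count_api_calls(lines):
    """Count web_api commands seen in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for command in ENDPOINT.findall(line):
            counts[command] += 1
    return counts
```

Passing an open file handle for api.elg to count_api_calls gives a per-command total; dividing the publish count by the number of minutes the log covers gives the publish rate.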

Timothy_Hall · ‎2022-07-26

Looks like it is time for a TAC case as that behavior by the script is unacceptable.
