Thomas_Eichelbu
Advisor

Management Server is stuck - user is unable to run any command, seen many times!

Hello team, 

Recently we ran into the same issue at three totally independent customers:


The MGMT server stops executing all kinds of Check Point commands.
Only "cpwd_admin list" worked, and it showed all processes as "E", not "T".

Even "reboot" or "init 6" stops working.
Only a power cycle via VMware or similar regains control.

If the MGMT is down, the Check Point CA is down, which is very unhealthy for all certificate-based VPN tunnels managed by the same MGMT.
"invalid certificate" messages are then shown in the log.

cpm.elg says:
08/06/24 08:09:44,534 ERROR tracker.dataLog.TrackerDataSenderSvcImp [taskExecutor-31]: AuditLogsToTrackerSender: Unable to connect fwm (down), Exception: Could not receive Message.

and it throws a ton of Java errors in cpm.elg.
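
To see when the fwm connection errors start piling up, the log can be summarized per hour. This is only a rough sketch: the function name `fwm_errors_per_hour` is made up for illustration, it assumes every line starts with a date and a time field like the sample line above, and the path `$FWDIR/log/cpm.elg` in the usage comment is an assumption about where the file lives on your system.

```shell
# Summarize "Unable to connect fwm" errors per hour from cpm.elg-style input on stdin.
# Assumes field 1 is the date and field 2 is the time (HH:MM:SS,mmm), as in the sample line.
fwm_errors_per_hour() {
    grep -F 'Unable to connect fwm' |
        awk '{ split($2, t, ":"); print $1 " " t[1] ":00" }' |
        sort | uniq -c
}

# Example (path is an assumption, adjust as needed):
#   fwm_errors_per_hour < "$FWDIR/log/cpm.elg"
```

Each output line is a count followed by the date and hour. The sort is lexical, so buckets are grouped correctly but not necessarily in chronological order across days.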


We saw this at three different customers, all on R81.20 HFA 53/65.

and yes there is that sk:
https://support.checkpoint.com/results/sk/sk173405

Killing "autoupdater" did not work for me ...

Has anybody noticed the same?
We have two CP cases ongoing!

best regards


40 Replies
Duane_Toler
Advisor

Yep, those 3 task IDs were the ones I found in the hotfix's "crs.xml" file:

$ cat crs/fw1/crs.xml 
<pkg display="'Check" module="fw1_wrapper" type="HF" version="R81_20_JHF_T53_483_MAIN" date="06/11/2024 14:58">
	<fix name="libEntMgrMgmtSync.so" crs="PRJ-44852,PMTR-96733,PMTR-100459"/>
</pkg>
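
If you want to check whether a given ID is listed in a hotfix's crs.xml without reading it by eye, something like the following could work. It is a sketch only: `list_crs_ids` is a hypothetical helper name, and it assumes the IDs appear as a comma-separated `crs="..."` attribute exactly as in the file shown above.

```shell
# Print every ID found in crs="..." attributes, one per line, from XML on stdin.
list_crs_ids() {
    grep -o 'crs="[^"]*"' | cut -d'"' -f2 | tr ',' '\n' | sort -u
}

# Example: succeeds only if the exact ID is listed in the file.
#   list_crs_ids < crs/fw1/crs.xml | grep -x 'PMTR-96733'
```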
Thomas_Eichelbu
Advisor

yes thank you @Duane_Toler 

But the issue we face is more than just a bunch of defunct cpd processes.
Some of our MGMT servers get totally stuck and do not execute any commands from the CLI.
They need a power cycle to come back to life.
It is possible that specific scripts we run at night cause those issues.
Those scripts run a lot of "cprid" commands.
But even if our scripts are the root cause, the issue has occurred ONLY on weekends, never during the week.

So fixing "defunct cpd" is good, but in our case it is just a sideshow.

best regards

Chris_Wilson
Contributor

Absolutely, the defunct cpd processes are something that leads to the problem. We too had a CMA that was essentially down, and VPN tunnels started going down because the remote firewall couldn't talk to the CMA to re-establish SIC, or whatever it does every 24 hours. We didn't have this problem until about May 24th, and we didn't install any hotfixes until the beginning of June, when the VPN vulnerability showed up, so applying the hotfix wasn't the cause of our issue.

Duane_Toler
Advisor

Do you have something that can monitor the process count, such as a Nagios plugin? During the week, log in and run a process check to see how it looks as time passes. You may also notice your CPU utilization increasing; I saw this on one of these hosts.

ssh expert_user@mgmt_host 'ps h -C cpd |grep -c defunct; uptime'

SSH keys will help make this fully automated, too.
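
For anyone who wants to turn the one-liner above into a Nagios-style check, here is a minimal sketch. The function names and the warning/critical thresholds (5 and 20) are hypothetical and not from any Check Point documentation, so tune them to what you see on a healthy host. It assumes a procps-style `ps` that accepts `-C` and `-o stat=,comm=`.

```shell
# Count zombie cpd processes from "STAT COMMAND" lines on stdin,
# e.g. the output of: ps -C cpd -o stat=,comm=
count_defunct_cpd() {
    awk '$1 ~ /^Z/ && $2 == "cpd" { n++ } END { print n + 0 }'
}

# Map a count to a Nagios-style status line and return code (0 OK, 1 WARNING, 2 CRITICAL).
# The default thresholds are illustrative assumptions only.
check_defunct_cpd() {
    local count=$1 warn=${2:-5} crit=${3:-20}
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count defunct cpd processes"; return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count defunct cpd processes"; return 1
    fi
    echo "OK - $count defunct cpd processes"; return 0
}

# Example on the management server (expert mode):
#   check_defunct_cpd "$(ps -C cpd -o stat=,comm= | count_defunct_cpd)"
```

Splitting the counting from the threshold logic keeps the parsing testable with canned `ps` output, and the same `check_defunct_cpd` call can be fed a count collected over SSH.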

amrutupare1987
Explorer

We are experiencing the same issue with our management server. Following TAC's suggestion, we moved the management server to a new VM as per their SK article and also upgraded it to the latest JHF, Take 65, on R81.20. However, after 36 hours the same issue recurred, and it was only resolved by a reboot. I checked the CPD process, looked for zombie processes, and performed a health check; all indicators are green, with no issues found.

TAC doesn't have an answer for this issue either. Previously they told us it was a VM issue, so we created a new VM, but that didn't resolve it. We need to reboot the VM each time to regain access.

Please let me know if anyone has found a solution.

Natan_Chamilevs
Employee

Hi everyone,

The issue was identified, and a fix for it will be released in the next Jumbo HF (which should be released in the coming weeks).

The issue is caused by a race condition in CPD, where scheduled events might result in zombie processes. It is not caused by a specific change/fix and can be seen in all versions.

The issue is documented in sk182370, and Check Point Support can provide a hotfix until the fix is released in a Jumbo HF.

 

Natan

the_rock
Legend

Awesome news @Natan_Chamilevs 

Thomas_Eichelbu
Advisor

Aha cool,

The mentioned SK is still a bit thin in terms of description and explanation.
What does "scheduled events might result in zombie processes" mean?
Does this mean any custom script running as a cron job might cause the issue?


Natan_Chamilevs
Employee

@Thomas_Eichelbu indeed, the SK was still being edited and has now been released with more information.

Regarding the question on the cron job: no, it is internal scheduling in CPD that caused the issue. The fix adds synchronization so that no scheduled event is missed and all processes finish.

Dale_Lobb
Advisor

As of today, sk182370 lists a fix included in R81.10 starting from Take 152.

When can we expect an HFA fix for R81.20?

 

Dale

PhoneBoy
Admin

I assume it will be rolled into the next JHF.
You can also request a specific hotfix from TAC.
