Solved: Management Server is stuck - user is unable to run... - Page 2

Thomas_Eichelbu · ‎2024-06-11

Hello team,

recently we stumbled over the same issue at three totally independent customers:

The MGMT server stops executing all kinds of Check Point commands.
only "cpwd_admin list" worked and showed all processes as "E" not "T".

Even "reboot" or "init 6" stops working.
only a power cycle via VMware or similar makes it possible to regain control.

if the MGMT is down, the Check Point CA is down, which is very unhealthy for all certificate-based VPN tunnels managed by the same MGMT.
"invalid certificate" messages are then shown in the log.

cpm.elg says:
08/06/24 08:09:44,534 ERROR tracker.dataLog.TrackerDataSenderSvcImp [taskExecutor-31]: AuditLogsToTrackerSender: Unable to connect fwm (down), Exception: Could not receive Message.

and it throws a ton of Java errors in cpm.elg.
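
To get a feel for when these errors pile up, the matching lines can be counted per date. This is only a rough sketch: the function name is made up, the log path in the usage comment is an assumption, and it relies on each entry starting with a date field as in the excerpt above.

```shell
# Count cpm.elg lines that report fwm as unreachable, grouped by the
# leading date field of each log entry.
# Usage (path is an assumption): fwm_down_by_day "$FWDIR/log/cpm.elg"
fwm_down_by_day() {
    grep -F 'Unable to connect fwm' "$1" \
        | awk '{ c[$1]++ } END { for (d in c) print d, c[d] }' \
        | sort
}
```

A sudden jump in the count for one date would at least narrow down when fwm stopped answering.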

we saw this on three different customers, all on R81.20 HFA 53/65

and yes there is that sk:
https://support.checkpoint.com/results/sk/sk173405

it did not work for me to kill "autoupdater" ...

anybody noticed the same?
we have two CP cases ongoing!

best regards

Duane_Toler · ‎2024-06-18

Yep, those 3 task IDs were the ones I found in the hotfix's "crs.xml" file:

$ cat crs/fw1/crs.xml 
<pkg display="'Check" module="fw1_wrapper" type="HF" version="R81_20_JHF_T53_483_MAIN" date="06/11/2024 14:58">
	<fix name="libEntMgrMgmtSync.so" crs="PRJ-44852,PMTR-96733,PMTR-100459"/>
</pkg>
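
If you want to pull the IDs out of a "crs.xml" without reading the XML by eye, something like the following might do. It is a sketch under the assumption that the IDs always sit in a comma-separated crs="..." attribute as shown above; the function name is hypothetical.

```shell
# Print one ID per line from the crs="..." attributes of a crs.xml file.
# Usage: list_crs_ids crs/fw1/crs.xml
list_crs_ids() {
    grep -o 'crs="[^"]*"' "$1" \
        | sed -e 's/^crs="//' -e 's/"$//' \
        | tr ',' '\n' \
        | LC_ALL=C sort -u
}
```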

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

Thomas_Eichelbu · ‎2024-06-18

yes thank you @Duane_Toler

But the issue we face is more than just a bunch of "defunct cpd" processes.
We have the issue that some MGMT servers are totally stuck and do not execute any commands from the CLI.
They need a power cycle to come back to life.
There is the possibility that specific scripts we run at night cause those issues.
Those scripts run a lot of "cprid" commands.
But even if our scripts are the root cause, we saw the issue happen ONLY on weekends, never during the week.

so it is good that "defunct cpd" gets solved, but in our case it is just a sideshow.

best regards

Chris_Wilson · ‎2024-06-18

Absolutely, the defunct cpd processes are something that leads to the problem. We too had the problem that a CMA was essentially down; VPN tunnels started going down because the remote firewall couldn't talk to the CMA to re-establish SIC or whatever it does every 24 hours. We didn't have this problem until about May 24th, and we hadn't installed any hotfixes until the beginning of June when the VPN vulnerability showed up, so applying the hotfix wasn't the cause of our issue.

Duane_Toler · ‎2024-06-18

Do you have something that is able to monitor the process count, such as a Nagios plugin? During the week, log in and run a process check to see how it looks as time passes. You may also notice your CPU utilization increasing. I saw this on one of these hosts.

ssh expert_user@mgmt_host 'ps h -C cpd |grep -c defunct; uptime'

SSH keys will help this be super-automated, too.
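
For a Nagios-style check, something along these lines might work. It is only a sketch: the function names and the warning/critical thresholds (5 and 20) are made up for illustration, not Check Point guidance, and it assumes a procps-style ps and awk are available.

```shell
# Count zombie processes named cpd.
# Reads "STAT COMMAND" lines on stdin, e.g. from:
#   ssh expert_user@mgmt_host 'ps -e -o stat= -o comm=' | count_defunct_cpd
count_defunct_cpd() {
    awk '$1 ~ /^Z/ && $2 == "cpd" { n++ } END { print n + 0 }'
}

# Turn a count into a Nagios status line and return code
# (0 = OK, 1 = WARNING, 2 = CRITICAL). Thresholds are illustrative defaults.
check_status() {
    local count=$1 warn=${2:-5} crit=${3:-20}
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count defunct cpd processes"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count defunct cpd processes"
        return 1
    fi
    echo "OK - $count defunct cpd processes"
    return 0
}
```

Matching on the process state (Z) rather than on the word "defunct" keeps the count independent of how ps prints the command column.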


amrutupare1987 · ‎2024-06-21

We are experiencing the same issue with our management server. Following the TAC's suggestion, we moved the management server to a new VM as per their SK article and also upgraded the management server to the latest JHF Take 65 under R81.20. However, after 36 hours, the same issue recurred. The problem was only resolved after a reboot. I checked the CPD process, zombie processes, and performed a health check; all indicators are green with no issues found.

TAC also doesn't have an answer for this issue. Previously, they told us it was a VM issue, so we created a new VM, but the issue wasn't resolved. We need to reboot the VM each time to gain access.

Please let me know if anyone has found a solution.

Natan_Chamilevs · ‎2024-06-23

Hi everyone,

The issue was identified and a fix for it will be released in the next Jumbo HF (which should be released in the coming weeks).

The issue is caused by a race condition in CPD, where scheduled events might result in zombie processes. This is not caused by a specific change/fix and can be seen in all versions.

The issue is documented in sk182370 and Check Point Support can provide a hotfix until the fix is released in the Jumbo HF.

Natan

the_rock · ‎2024-06-23

Awesome news @Natan_Chamilevs

Best,
Andy
"Have a great day and if its not, change it"

Thomas_Eichelbu · ‎2024-06-23

Aha cool,

the mentioned SK is still a bit thin in terms of description and explanation.
What does "scheduled events might result in zombie processes" mean?
Does this mean any custom script running as a cron job might cause the issue?

Natan_Chamilevs · ‎2024-06-24

@Thomas_Eichelbu indeed, the SK was still being edited and is now released with more information.

Regarding the question on the cron job: no, it's internal scheduling in CPD that caused the issue. The fix adds synchronization so that the scheduling won't miss and all processes will finish.

Dale_Lobb · ‎2024-06-27

As of today, sk182370 now lists a fix included in R81.10 starting from Take 152.

When can we expect a HFA fix for R81.20?

Dale

PhoneBoy · ‎2024-06-28

I assume it will be rolled into the next JHF.
You can also request a specific hotfix from TAC.
