Thomas_Eichelbu
Advisor

Management Server is stuck - user is unable to run any command, seen many times!

Hello team, 

Recently we stumbled over the same issue at three totally independent customers:


The MGMT server stops executing all kinds of Check Point commands.
Only "cpwd_admin list" still worked, and it showed all processes as "E", not "T".

Even "reboot" or "init 6" stops working.
Only a power cycle via VMware or similar regains control.

If the MGMT is down, the Check Point CA is down, which is very unhealthy for all certificate-based VPN tunnels managed by that MGMT.
"Invalid certificate" messages are then shown in the log.

cpm.elg says:
08/06/24 08:09:44,534 ERROR tracker.dataLog.TrackerDataSenderSvcImp [taskExecutor-31]: AuditLogsToTrackerSender: Unable to connect fwm (down), Exception: Could not receive Message.

and it throws a ton of Java errors in cpm.elg.
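For reference, the few things that still respond in this state can be checked roughly like this (a minimal sketch; cpm.elg sits under $FWDIR/log on a Security Management server):

# the watchdog still answers; STAT shows "E" (executing) vs "T" (terminated)
cpwd_admin list

# watch the repeating errors in the management log
tail -f $FWDIR/log/cpm.elg

# quick count of defunct (zombie) processes
ps -e -o stat= | grep -c '^Z'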


We saw this at three different customers, all on R81.20 Jumbo Take 53/65.

And yes, there is this SK:
https://support.checkpoint.com/results/sk/sk173405

Killing "autoupdater" did not work for me ...

Has anybody noticed the same?
We have two Check Point cases ongoing!

best regards


Accepted Solutions

Natan_Chamilevs
Employee

Hi everyone,

The issue was identified, and a fix for it will be released in the next Jumbo HF (expected in the coming weeks).

The issue is caused by a race condition in CPD, where scheduled events might result in zombie processes. It is not caused by a specific change/fix and can be seen in all versions.

The issue is documented in sk182370, and Check Point Support can provide a hotfix until the fix is released in a Jumbo HF.
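Once the Jumbo HF containing the fix is out, you can confirm which take is installed with something like this (standard expert-mode command; the grep is just a convenience):

# list installed hotfixes; the Jumbo bundle line shows the installed take
cpinfo -y all | grep -i jumbo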

 

Natan

Lesley
Advisor

You could consider disabling the CRL check (less secure). It is a workaround while you figure things out with TAC.

At least it may give you some breathing room if you are not able to power cycle the unit right away.

https://support.checkpoint.com/results/sk/sk21156

 

Thomas_Eichelbu
Advisor

Hello Lesley, 

Well, yes, I know that, but that's not the problem itself ...

The problem is that the MGMT becomes unusable, since no services run and no Check Point commands can be executed.
That's the root issue that leads to the unreachable CRL ...
And once the MGMT stops working, you have no chance to apply sk21156 anymore, since you cannot connect to the MGMT database to push policy 🙂


Lesley
Advisor

I understand that. And the SK can be applied before the problem occurs. At least the tunnels will stay up.

the_rock
Legend

Does GuiDBedit load, or does that not work either? I assume rebooting the mgmt does not make a difference?

Duane_Toler
Advisor

Check your cp_mgmt SIC certificate.  The fact that "fwm" is down, together with the certificate errors you described, indicates you may have a problem there.

cpca_client lscert -kind SIC |grep cp_mgmt -A 2

 

the_rock
Legend

Good command!

Thomas_Eichelbu
Advisor

 

Well, I see some expired certificates, but the majority of the certificates are still valid.
Sorry, I cannot post much of the output since it all contains personal data and so on ...

[Expert@ABCDEF:0]# cpca_client lscert -kind SIC |grep cp_mgmt -A 2
Subject = CN=cp_mgmt,O=ABCDEF.X.X.X.X.Y.Y.Y.4cn4gu
Status = Valid Kind = SIC Serial = 7622 DP = 0
Not_Before: Tue May 25 14:22:35 2021 Not_After: Mon May 25 14:22:35 2026

But if the SIC certificate were expired, revoked, or anything but valid, the MGMT would not stay completely down, and I could still run Check Point commands on the CLI.
And if the SIC certificate were invalid, a reboot would not help here ...

As I said, the MGMT server is not running, no services are running, and no Check Point CLI command works.
I'm pretty sure it will be stuck again by tomorrow ... we will see.

Duane_Toler
Advisor

Ok, that's good to rule out.  Have you done the handful of sanity checks as well?  I would expect you have, but again just to rule them out (a few example commands are sketched after the list):

  • check disk space
  • check OS logs to make sure nothing weird is there
  • Since it's a VM, make sure the hypervisor host is ok:
    • datastore disk space
    • datastore access to the SAN or local storage
    • hypervisor RAM
  • Run CPM doctor ($FWDIR/scripts/run_cpmdoc.sh) when the host is functioning normally
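A minimal sketch of those checks from the Gaia CLI, expert mode (the hypervisor items of course have to be checked on the VMware side):

# disk space on all partitions
df -h

# anything unusual in the OS logs
tail -200 /var/log/messages

# CPM doctor report (run while the server is still healthy)
$FWDIR/scripts/run_cpmdoc.sh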

 

If you are able to run OS commands but not Check Point commands, then that does sound like an issue with the registry file, as the SK indicated.

Try this, too:  If the host is functioning now, do a controlled reboot just to see how it behaves.  Since you have a pattern of the host misbehaving on an interval, see if this controlled reboot "buys" you more time for that interval before the next occurrence of the issue. Then look at the CPM debug topics and enable debug for Solr and webservices.  These may give you an additional clue while you wait on TAC.

https://support.checkpoint.com/results/sk/sk115557

Likewise, you may also want to do a separate debug of CPD:
https://support.checkpoint.com/results/sk/sk86320

If it exhibits the issue again, a close examination of the CPM debug  *should* point to the issue at the moment it occurs.
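For the CPD side specifically, the usual debug toggle from sk86320 is roughly this (from memory, so verify the exact syntax against the SK; output goes to $CPDIR/log/cpd.elg):

# raise CPD debug verbosity
cpd_admin debug on TDERROR_ALL_ALL=5

and then run the corresponding "cpd_admin debug off ..." command from the SK once you've captured the issue.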

As a heads-up: TAC may give you the solr_cure process as part of the troubleshooting (sk140394, but it's a TAC internal SK).

 

Thomas_Eichelbu
Advisor

These are all good points.

CPM Doctor did not show anything negative, all green.
Disk space is all good.
I didn't run Solr Cure yet.
Since I'm not controlling the VMware infrastructure, I rely on a third party to check that side.

the_rock
Legend

What did TAC come back with?

Thomas_Eichelbu
Advisor

Nothing so far ...
I'm still waiting.

the_rock
Legend

Ok, fair enough. As far as zombie processes go, I know those are usually fixed by doing cpstop; cpstart or a reboot, but it does not sound like that would do much here. And since you said it also happens on R81.20 Jumbo 65, they can't really ask you to install any other Jumbo hotfix. To comment on CPM doctor: if that does not show any errors, it tells me the database is most likely clean.

Just wondering, how much RAM is there on these servers, any idea?

Andy

Pauli
Participant

@Thomas_Eichelbu 

Hi,
Are there a lot of zombie processes running on the server? Our server currently has this error and is therefore unusable. A restart temporarily fixes it.

best regards

Thomas_Eichelbu
Advisor

Hello, oh yes, many zombies ...

HCP output from customer B (screenshot attached: Bild.png).

Grep for zombies on customer A; here HCP doesn't even run, it dies on the license check, a different story ... maybe?

endless rows of:
[Expert@ABCDEF:0]# ps aux | grep Z | more
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
admin 4653 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>
admin 4654 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>
admin 4655 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>
admin 4657 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>
admin 4658 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>
admin 4659 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>
admin 4661 0.0 0.0 0 0 ? Z 08:38 0:00 [cpd] <defunct>

Customer C, the lucky guy!
No zombies:

[Expert@HIJKLMNO:0]# ps aux | grep Z | more
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
admin 24765 0.0 0.0 2652 572 pts/1 S+ 18:44 0:00 grep --color=auto Z
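If you just want a count of the zombie cpd processes instead of paging through the list, something along these lines works (plain Linux, nothing Check Point-specific):

# count cpd processes in zombie (Z) state
ps -C cpd -o stat= | grep -c Z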


https://support.checkpoint.com/results/sk/sk182370 is about zombies in CPD ...

Maybe a license issue? On customers A & B I see licenses issued for different IPs on the SMS ...
(usage of aliases and so on)
Customer C has licenses issued only for its own real IP.

Duane_Toler
Advisor

Incredibly curious, indeed.  Have you been able to get the hotfix mentioned in that SK?

I happened to check one of my customers, and I also see them with numerous defunct CPD processes.  Theirs is a CloudGuard management server, but I have many other customers with the same deployment (with same Azure template and VM size).  I went through a bunch of logs and didn't find any smoking guns.  I found some concerning logs, but other customers have the same, without issue.

I'm going to request that hotfix from TAC for my one customer, like yours.  Looks like we have the same bug. 😔

Thomas_Eichelbu
Advisor

Well, I had a long phone call with TAC to summarize everything ...
But he didn't say much about the zombies. Maybe they are not as horrible as they sound. At least he didn't pay much attention to them.

And I got no hotfix for the CPD zombies as mentioned in sk182370.
Honestly, I didn't request one.

So I have opened two cases for two customers; they are ongoing. Let's see what TAC will find out.

the_rock
Legend

Hey, any new updates or nothing yet?

Andy

Duane_Toler
Advisor

I got the hotfix from TAC mentioned in the SK (as a portfix for JHF Take 53).  I installed it on a customer's troublesome SmartCenter this evening, so I'll know by morning or so, when Nagios complains about it.  If Nagios hasn't hollered by noon, I'll check it myself and let you know how it goes.

Meanwhile, that script I posted should help you.  Good luck!

Chris_Wilson
Contributor

We have the same thing with one of our CMAs on our MDS.  At least for us, a reboot of the MDS fixes things and we are fine for another 6-10 days until the CPD problem returns.  Working with TAC, one engineer said that T150 was supposed to contain a fix for CPD in the defunct status.  I assume this might be the same hotfix that sk182370 mentions.  We are currently running T141.

The first time we had this problem was May 30, before we installed T141.  T141 was recommended at the time the hotfixes for the VPN problem came out, so we went with that.

With CPD in the defunct status, we had the SIC issues too, and the VPN tunnels started failing, presumably once more than 24 hours had passed since the remote firewalls last checked in with the mgmt server to verify SIC.

Duane_Toler
Advisor

My customer with this issue hasn't had the CPD defunct process situation escalate to total server outage yet.  I have Nagios monitoring the system process counts frequently, so I am able to get to it and restart CPD with the "cpwd_admin" commands in a controlled state.

If you're desperate, make yourself a cron job to do it, too.

EDIT: I made a real script today that will do everything we need (MDS top-level, MDS per-domain, SMS, EPM, SME) and posted it in the ToolBox:

https://community.checkpoint.com/t5/Scripts/Restart-CPD-script/m-p/217862/highlight/true#M1159

 

[Expert@cpmgmt01:0]# ./cpd_restart.sh -h
  cpd_restart.sh:  Restart CPD process on Multi-Domain server and Security/Endpoint management

  Usage: ./cpd_restart.sh  [ -d [ ALL | <specific domain server> ] | [ -h ]

  Options:
    d     Specify a single domain management server (CMA) or special word ALL for all domain
          servers listed in "mdsstat" output (Optional; only relevant for MDS)
    h     This help

  If no argument is given, then the top level CPD process is restarted (for the MDS itself,
  Security Management server, or Endpoint Management server)

 

Run it with a "-d ..."  to restart CPD on a given domain server if that's your troublesome one, or "-d ALL" to restart CPD on all domain servers.  This only restarts CPD and leaves the other processes alone, so there's no outage.  It uses the same methods that Check Point's own scripts use (shameless stole the commands out of $MDSDIR/scripts/cpshared).  This ensures CPD restart is done the correct way and gets re-attached to CPWD for monitoring.

If you just have a single Security Management server, then don't give any arguments and it'll just restart the one process, or the MDS root CPD process.
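For reference, the underlying controlled-restart sequence via the watchdog is commonly documented as roughly the following (verify the exact paths and names on your version before scripting it):

# stop CPD cleanly and detach it from CPWD, then start and re-register it
cpwd_admin stop -name CPD -path "$CPDIR/bin/cpd_admin" -command "cpd_admin stop"
cpwd_admin start -name CPD -path "$CPDIR/bin/cpd" -command "cpd"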

Put that script in /home/admin, chmod 755, then set a job in CLISH:

 

> add cron job CPD_Restart command "/home/admin/cpd_restart.sh" recurrence hourly hours all at 00 

> show cron job CPD_Restart recurrence 
Every day at every hour at the 00 minutes.

 

or for MDS:

> add cron job CPD_Restart command "/home/admin/cpd_restart.sh" recurrence hourly hours all at 00 
> add cron job CPD_Restart_domains command "/home/admin/cpd_restart.sh -d ALL" recurrence hourly hours all at 05

> show cron job CPD_Restart recurrence 
Every day at every hour at the 00 minutes.
> show cron job CPD_Restart_domains recurrence 
Every day at every hour at the 05 minutes.

 

the_rock
Legend

Great commands.

Duane_Toler
Advisor

Whoaaaaaa.... I just happened to notice this on another customer's MDS just now!  R81.20 JHF 41, though.  It's not directly Internet-accessible inbound, either.

 

$ grep -c defunct cpd\ -\ ps\ list.txt 
2426

 This customer has a mere 8 management domains.  Every domain has hundreds of defunct processes.  Sheesh.  I couldn't easily get CPD for each domain stopped, so I just ran mdsstop;mdsstart and ate it.  Yikes.

I'll be keeping a closer eye on these now, and running that cron job I just fabricated. 🙂  I'm going to edit that ad hoc script and add a "ps h -C cpd" and save that to a running debug file for some tracking.
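A minimal sketch of that tracking addition (the log path here is just an example):

# append a timestamped snapshot of all cpd processes for later review
echo "=== $(date '+%F %T') ===" >> /var/log/cpd_tracking.log
ps h -C cpd >> /var/log/cpd_tracking.log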

Like I said, JHF 41 here.  I can throw on a newer one, but my other customer has JHF 53 with the issue.  I have other customer management servers with JHF 41, but no issue.

I already have a TAC case open for the JHF 53 customer, too.  I received  a hotfix today for that one but haven't added it yet.  I'll do that later tonight.

 

the_rock
Legend

Let us know how it goes.

emmap
Employee

For clarity, the fix in that SK is not included in any JHF take at this time.

Duane_Toler
Advisor

Thanks for confirming!  This is what I was expecting as well.

FYI all: "Morning" has come, and on my one host with the hotfix, so far zero defunct CPD processes! 🤞

the_rock
Legend

Great news!

Chris_Wilson
Contributor

That is good to hear!  Do you have the name of the file, or can you post your case SR#, so I can give it to my engineers too?

I was surprised that, after rebooting our MDS yesterday, I had so many defunct processes within 24 hours.  Your script works great, though.

[Expert@Q93-FW-MDS:0]# ps h -C cpd
1280 ? Sl 0:03 /opt/CPmds-R81.10/customers/zzzzz-FW-MGMT/CPshrd-R81.10/bin/cpd
7449 ? Z 0:00 [cpd] <defunct>
9602 ? Ssl 1:51 cpd
10317 ? Z 0:00 [cpd] <defunct>
14559 ? Sl 10:36 /opt/CPmds-R81.10/customers/yyyyyy-FW-MGMT/CPshrd-R81.10/bin/cpd
17343 ? Z 0:00 [cpd] <defunct>
17344 ? Z 0:00 [cpd] <defunct>
17345 ? Z 0:00 [cpd] <defunct>
17346 ? Z 0:00 [cpd] <defunct>
17347 ? Z 0:00 [cpd] <defunct>
17348 ? Z 0:00 [cpd] <defunct>
17349 ? Z 0:00 [cpd] <defunct>
17350 ? Z 0:00 [cpd] <defunct>
17351 ? Z 0:00 [cpd] <defunct>
17353 ? Z 0:00 [cpd] <defunct>
17354 ? Z 0:00 [cpd] <defunct>
17355 ? Z 0:00 [cpd] <defunct>
17356 ? Z 0:00 [cpd] <defunct>
17359 ? Z 0:00 [cpd] <defunct>
17360 ? Z 0:00 [cpd] <defunct>
17361 ? Z 0:00 [cpd] <defunct>
17362 ? Z 0:00 [cpd] <defunct>
17363 ? Z 0:00 [cpd] <defunct>
17364 ? Z 0:00 [cpd] <defunct>
17365 ? Z 0:00 [cpd] <defunct>
17366 ? Z 0:00 [cpd] <defunct>
17367 ? Z 0:00 [cpd] <defunct>
17368 ? Z 0:00 [cpd] <defunct>
17369 ? Z 0:00 [cpd] <defunct>
17370 ? Z 0:00 [cpd] <defunct>
19500 ? Z 0:00 [cpd] <defunct>
19501 ? Z 0:00 [cpd] <defunct>
19502 ? Z 0:00 [cpd] <defunct>
19503 ? Z 0:00 [cpd] <defunct>
19504 ? Z 0:00 [cpd] <defunct>
19505 ? Z 0:00 [cpd] <defunct>
19506 ? Z 0:00 [cpd] <defunct>
19507 ? Z 0:00 [cpd] <defunct>
19508 ? Z 0:00 [cpd] <defunct>
19509 ? Z 0:00 [cpd] <defunct>
19510 ? Z 0:00 [cpd] <defunct>
19511 ? Z 0:00 [cpd] <defunct>
19512 ? Z 0:00 [cpd] <defunct>
19513 ? Z 0:00 [cpd] <defunct>
19514 ? Z 0:00 [cpd] <defunct>
19515 ? Z 0:00 [cpd] <defunct>
19516 ? Z 0:00 [cpd] <defunct>
19517 ? Z 0:00 [cpd] <defunct>
19518 ? Z 0:00 [cpd] <defunct>
19519 ? Z 0:00 [cpd] <defunct>
19520 ? Z 0:00 [cpd] <defunct>
19521 ? Z 0:00 [cpd] <defunct>
19522 ? Z 0:00 [cpd] <defunct>
19523 ? Z 0:00 [cpd] <defunct>
19524 ? Z 0:00 [cpd] <defunct>
19630 ? Sl 6:59 /opt/CPmds-R81.10/customers/xxxxxx-FW-MGMT/CPshrd-R81.10/bin/cpd
31645 ? Z 0:00 [cpd] <defunct>
31702 ? Z 0:00 [cpd] <defunct>
31705 ? Z 0:00 [cpd] <defunct>
31706 ? Z 0:00 [cpd] <defunct>
31707 ? Z 0:00 [cpd] <defunct>
31708 ? Z 0:00 [cpd] <defunct>
31709 ? Z 0:00 [cpd] <defunct>
31710 ? Z 0:00 [cpd] <defunct>
31713 ? Z 0:00 [cpd] <defunct>
31714 ? Z 0:00 [cpd] <defunct>
31715 ? Z 0:00 [cpd] <defunct>
31718 ? Z 0:00 [cpd] <defunct>
31719 ? Z 0:00 [cpd] <defunct>
31720 ? Z 0:00 [cpd] <defunct>
31723 ? Z 0:00 [cpd] <defunct>
31724 ? Z 0:00 [cpd] <defunct>
31725 ? Z 0:00 [cpd] <defunct>
31726 ? Z 0:00 [cpd] <defunct>
31727 ? Z 0:00 [cpd] <defunct>
31728 ? Z 0:00 [cpd] <defunct>
31729 ? Z 0:00 [cpd] <defunct>
31730 ? Z 0:00 [cpd] <defunct>
31731 ? Z 0:00 [cpd] <defunct>
31732 ? Z 0:00 [cpd] <defunct>
31733 ? Z 0:00 [cpd] <defunct>
31735 ? Z 0:00 [cpd] <defunct>

Duane_Toler
Advisor

The current hotfix is JHF-specific, since it's not published.  You can't fetch it as a private package, either; TAC has to issue it to you.  Looks like your management is R81.10, so the hotfix I have is useless to you anyway (mine is for R81.20 JHF 53).  Like Emma said above, this isn't included in any JHF yet, so if you have it installed, you'll have to remove it to apply a new JHF and request a new JHF-specific portfix for it (obviously, request that first).

To get the hotfix, open a case, upload a cpinfo of your management (cpinfo -s SR_NUMBER -x), and hopefully they'll get to you soon.  Just be mindful of that future JHF breakage.

The task IDs for this are:  PRJ-44852, PMTR-96733, PMTR-100459.  I don't know which one is specific to this issue, but I'd wager it's PRJ-44852.

Glad the script helps!  Big thanks to @Thomas_Eichelbu  for raising attention on this!

 

Chris_Wilson
Contributor

Yes, I understand.  But if I give them the task IDs, they can review them and hopefully create a portfix for R81.10 sooner rather than later.
