Re: Management problem with GAIA OS

Matlu · ‎2023-08-22

Hello, everyone.

We have an SMS HA environment running R81.10, with both members on Open Server.

Currently the passive (standby) SMS is extremely slow.
Every time we try to access it by CLI (we connect with PuTTY or another terminal), the prompt for the username and password takes an eternity to appear.

The active member of HA SMS has no problem.

Once we finally get in, the passive member often fails to accept the commands we type. When it does accept a command such as cpview, top, or free -m, it takes forever to display the output.

We have noticed that the CPU seems to be "stressing out" and we believe this is the reason for the problem.

In this scenario, is it advisable to FRESH INSTALL the device, and reinstall the GAIA OS?

I have the impression that the OS has been damaged for some reason.

Do you have any opinion and/or experience similar to the above?

Thanks for your comments.

the_rock · ‎2023-08-22

You can never go wrong with a fresh install; that will always work, for sure. But any idea what's causing the problem? Any process consuming high CPU/memory?

Andy

Best,
Andy
"Have a great day and if its not, change it"

Matlu · ‎2023-08-22

Buddy,

The system is so slow, it is difficult to detect the fault.

When we ran the "cpview" command, we waited a long time to get results, and observed that the CPUs were "too high".

Do you have any command similar to "top" that can catch the exact process that may be consuming all the resources?

We are seriously thinking of "Formatting" the Gaia OS.

😕

the_rock · ‎2023-08-22

ps -auxw

top

free -g
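
To catch the exact process that is consuming resources, the output of ps can be ranked by CPU. The following is a minimal sketch, not an official Check Point procedure: top_cpu is a hypothetical helper name, and it assumes input lines in the form "PID %CPU %MEM COMMAND" with no header.

```shell
# Hypothetical helper: from "PID %CPU %MEM COMMAND" lines on stdin (no
# header), print the N busiest processes by CPU, highest first.
top_cpu() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Usage on a live system (the trailing "=" suppresses each column header):
#   ps -eo pid=,pcpu=,pmem=,comm= | top_cpu 5
```

On systems with procps, ps can also sort by itself with --sort=-pcpu. A single non-interactive snapshot like this is easier to capture on a very slow session than the refreshing top screen.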

Timothy_Hall · ‎2023-08-22

cpstat -f os sensors

Sounds like a classic case of a CPU fan failure, and the CPU downclocking itself to a fraction of the normal clock rate. Could also possibly be a slow/nonresponding DNS server issue, use nslookup or dig to perform a couple of website name lookups to IP addresses. Slow? Timing out? Does this system have a RAID array? Could also be a symptom of a degraded RAID array or a hard drive that has partially failed. Check /var/log/messages.
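
The DNS and log checks suggested above could be scripted roughly as follows. This is a sketch under assumptions: scan_disk_errors and time_lookup are hypothetical helper names, the search patterns are common Linux kernel messages rather than a complete list (so false positives are possible), and checkpoint.com is only a placeholder lookup target.

```shell
# Hypothetical helper: count log lines that hint at disk or RAID trouble.
# The patterns are common Linux kernel messages, not an exhaustive list.
scan_disk_errors() {
    logfile="${1:-/var/log/messages}"
    grep -Eic 'I/O error|medium error|ata[0-9]+.*(failed|error)|raid.*degraded' "$logfile"
}

# Hypothetical helper: time one name lookup, in whole seconds.
# A healthy resolver normally answers in well under a second.
time_lookup() {
    name="${1:-checkpoint.com}"
    start=$(date +%s)
    nslookup "$name" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage on a live system:
#   scan_disk_errors /var/log/messages
#   time_lookup checkpoint.com
```

A nonzero count from the log scan, or lookups that take several seconds, would point toward the disk or DNS explanations rather than a CPU problem.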

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization

Matlu · ‎2023-08-23

Hello, again.

This is information I have been able to collect from the SMS that is failing. It is so slow to enter the SMS prompt, that it has taken me a long time to collect the information.

With cpview, I can see that the CPU is running very high. Obviously I can't capture how the value changes constantly, but at least the images will help show the problem I am referring to.

I hope you can give me some opinion on where you think the error might be.

the_rock · ‎2023-08-23

It's much clearer now, thanks for providing those. I found notes from two years ago where we had the EXACT same issue with a customer and their CP mgmt server in Azure (NOT S1C). After having a TAC case open for more than a month, with them running who in the world knows how many debugs and reviewing files, we ended up restoring a recent backup, and that permanently fixed the issue.

You may ask how that fixed the problem? Answer is...I HAVE NO IDEA lol

At the end of the day, TAC could not tell us either, and at that point I'm sure the customer did not care, as long as the issue was fixed, which it was. Personally, I still don't understand why java was causing such high CPU, because there were no changes made, no new version or jumbo installed, so there was zero logic in any of it.

Also worth pointing out that in their case there was lots of free memory, same as with your management server.

Matlu · ‎2023-08-23

Heck, then it is almost 99% likely that a new installation of Gaia OS from scratch on VMware ESXi will solve the problem.

Well, I think it is the "healthiest" option for our scenario 😞

In our case, the SMS that has problems is the Standby (we have an SMS HA environment).

Do we need to break the SMS HA from SmartConsole in order to reinstall the damaged management server from scratch?

Thanks for your comments.

Timothy_Hall · ‎2023-08-23

Your CPUs are not busy at all, but they are blocked 70-80% of the time waiting for access to the disk (I/O Wait). You have plenty of memory, so either the disk path is somehow bad/corrupt or postgres is going bonkers hitting the database due to corruption, hard to tell from the list of processes. Please provide output of iotop. Talk to your VM person and ask them how utilized the disk path is that is assigned to this VM.
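
One way to see which processes are stuck behind the disk is to list those in uninterruptible sleep (state D), which on Linux usually means they are waiting on I/O. A minimal sketch, assuming input lines in the form "STATE PID COMMAND"; blocked_on_io is a hypothetical helper name.

```shell
# Hypothetical helper: from "STATE PID COMMAND" lines on stdin, print the
# processes in uninterruptible sleep (state D), which usually means they
# are blocked waiting on disk I/O.
blocked_on_io() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage on a live system (the trailing "=" suppresses each column header):
#   ps -eo state=,pid=,comm= | blocked_on_io
# If iotop is installed, a batch capture avoids the constantly changing screen:
#   iotop -o -b -n 3
```

Running it a few times in a row and seeing the same process names in state D would support the I/O wait reading of the CPU figures.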

Matlu · ‎2023-08-23

I cannot obtain any information with the command "cpstat -f os all".

I ran "iotop", but as you will understand, its values change continuously. I hope what I captured is useful.

Thanks for your comments and help

the_rock · ‎2023-08-23

It's cpstat os -f all

Matlu · ‎2023-08-23

I have already obtained a result.

Timothy_Hall · ‎2023-08-23

It looks like the postgres database space reclamation routines (autovacuum) are stuck in a loop and pounding on the hard drive. Probably some kind of corruption in the database that will require a TAC case to fix. I assume rebooting this server does not help? I don't think killing processes and having them restart will do any good if there is indeed a problem in the database itself. Try a reboot if you haven't already.
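
To check whether autovacuum workers really are running continuously, the process list can be sampled a few times. This is only a sketch: count_autovacuum is a hypothetical helper name, and because postgres process titles differ between PostgreSQL versions, the match is deliberately loose.

```shell
# Hypothetical helper: count autovacuum-related postgres processes from
# "ps" output on stdin. Process titles vary by PostgreSQL version, so the
# match is deliberately loose.
count_autovacuum() {
    grep -ci 'postgres.*autovacuum'
}

# Usage on a live system; repeat a few times and compare counts and PIDs:
#   ps -eo pid,etime,args | count_autovacuum
#   ps -eo pid,etime,args | grep -i '[a]utovacuum'
```

Workers that are always present, with elapsed times that keep resetting, would be consistent with the looping behavior described above. It does not prove database corruption by itself.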

Since this is a secondary SMS, if you don't want to uncover the root cause with TAC it probably is faster to just reload it as a secondary SMS and let it resync its full config from the primary.

Matlu · ‎2023-08-23

Hello, @Timothy_Hall

In fact, we have already restarted the SMS many times, and the problem persists (the SMS is still "slow").

We have also opened another case with TAC because of this "postgres" process that seems to be causing headaches for the SMS.

Is it normal for the Java process shown in the image to consume a lot of CPU?

I did not understand your last suggestion. How could I "reload" the SMS and get it to "synchronize its rulebase" again from the primary?

Thanks for your comments.

the_rock · ‎2023-08-23

I could be mistaken when I say this, but I'm pretty sure that in R77 and before, when it came to mgmt HA, if say the secondary was "messed up", you would just fresh install it, "slap" on the same jumbo hotfix, and it would automatically sync with the primary.

Again, I could be wrong with that statement, maybe someone else can confirm.

Andy

Timothy_Hall · ‎2023-08-23

0) Document Gaia OS config of existing secondary SMS - /config/active file.

1) Fresh reload of the Secondary SMS to same code version as primary. Assign same secondary SMS Gaia OS config such as hostname, IP address, routes, etc.

2) Go through first time wizard on new system and declare as Secondary SMS; set SIC activation key.

3) Install Jumbo HFA version on secondary matching primary's.

4) Reset SIC on secondary SMS object in SmartConsole and re-establish.

5) Attach permanent license to secondary SMS.

6) Publish changes, install database, and manually sync peer from Management High Availability screen if needed.

Done.
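
Step 0 above could be done along these lines before the old secondary is wiped. This is a sketch under assumptions: backup_config is a hypothetical helper name, and /var/log/sms_rebuild is an arbitrary example directory, not a Check Point convention.

```shell
# Hypothetical helper: copy a config file into a backup directory,
# appending today's date to the file name.
backup_config() {
    src="$1"
    destdir="$2"
    mkdir -p "$destdir" || return 1
    cp -p "$src" "$destdir/$(basename "$src").$(date +%Y%m%d)"
}

# Usage on the existing secondary, before it is reinstalled:
#   backup_config /config/active /var/log/sms_rebuild
#   clish -c "show configuration" > /var/log/sms_rebuild/gaia_config.txt
# Copy the results off the machine, since the reinstall erases its disk.
```

Having both the raw /config/active file and the clish output makes it easier to reapply the hostname, IP address, and routes in step 1.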

Matlu · ‎2023-08-28

Hello,

We have confirmed the theory you put forward.

Indeed, at VMWARE ESXi level, there is a "deficiency" in the disk distribution.

What we want now is to reinstall the Gaia OS R81.10 (Create a new VM).

Is there any recommendation for this process?

We have a HA SMS environment, and the one that has been damaged is the Standby member.

It is just a matter of "deleting" the VM from ESXi and reloading it from scratch, correct?

Once the new SMS is loaded, can the Active member, which is working fine, send all its "policies and configuration database" to the new machine?

Greetings.

the_rock · ‎2023-08-28

It appears you would need to manually sync it, but that's literally one click.

Andy

Matlu · ‎2023-08-28

Once the Standby SMS is "hooked" back into SmartConsole (and placed in SMS HA), can the active member send its rules database to the standby?

Or do I need to manually import the policy package to the new device?

the_rock · ‎2023-08-28

I don't believe you need to import anything; you just manually sync it.

the_rock · ‎2023-08-23

I would go with what @Timothy_Hall said, since he is way smarter than I, plus, those steps make perfect sense. Also, if it helps, below is official link with steps to follow.

Hope that helps bro.

Cheers,

Andy

https://sc1.checkpoint.com/documents/R81/WebAdminGuides/EN/CP_R81_SecurityManagement_AdminGuide/Topi...

Matlu · ‎2023-08-23

One inquiry,

I am seeing that processes like "java" and "postgres" are literally "eating" the CPU, I think.

I don't know if this is directly related to the SMS slowness problem.

Are these processes important for MGMT?
Can they be "eliminated"?

PhoneBoy · ‎2023-08-24

The management server uses a Postgres database and Java for various functions, including API support.
They cannot be eliminated.
If you don't want to troubleshoot the issue with TAC, then I recommend rebuilding the secondary management server.

the_rock · ‎2023-08-24

No, they can NOT be "eliminated"; they are an essential part of the mgmt database. As @PhoneBoy said, rebuilding is so much easier if you don't want to spend time troubleshooting, but to be fair, I can see why TAC may tell you to rebuild anyway, since fixing an issue like this is not always easy.

As I told you yesterday, we had an issue like this with a customer before. After working with TAC for some time, a few of us had an internal call one day, put all our ideas together, and by process of elimination decided to restore a recent backup, and boom, issue fixed.

Again, please don't ask me how that worked, as I still have no idea...at the end of the day, it's not that I don't care, it would have been nice to know, but it was way more important to fix the problem.

Cheers,

Andy

the_rock · ‎2023-08-23

Bro, command is cpstat os -f all, can you send that as well?

Andy

Daniel_3 · ‎2023-08-22

Hi,

I had similar issues on an open server and the problem was a faulty RAID battery or cache battery. After a battery replacement the system went back to normal state.

Matlu · ‎2023-08-23

Hello,

Sorry, but this RAID point applies directly to an Open Server, right?

Our Gaia OS is running on VMware ESXi.

Is this RAID something that should be checked at the level of the physical server itself?

Regards

Timothy_Hall · ‎2023-08-23

It would have been helpful to know you are running in VMware.

Run the top command. What do the wa (waiting for I/O) and st (steal) percentages look like? If they are constantly nonzero or even very high values, that indicates disk I/O contention (wa) with other virtual machines due to oversubscription of the disk path, or the VM being denied access to the CPU when it wants it due to oversubscription of CPU resources (st) by the hypervisor. Also see here for overall best practices:

sk104848: Best Practices - Performance Optimization of Security Management Server installed on VMwar...
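
The wa and st percentages that top reports can also be computed from two samples of the "cpu" line in /proc/stat, which is handy when the interactive screen is too slow to read. A minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); wa_st_percent is a hypothetical helper name.

```shell
# Hypothetical helper: given two "cpu ..." lines from /proc/stat (older
# sample first), print the iowait and steal share of the interval in percent.
wa_st_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { tot = 0; for (i = 2; i <= 9; i++) tot += $i }
        NR == 1 { t0 = tot; w0 = $6; s0 = $9 }
        NR == 2 {
            dt = tot - t0
            if (dt <= 0) { print "wa=0.0 st=0.0"; exit }
            printf "wa=%.1f st=%.1f\n", 100 * ($6 - w0) / dt, 100 * ($9 - s0) / dt
        }'
}

# Usage on a live system: take two samples five seconds apart.
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   wa_st_percent "$a" "$b"
```

A consistently high wa value points at the disk path, while a consistently high st value points at CPU oversubscription on the hypervisor.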

Matlu · ‎2023-08-23

I understand the above, but I have a question.

Can reinstalling Gaia OS be a definitive solution for our problem?

Or could the problem we have now persist even after installing Gaia OS again from scratch?

Regards

_Val_ · ‎2023-08-23

Why would you reinstall in the first place?
