Re: Problem with HA

Matlu · ‎2023-02-10

Hello,

I have a ClusterXL cluster of two gateways, 12200 appliances running R80.40.

The cluster is managed from a CMA (our environment runs on an MDS, on which we have created three CMAs in total).

We had some problems with the security rule base on the cluster. As a result, after "restarting" the equipment, HA "broke": the CLI started showing the alert "HA MODULE NOT STARTED", and the cluster members no longer appeared in the CLI output.

Only the member that "was stopped" appeared, and its status was "DOWN".

Very strange.

Can someone guide me on how to solve this?

Thanks for your comments.

Timothy_Hall · ‎2023-02-10

Sounds like someone disabled ClusterXL from cpconfig and that change became effective upon reboot, see here: sk88360: 'Error: 'ClusterXL' is not responding. Verify that 'ClusterXL' is installed on the gateway'...

Chris_Atkinson · ‎2023-02-10

There is plenty of SK coverage for scenarios involving this error, even though the 12200s are EOL devices.

What do you see with the following:

fw stat

cpstat -f policy fw

CCSM R77/R80/ELITE

Matlu · ‎2023-02-10

Hello,

Currently only one ClusterXL member is working.
The customer decided to "shut down" the other unit for the moment, because it was causing "intermittency" in the cluster after it was restarted.

At the moment, the only data I have is from the member that is still working.

I am going to the customer's data center to check the "down" machine from the CLI and run the commands you recommended.

Do you have any additional recommendations that could be useful in this scenario?

Regards.

Matlu · ‎2023-02-10

Hello,

Here are the outputs of the recommended commands on the "problematic" FW.

I have tried to fix the problem by:

1) Restarting the appliance more than once.
2) Applying sk88360.

Neither solved the problem.

In SmartConsole, this gateway appears as "lost".

the_rock · ‎2023-02-10

Can you compare that file on both cluster members? Personally, I have NEVER seen that message before, and I have dealt with lots of clusters.

Best,
Andy
"Have a great day and if its not, change it"

Matlu · ‎2023-02-10

Hello, Rock

How can I "read" those files that appear when I run the "cpconfig" command?

I am currently in the path "/home/admin".

On the member that is currently running, this is what appears after running "cphaprob state".

Thanks for your help, buddy.

the_rock · ‎2023-02-10

Keep in mind, Gaia has always been based on Linux, so it's simply the Linux cat command.

You can run the command below from expert mode. See if the same command works on the problematic fw as well.

Andy

expert mode -> cat /etc/fw.boot/ha_boot.conf

Matlu · ‎2023-02-10

Thank you for your response, Andy.

I can only read the file on the member that is working now; the broken member does not let me "read" it.

Here is what I see...

I find this very strange.
On the damaged member, I also cannot run "cpinfo -y all" to see the current JHF.

Any suggestions?

Bob_Zimmerman · ‎2023-02-10

Check the space on the problem member. If the drive is full or is read-only for some reason, it won't be able to create the file and open it to write to it. Run these two commands on the problem member:

df -k

mount
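
The two checks above can be turned into a quick filter. This is a minimal sketch, assuming GNU df (its -P flag keeps each filesystem on one line) and an arbitrary 95% threshold; the function names full_filesystems and readonly_mounts are made up for illustration.

```shell
# Sketch only: helper names and the 95% default threshold are assumptions,
# not Check Point recommendations.

# Reads `df -kP` output on stdin and prints "<mount point> <use%>" for every
# filesystem at or above the given usage threshold (default 95).
full_filesystems() {
    awk -v limit="${1:-95}" '
        NR > 1 {                    # skip the header line
            use = $5
            sub(/%/, "", use)       # strip the percent sign
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Reads `mount` output on stdin and prints the lines whose option list
# contains "ro" (read-only).
readonly_mounts() {
    grep -E '[(,]ro[,)]'
}

# Usage on the problem member (expert mode):
#   df -kP | full_filesystems 95
#   mount  | readonly_mounts
```

No output from either filter suggests the disk is neither full nor read-only, so the cause is probably elsewhere.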

the_rock · ‎2023-02-10

I think what @Timothy_Hall said makes perfect sense. If you look at the output below from my R81.20 lab, which is a single gateway, you will see the same message (obviously, it's just one fw, so no clustering). I also have a perfectly working cluster lab on R81.10, so I'm happy to do any testing you need.

Andy

[Expert@quantum-firewall:0]#
[Expert@quantum-firewall:0]# fw ver
This is Check Point's software version R81.20 - Build 703
[Expert@quantum-firewall:0]# cphaprob state

HA module not started.

[Expert@quantum-firewall:0]# cpconfig
This program will let you re-configure
your Check Point products configuration.

Configuration Options:
----------------------
(1) Licenses and contracts
(2) SNMP Extension
(3) PKCS#11 Token
(4) Random Pool
(5) Secure Internal Communication
(6) Enable cluster membership for this gateway
(7) Check Point CoreXL
(8) Automatic start of Check Point Products

(9) Exit

Enter your choice (1-9) :9

Thank You...
[Expert@quantum-firewall:0]#

Matlu · ‎2023-02-10

Hello,

I am trying to check the "TRUST STATE" between the SMS and the GW, which is failing. In SmartConsole, the GW currently shows an alert saying that it is "lost".

When I open the GW object properties in SmartConsole and go to "TRUST STATE", it shows "Trust Established", but when I click "Test SIC Status", the following error message appears.

SIC Status for SBORINT1RENIEC: Unknown

Could not establish TCP connection with 10.47.2.220

** Please make sure that Check Point Services are running on SBORINT1RENIEC and that TCP connectivity is allowed from Security Management Server to IP 10.47.2.220, Port 18191 **
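
The error above boils down to a plain TCP reachability test toward port 18191 on the gateway. Below is a minimal sketch of the same check from the management server's expert shell, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available there; the function name tcp_port_open is made up for illustration.

```shell
# Sketch only: assumes bash and coreutils `timeout` exist on the box.
# Prints "open" if a TCP connection to host:port completes within the
# timeout (default 5 seconds), otherwise "closed".
tcp_port_open() {
    host="$1"; port="$2"; wait_s="${3:-5}"
    if timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# Usage, with the address and port taken from the error message:
#   tcp_port_open 10.47.2.220 18191
```

A "closed" result only says the connection did not complete; it cannot tell a policy drop on the gateway from a stopped service or a routing problem.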

Chris_Atkinson · ‎2023-02-10

Per your previous "fw stat" output, the gateway doesn't have a proper policy loaded at the moment to allow traffic; it is running "defaultfilter".

Before I give further suggestions: how is the member currently forced down/offline? Are its interfaces connected or isolated?

Probably much easier to work this with TAC via a remote session.

Matlu · ‎2023-02-10

Hello,

When I run "cphaprob -a if" on the member that is currently working fine, I get the following result.

The customer's network is quite messy.
From the result I am sharing, I understand that the sync interface is down, right?

I have checked the physical port on both firewalls (eth7), and a patch cord directly connects the two eth7 ports.

It does not go through a switch, as I would have expected.

No matter how much I try to "link" the failing GW to the SMS, I can't, because I keep getting SIC failures.

Any suggestions?

Chris_Atkinson · ‎2023-02-10

You are not answering the questions, which makes it difficult to provide guidance. To expedite, please contact TAC.

That being said, to allow communication to Mgmt again you would likely need to issue "fw unloadlocal" on the problematic member, as it is currently being blocked by "defaultfilter".

But since we don't know why the cluster member had issues originally, you likely want to ensure it won't become the active member in an uncontrolled way immediately after the policy issue is resolved.

Matlu · ‎2023-02-10

I understand your point of view.

I have a question: as far as I remember, "fw unloadlocal" removes all the policies from that firewall, right?

Is it advisable to run this command with the member "isolated" from the HA?
I mean, is it necessary to disconnect the network cables from the problematic firewall before doing this, or not?

Once the firewall manages to synchronize with the SMS again, what guarantees that it "has" all the policies and is aligned with the firewall that is working well?

Thanks for your comments.

the_rock · ‎2023-02-10

Put it this way...everything Chris told you makes sense. The more info you give us, the more we can help you fix this. Here is my take...and I'm not saying I'm correct in stating this, but it sounds to me like you have a wayyy bigger issue than just clustering on that machine.

Let's start with the basics:

1) What does fw stat show? If it shows defaultfilter, that blocks EVERYTHING, so you need to run fw unloadlocal...yes, it will unload the policy and all incoming connections to the firewall will be accepted, but don't be too concerned about that for the time being, as the correct policy needs to be loaded anyway

2) What does cpwd_admin list show?

3) From problematic firewall, can you run ANY commands? cpview, cpinfo, cpconfig, anything?
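
The first two questions in the list above can be collected in one pass. This is a minimal sketch that assumes only what this thread states (that fw stat names the loaded policy and that the blocking policy is called defaultfilter); the helper names are made up, and nothing here changes the firewall.

```shell
# Sketch only: read-only triage, helper names are hypothetical.

# Reads `fw stat` output on stdin. Prints "defaultfilter" if that policy
# name appears anywhere in the output, otherwise "other".
loaded_policy_kind() {
    if grep -q 'defaultfilter'; then
        echo "defaultfilter"
    else
        echo "other"
    fi
}

# Runs the first two checks on the problem member (expert mode).
triage() {
    if [ "$(fw stat | loaded_policy_kind)" = "defaultfilter" ]; then
        echo "defaultfilter is loaded: management traffic is blocked"
    fi
    cpwd_admin list
}
```

The sketch deliberately stops short of running fw unloadlocal, since that step depends on how the member is isolated.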

Andy

Chris_Atkinson · ‎2023-02-10

At least the interface leading to Mgmt needs to be up. Temporarily isolating the external interface of the problematic member is likely wise, but that will depend on your topology.

Once connectivity to Mgmt is restored you would reinstall (push) the policy from the Mgmt.
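
After the policy push, the two symptoms from this thread can be rechecked. This is a minimal sketch, assuming the literal strings "HA module not started" and "defaultfilter" seen earlier in the thread are what the commands print in the bad state; the function name ha_module_status is made up for illustration.

```shell
# Sketch only: the matched string is taken from output quoted in this thread.

# Reads `cphaprob state` output on stdin. Prints "not started" if the
# "HA module not started" message is present, otherwise "message absent".
ha_module_status() {
    if grep -qi 'HA module not started'; then
        echo "not started"
    else
        echo "message absent"
    fi
}

# Usage on the recovered member (expert mode):
#   cphaprob state | ha_module_status
#   fw stat            # should no longer list defaultfilter
```

"message absent" only rules out this one symptom; the full cphaprob state output still needs to be read to confirm the member states.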

Again, working through this with TAC is probably a good idea in case any unexpected issues arise.
