Matlu
Advisor

Problem with HA

Hello,

 

I have a ClusterXL of two gateways (12200 appliances) running R80.40.

The cluster is managed from a CMA (our environment runs on an MDS, on which we have created three CMAs in total).

We had some problems with the security rule base on the cluster. After we restarted the appliances, HA "broke": the CLI started showing the alert "HA MODULE NOT STARTED", and the cluster members no longer appeared in the CLI output.

Only the member that "was stopped" appeared, and its status was "DOWN".

Very strange.

Can someone guide me on how to solve this?

 

Thanks for your comments.

17 Replies
Timothy_Hall
Legend

Sounds like someone disabled ClusterXL from cpconfig and that change became effective upon reboot, see here: sk88360: 'Error: 'ClusterXL' is not responding. Verify that 'ClusterXL' is installed on the gateway'...

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Chris_Atkinson
Employee

There is plenty of SK coverage for scenarios involving this error, even though the 12200 appliances are EOL.

What do you see with the following:

fw stat

cpstat -f policy fw

 

 

CCSM R77/R80/ELITE
Matlu
Advisor

Hello,

Currently only one ClusterXL member is working.
The customer decided to shut down the other unit for the moment, because it was causing intermittent outages in the cluster after it was restarted.

At the moment, the data I have is from the member that is still working.

I am going to the customer's data center to check the "down" machine from the CLI and run the commands you recommended.

ClusterXL2.jpg

Do you have any additional recommendations that could be useful in this scenario?

Regards.

Matlu
Advisor

Hello,

I am sharing the outputs of the recommended commands from the "problematic" firewall.

ClusterXL3.jpg ClusterXL4.jpg ClusterXL5.jpg

I have tried to fix the problem by:

1) Restarting the appliance more than once.
2) Applying sk88360.

Neither solved the problem.

In SmartConsole, this member appears as "lost".

 

the_rock
Legend

Can you compare that file on both cluster members? Personally, I have NEVER seen that message before, and I have dealt with lots of clusters.

Matlu
Advisor

Hello, Rock

How can I "read" the files that appear when I run the "cpconfig" command?

I am currently in the path "/home/admin".

On the member that is currently running, the output of "cphaprob state" is the following.

ClusterXL6.jpg

Thanks for your help, buddy.

the_rock
Legend

Keep in mind that Gaia has always been based on Linux, so it's simply the Linux cat command.

You can run the command below from expert mode. See if the same command works on the problematic firewall as well.

Andy

expert mode -> cat /etc/fw.boot/ha_boot.conf
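
If the file is readable on both members, one way to compare them is to save a copy from each member and diff the copies. The sketch below is only an illustration: the function name and the idea of working on saved copies are assumptions, and it relies on nothing beyond standard bash and diff.

```shell
#!/bin/bash
# Hypothetical helper (not a Check Point tool): compares two saved copies
# of a file, e.g. one copy of ha_boot.conf taken from each cluster member.
# Returns 0 if identical, 1 if they differ, 2 if a copy cannot be read.
compare_copies() {
    local a="$1" b="$2" f
    for f in "$a" "$b"; do
        [ -r "$f" ] || { echo "cannot read: $f" >&2; return 2; }
    done
    # Unified diff: lines prefixed with - or + show where the copies differ.
    diff -u "$a" "$b"
}
```

For example, after copying the file from each member to hypothetical paths such as /var/tmp/member1_ha_boot.conf and /var/tmp/member2_ha_boot.conf, run: compare_copies /var/tmp/member1_ha_boot.conf /var/tmp/member2_ha_boot.conf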

Matlu
Advisor

Thank you for your response, Andy.

The file can only be read on the member that is working now. The broken member does not let me read it.

See the screenshots:

Cluster7.jpg Cluster8.jpg

I find this very strange.
On the broken member, I also cannot run "cpinfo -yall" to see the current JHF.

Any suggestions?

Bob_Zimmerman
Authority

Check the space on the problem member. If the drive is full or is read-only for some reason, it won't be able to create the file and open it to write to it. Run these two commands on the problem member:

df -k

mount
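
As a rough sketch, the two checks can be filtered so that only suspicious entries are printed. The function names and the 95% default threshold are made up for illustration, and the sketch assumes standard bash and awk in expert mode; nothing in it is Check Point specific.

```shell
#!/bin/bash
# Hypothetical helpers; the names and the default threshold are illustrative.

# Reads `df -kP` output on stdin and prints the mount points whose usage
# is at or above the given percentage (default 95).
full_filesystems() {
    local threshold="${1:-95}"
    awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6 }'
}

# Reads /proc/mounts-style lines on stdin and prints the mount points
# that carry the "ro" (read-only) mount option.
readonly_mounts() {
    awk '{ n = split($4, opts, ","); for (i = 1; i <= n; i++) if (opts[i] == "ro") print $2 }'
}
```

Usage on the problem member: df -kP | full_filesystems 95, then readonly_mounts < /proc/mounts. Some pseudo-filesystems are normally mounted read-only, so only a read-only root or log partition would be a concern.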

the_rock
Legend

I think what @Timothy_Hall said makes perfect sense. Check out the output below from my R81.20 lab, which is a single gateway (so, obviously, no clustering). I also have a perfectly working cluster lab on R81.10, so I am happy to do any testing you need.

Andy


[Expert@quantum-firewall:0]#
[Expert@quantum-firewall:0]# fw ver
This is Check Point's software version R81.20 - Build 703
[Expert@quantum-firewall:0]# cphaprob state

HA module not started.

[Expert@quantum-firewall:0]# cpconfig
This program will let you re-configure
your Check Point products configuration.


Configuration Options:
----------------------
(1) Licenses and contracts
(2) SNMP Extension
(3) PKCS#11 Token
(4) Random Pool
(5) Secure Internal Communication
(6) Enable cluster membership for this gateway
(7) Check Point CoreXL
(8) Automatic start of Check Point Products

(9) Exit

Enter your choice (1-9) :9

Thank You...
[Expert@quantum-firewall:0]#

Matlu
Advisor

Hello,

I am trying to check the trust state between the SMS and the failing GW. In SmartConsole, the GW currently shows an alert saying that it is "lost".

When I open the GW object properties in SmartConsole and go to "Trust State", it shows "Trust Established", but when I click "Test SIC Status", the following error message appears.

SIC Status for SBORINT1RENIEC: Unknown

Could not establish TCP connection with 10.47.2.220

** Please make sure that Check Point Services are running on SBORINT1RENIEC and that TCP connectivity is allowed from Security Management Server to IP 10.47.2.220, Port 18191 **
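
The TCP reachability part of that message can be checked from the management server's expert shell. The following is a minimal sketch, assuming bash's /dev/tcp redirection and the coreutils timeout command are available there; the function name is hypothetical. It only tests whether the port accepts a connection and says nothing about SIC itself.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if a TCP connection to host:port can be
# opened within the timeout (default 5 seconds), non-zero otherwise.
tcp_port_open() {
    local host="$1" port="$2" secs="${3:-5}"
    # The inner bash receives host as $0 and port as $1.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

With the values from the error above: tcp_port_open 10.47.2.220 18191 && echo "port reachable" || echo "port not reachable"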

Chris_Atkinson
Employee

Per your previous "fw stat" output, the gateway doesn't have a proper policy at the moment to allow traffic; it shows "defaultfilter".

Before I give further suggestions: how is the member currently forced down/offline? Are its interfaces connected or isolated?

Probably much easier to work this with TAC via a remote session.

 

 

 

Matlu
Advisor

Hello,

When I run the command "cphaprob -a if" on the member that is currently working fine, I get the following result.

The customer's network is quite messy.
From the result I am sharing, I understand that the sync interface is down, right?

Cluster9.jpg

I have checked the physical port on both firewalls, eth7, and a patch cord directly connects the two eth7 ports.

It does not go through a switch, as I had assumed.

Currently, however much I try to "link" the failing GW to the SMS again, I can't, because I get constant SIC failure errors.

Any suggestions?

Chris_Atkinson
Employee

You are not answering the questions, which makes it difficult to provide guidance. To expedite, please contact TAC.

That being said, to allow communication with Mgmt again you would likely need to issue "fw unloadlocal" on the problematic member, as that traffic is currently being blocked by "defaultfilter".

But since we don't know why the cluster member had issues originally, you likely want to ensure it won't become the active member in an uncontrolled way immediately after the policy issue is resolved.

 

Matlu
Advisor

I understand your point of view.

I have a question: as far as I remember, "fw unloadlocal" removes all the policies from that firewall, right?

Is it advisable to isolate the appliance from the HA before applying this command?
I mean, is it necessary to disconnect the network cables from the problematic firewall before doing this, or not?

Once the firewall manages to synchronize with the SMS again, what guarantees that it has all the policies and is aligned with the firewall that is working well?

Thanks for your comments.

the_rock
Legend

Put it this way... everything Chris told you makes sense. The more info you give us, the more we can help you fix this. Here is my take, and I'm not saying I'm correct, but it sounds to me like you have a way bigger issue than just clustering on that machine.

Let's start with the basics:

1) What does fw stat show? If it shows defaultfilter, that blocks EVERYTHING, so you need to run fw unloadlocal. Yes, it will unload the policy and all incoming connections to the firewall will be accepted, but don't be too concerned about that for the time being, as the correct policy needs to be loaded anyway.

2) What does cpwd_admin list show?

3) From the problematic firewall, can you run ANY commands? cpview, cpinfo, cpconfig, anything?
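
Checks 1 and 3 above could be scripted roughly as follows. This is only a sketch: both function names are hypothetical, the first simply searches whatever fw stat prints for the word defaultfilter without assuming the rest of the output format, and the second only reports whether a command is found on the PATH, not whether it runs correctly.

```shell
#!/bin/bash
# Hypothetical helpers for the basic checks; names are illustrative only.

# Reads `fw stat` output on stdin; returns 0 if the word "defaultfilter"
# appears anywhere in it.
shows_defaultfilter() {
    grep -qw 'defaultfilter'
}

# Prints each given command name that cannot be found on the PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}
```

On the problematic member this might be used as: fw stat | shows_defaultfilter && echo "defaultfilter is loaded", followed by missing_commands cpview cpinfo cpconfig cpwd_admin.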

Andy

Chris_Atkinson
Employee

The interface leading to Mgmt at least needs to be up. Temporarily isolating the external interface of the problematic member is likely wise, but that will depend on your topology.

Once connectivity to Mgmt is restored, you would reinstall (push) the policy from Mgmt.

Again, working through this with TAC is probably a good idea in case any unexpected issues arise.

 
