Matlu
Advisor

My context HA environment broke in VSX

Hello, everyone.

I don't have much experience with VSX and MDS environments, so I hope you can clear up a doubt.

I currently have a problem in one of my contexts, and the HA of the Cluster has been lost.

[Attached screenshot: CD1.png]

If the “Standby” member of the context is lost, what can be the “most practical way” to recover its operation?

Should I still be able to reach the member that appears as “Lost” through the CLI?

Can I check the root-cause of why the Cluster of my context was “broken”?

Thanks for your comments.

18 Replies
Duane_Toler
Advisor

The first step is to log in to that second node, then run "cphaprob stat" for VS0. Then go to that one VS (vsenv 3) and run "cphaprob stat" again per-VS. Start there and it should give you some hints.
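
As a rough sketch of that check (not an official procedure), the loop below prints the ClusterXL state for each VS ID you pass in. It assumes an Expert-mode shell on the VSX member where vsenv and cphaprob are available; check_vs_ha is only a hypothetical helper name.

```shell
# Sketch: print the ClusterXL state for every Virtual System ID given.
# Assumption: run in Expert mode on a VSX member, where `vsenv` and
# `cphaprob` exist. check_vs_ha is a hypothetical helper name.
check_vs_ha() {
    for vsid in "$@"; do
        echo "=== VS ${vsid} ==="
        vsenv "$vsid" >/dev/null   # switch the shell to this VS context
        cphaprob stat              # cluster state as seen from this VS
    done
}

# Usage for this thread (VS0 first, then the broken VS):
#   check_vs_ha 0 3
```

Comparing the output for VS0 and VS 3 on both members should show whether the problem is limited to the one context.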

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack
the_rock
Legend

I don't know whether all of the commands below work on VSX, but they are worth comparing.

Andy

cphaprob roles

cphaprob state

cphaprob -i list

cphaprob -l list

cphaprob syncstat

Lesley
Mentor

Are you able to push policy to the problem VS and to VS0?

Is the issue still there if you restart the VS itself?

  1. Connect to the command line on the VSX Gateway.
  2. Go to the context of the Virtual System:
    • In Gaia Clish, run:
      set virtual-system <VSID>
    • In the Expert mode, run:
      vsenv <VSID>
  3. Stop the Virtual System:
    cpstop
  4. Start the Virtual System:
    cpstart
the_rock
Legend

That sounds very logical to me.

Andy

Matlu
Advisor

Hello,

Initially the problem affected only one Virtual System (ID 3), but a few hours later the whole box (VS0) rebooted unexpectedly, for no apparent reason.

The device is up again, but VS 3 is still not available in the cluster.

Are there any files that might indicate the root cause of why an instance like this crashes?

Regards.

genisis__
Mentor

What version are you running, and which Jumbo is installed?
What files are in /var/log/crash and /var/log/dump?
If you see files there that match the time the node rebooted, pull them off and run a cpinfo as soon as possible.

Raise a TAC case to investigate these files (if there are any).
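
To see quickly whether anything in those directories lines up with the reboot, a sketch like the one below lists only recently modified files. list_recent_dumps is a hypothetical helper name, and the 2-day window in the usage line is an arbitrary assumption; adjust it to the time of your reboot.

```shell
# Sketch: list files modified within the last DAYS days in the given
# directories, with timestamps, so they can be compared to the reboot time.
# list_recent_dumps is a hypothetical helper name.
list_recent_dumps() {
    days="$1"
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue          # skip directories that do not exist
        find "$dir" -type f -mtime "-${days}" -exec ls -l {} +
    done
}

# Usage (the 2-day window is an assumption):
#   list_recent_dumps 2 /var/log/crash /var/log/dump
```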

How long has the VS been stable? If it's been good, what changed in the environment?
As the others have said, start with the cphaprob commands to determine the status (I suspect this will give you a clue).

 

Matlu
Advisor

Hello.
I am running R81.20 with JHF Take 82.

[Expert@FWCP-AC:3]# cphaprob state

HA module not started.

[Expert@FWCP-AC:3]# last reboot
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:30 - 17:52 (02:22)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:22 - 17:52 (02:30)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:15 - 17:52 (02:36)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:06 - 17:52 (02:45)

wtmp begins Tue Mar 18 12:32:46 2025
[Expert@FWCP-AC:3]#

Greetings

the_rock
Legend

Can you make sure clustering is enabled via cpconfig? If it is, maybe try cphastop; cphastart

Andy

G_W_Albrecht
Legend

Do you already have an SR# open with CP TAC? I fear getting help here is not so easy...

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist
Matlu
Advisor

Hello,


Where can I review the logs related to cluster events, for example when the cluster breaks, one of the members freezes, or there is an unexpected failover?
Is there a way to review such events from the last 24 hours?
Could you share the syntax, please?


Thanks for the help.

the_rock
Legend

Hey bro, it should be all in /var/log/messages

example:

grep -i DOWN /var/log/messages*

You can replace the word DOWN with anything else you wish to search for, e.g. clusterXL, freeze, etc.
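
Since the question was about the last 24 hours, here is a hedged sketch that searches several keywords at once and only looks at messages files modified within the last day. recent_cluster_events is a hypothetical helper name and the keywords are just examples from this thread. Note that it selects files by modification time, so older lines inside a file that was written to recently are still printed.

```shell
# Sketch: search messages* files changed in the last 24 hours for any of
# several keywords (case-insensitive, extended regex).
# recent_cluster_events is a hypothetical helper name.
recent_cluster_events() {
    logdir="$1"
    pattern="$2"
    # -mtime -1 keeps files modified less than 24 hours ago;
    # -H prints the file name in front of every matching line.
    find "$logdir" -maxdepth 1 -type f -name 'messages*' -mtime -1 \
        -exec grep -i -E -H -- "$pattern" {} +
}

# Usage (keywords are examples only):
#   recent_cluster_events /var/log 'clusterxl|freeze|failover'
```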

Matlu
Advisor

Buddy,

The command you just gave me works very well, but is there a way to print only the last 100 lines?

As it is, the command prints everything.

the_rock
Legend

grep -i DOWN /var/log/messages* > /var/log/clusterissue.txt

cd /var/log

tail -50 clusterissue.txt
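
A variant without the temporary file, as a sketch. One thing to be aware of: the shell expands /var/log/messages* in name order (messages, messages.1, messages.10, messages.2, ...), not in time order, so a plain tail over the combined output does not necessarily give the newest matches. The hypothetical helper last_matches below feeds the files to grep oldest first, by modification time, so that tail returns the most recent hits. It assumes the rotated files are plain text.

```shell
# Sketch: print only the last COUNT matching lines, newest at the bottom.
# last_matches is a hypothetical helper name.
last_matches() {
    count="$1"
    pattern="$2"
    shift 2
    # ls -tr lists the given files oldest first (by modification time),
    # so the final lines of the combined output are the newest matches.
    ls -tr -- "$@" | while IFS= read -r f; do
        grep -i -- "$pattern" "$f"
    done | tail -n "$count"
}

# Usage, for the last 100 lines asked about above:
#   last_matches 100 DOWN /var/log/messages*
```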

the_rock
Legend

example in my lab.

Andy

***********************

[Expert@CP-FW-01:0]# grep -i DOWN /var/log/messages* > /var/log/clusterissue.txt; cd /var/log; tail -50 clusterissue.txt
/var/log/messages.10:Mar 18 14:27:50 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.10:Mar 18 14:27:51 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
/var/log/messages.3:Mar 19 02:30:10 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.3:Mar 19 02:30:12 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
/var/log/messages.6:Mar 18 20:29:01 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.6:Mar 18 20:29:01 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
[Expert@CP-FW-01:0]#
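
Note that every hit in this lab output is the word "Download", because grep -i DOWN also matches substrings. A minimal sketch of a stricter search uses grep's -w option, which only matches whole words. grep_word is a hypothetical helper name.

```shell
# Sketch: case-insensitive search that matches whole words only, so
# "DOWN" and "down" match but "Download" does not.
# grep_word is a hypothetical helper name.
grep_word() {
    word="$1"
    shift
    grep -i -w -- "$word" "$@"
}

# Usage:
#   grep_word down /var/log/messages*
```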

Alex-
Leader

Consider running an HCP check on your system as well.

AkosBakos
Mentor

Before you restart your VS, arrange a maintenance window for the outage!

----------------
\m/_(>_<)_\m/
Matlu
Advisor

Hi.

I rebooted the standby box of my VSX cluster, but when it came back up, the cluster for VS ID 3 was still broken.

I entered VS ID 3 on the standby member and ran ‘cphastart’. The member immediately appeared in the cluster, but with the status DOWN.

I then ran ‘clusterXL_admin up’, but the member did not change status and remains DOWN.

It is very strange.

Is there any other way to recover this member so that it correctly joins the cluster for my VS 3?

Greetings.

the_rock
Legend

Hey bro,

I know this may sound weird, but a few times when I had this problem with customers, we had to reboot BOTH boxes to get it working.

Andy
