Re: My context HA environment broke in VSX

Matlu

Hello, everyone.

I don't have much experience with VSX and MDS environments, so I hope you can help clear up a few doubts.

I currently have a problem in one of my contexts: the cluster's HA has been lost.

If the “Standby” member of the context is lost, what is the most practical way to recover it?

Should I still be able to reach the member that appears as “Lost” through the CLI?

Is there a way to find the root cause of why the cluster for my context broke?

Thanks for your comments.

Duane_Toler

First step is to log in to that second node, then run "cphaprob stat" in VS0. Then switch to the affected VS (vsenv 3) and run "cphaprob stat" again in that context. Start there and it should give you some hints.

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

the_rock

I don't know if all of the commands below work on VSX, but it's worth comparing their output.

Andy

cphaprob roles

cphaprob state

cphaprob -i list

cphaprob -l list

cphaprob syncstat

Lesley

Are you able to push policy to the problem VS, and to VS0?

Is the issue still there if you restart the VS itself?

Connect to the command line on the VSX Gateway.
Go to the context of the Virtual System:
- In Gaia Clish, run:
  set virtual-system <VSID>
- In the Expert mode, run:
  vsenv <VSID>
Stop the Virtual System:
cpstop
Start the Virtual System:
cpstart


the_rock

That sounds very logical to me.

Andy

Matlu

Hello,

Initially the problem affected only one Virtual System (vsenv 3), but some hours later the whole box (VS0) rebooted unexpectedly, for no apparent reason.

The device is up again, but “vsenv 3” is still not available to the cluster.

Are there any files that would point to a possible root cause of why a VS crashes like this?

Regards.

genisis__

What version are you running and which Jumbo is installed?
What files are in /var/log/crash and /var/log/dump?
If you see files there that match the time the node rebooted, pull them off and get a cpinfo run as soon as possible.

Raise a TAC case to investigate these files (if there are any).
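
If it helps, here is a rough way to list only the files modified around the reboot. It's a sketch that assumes GNU find (for -newermt); files_in_window is a made-up helper name and the timestamps in the usage comment are placeholders, so put in your own window.

```shell
#!/bin/sh
# files_in_window START END DIR...
# Lists regular files under each DIR whose modification time falls
# between START and END. Assumes GNU find (the -newermt test).
files_in_window() {
    start="$1"
    end="$2"
    shift 2
    for dir in "$@"; do
        # Skip directories that do not exist on this system
        [ -d "$dir" ] || continue
        find "$dir" -type f -newermt "$start" ! -newermt "$end" -print
    done
}

# Usage (placeholder timestamps, adjust to the time of the reboot):
# files_in_window "2025-01-01 10:00" "2025-01-01 11:00" /var/log/crash /var/log/dump
```

Anything it prints is worth copying off the box before log rotation or cleanup removes it.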

How long has the VS been stable? If it's been good, what changed in the environment?
As the others have said, start with the cphaprob commands to determine the status (I suspect this will give you a clue).

Matlu

Hello.
I'm running R81.20 with JHF Take 82.

[Expert@FWCP-AC:3]# cphaprob state

HA module not started.

[Expert@FWCP-AC:3]# last reboot
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:30 - 17:52 (02:22)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:22 - 17:52 (02:30)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:15 - 17:52 (02:36)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:06 - 17:52 (02:45)

wtmp begins Tue Mar 18 12:32:46 2025
[Expert@FWCP-AC:3]#

Greetings

the_rock

Can you make sure clustering is enabled via cpconfig? If it is, maybe try cphastop; cphastart

Andy

G_W_Albrecht

Do you have an SR# open with CP TAC already? I fear getting help here will not be so easy...

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Matlu

Hello,

Where can I review the logs for cluster events, for example when the cluster breaks, one of the members freezes, or there is an unexpected failover?
Is there a way to review these events for the last 24 hours?
Could you share the syntax, please?

Thanks for the help.

the_rock

Hey bro, it should be all in /var/log/messages

example:

grep -i DOWN /var/log/messages*

You can replace the word DOWN with anything else you wish to search for, e.g. ClusterXL, freeze, etc.
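
For the "last 24 hours" part, something like this might work. It's only a sketch: it assumes GNU date and that every log line begins with a timestamp in the form "Mar 18 14:27:50 2025". recent_lines is a made-up helper name, not a Check Point tool.

```shell
#!/bin/sh
# recent_lines HOURS NOW_EPOCH FILE...
# Prints lines whose leading timestamp lies within the last HOURS hours
# before NOW_EPOCH. Assumes GNU date and lines that start with
# "Mon DD HH:MM:SS YYYY"; lines without such a prefix are skipped.
recent_lines() {
    hours="$1"
    now="$2"
    shift 2
    cutoff=$(( now - hours * 3600 ))
    cat -- "$@" | while IFS= read -r line; do
        # Extract the leading timestamp, if the line has one
        stamp=$(printf '%s\n' "$line" | sed -nE 's/^([A-Z][a-z]{2} +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}).*/\1/p')
        [ -n "$stamp" ] || continue
        # Convert to epoch seconds; skip anything date cannot parse
        ts=$(date -d "$stamp" +%s 2>/dev/null) || continue
        if [ "$ts" -ge "$cutoff" ] && [ "$ts" -le "$now" ]; then
            printf '%s\n' "$line"
        fi
    done
}

# Usage: cluster-related lines from the last 24 hours
# recent_lines 24 "$(date +%s)" /var/log/messages* | grep -i clusterxl
```

It calls date once per line, so it is slow on big files; running grep first and filtering the result by time afterwards would be quicker.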

Matlu

Buddy,

The command you just gave me works very well, but is there a way to print only the “last 100 lines”?

As it is, the command prints everything.

the_rock

grep -i DOWN /var/log/messages* > /var/log/clusterissue.txt

cd /var/log

tail -50 clusterissue.txt
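
The same thing can probably be done without the temporary file by piping grep straight into tail. A small sketch (last_matches is just a name I made up):

```shell
#!/bin/sh
# last_matches COUNT PATTERN FILE...
# Case-insensitive search, keeping only the last COUNT matching lines.
last_matches() {
    count="$1"
    pattern="$2"
    shift 2
    grep -i -- "$pattern" "$@" | tail -n "$count"
}

# Usage: last 100 matches for "down"
# last_matches 100 down /var/log/messages*
```

Keep in mind the shell expands messages* in name order, so across rotated files the "last" lines are not strictly the newest ones.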

the_rock

Example from my lab:

Andy

***********************

[Expert@CP-FW-01:0]# grep -i DOWN /var/log/messages* > /var/log/clusterissue.txt; cd /var/log; tail -50 clusterissue.txt
/var/log/messages.10:Mar 18 14:27:50 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.10:Mar 18 14:27:51 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
/var/log/messages.3:Mar 19 02:30:10 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.3:Mar 19 02:30:12 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
/var/log/messages.6:Mar 18 20:29:01 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.6:Mar 18 20:29:01 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
[Expert@CP-FW-01:0]#
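
Note that every hit in this lab output is "Download", because -i DOWN also matches inside longer words. Adding -w limits grep to whole-word matches. A minimal sketch (word_matches is a made-up helper name):

```shell
#!/bin/sh
# word_matches WORD FILE...
# Case-insensitive search that matches WORD only as a whole word,
# so "down" no longer hits "Download".
word_matches() {
    word="$1"
    shift
    grep -iw -- "$word" "$@"
}

# Usage:
# word_matches down /var/log/messages*
```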

Alex-

Consider running an HCP check on your system as well.

AkosBakos

Before you restart your VS, arrange a maintenance window for the outage!

----------------
\m/_(>_<)_\m/

Matlu

Hi.

I rebooted the standby box of my VSX cluster, but when it came back up, the VS ID 3 cluster was still broken.

I entered VS ID 3 on the standby member and ran ‘cphastart’. The member immediately appeared in the cluster, but as DOWN.

I then ran ‘clusterXL_admin up’, but the member did not change status and remains DOWN.

It is very strange.

Is there any other way to recover this member so that it correctly joins the cluster for my VS 3?

Greetings.

the_rock

Hey bro,

I know this may sound weird, but a few times when I had this problem with customers, we had to reboot BOTH boxes to get it working.

Andy
