Re: My context HA environment broke in VSX

Matlu · ‎2025-03-18

Hello, everyone.

I don't have much experience with VSX and MDS environments, so hopefully you can help clear up a few doubts.

I currently have a problem in one of my contexts, and the HA of the Cluster has been lost.

If the “Standby” member of the context is lost, what can be the “most practical way” to recover its operation?

Should I still be able to access that member that appears as “Lost” by the CLI?

Can I check the root-cause of why the Cluster of my context was “broken”?

Thanks for your comments.

Duane_Toler · ‎2025-03-18

The first step is to log in to that second node and run "cphaprob stat" in VS0. Then switch to the affected VS (vsenv 3) and run "cphaprob stat" again, per VS. Start there and it should give you some hints.
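The two checks can be wrapped in a small helper. This is only a sketch: the function name check_vs_state is made up for illustration, and it assumes an Expert mode shell where vsenv and cphaprob are available and where vsenv returns success when the context switch works.

```shell
#!/bin/bash
# check_vs_state VSID...
# Hypothetical helper: for each Virtual System ID, switch context with vsenv
# and print the cluster state with "cphaprob stat".
check_vs_state() {
    local vsid
    if ! command -v vsenv >/dev/null 2>&1 || ! command -v cphaprob >/dev/null 2>&1; then
        echo "vsenv/cphaprob not found: run this in Expert mode on the VSX member" >&2
        return 1
    fi
    for vsid in "$@"; do
        echo "=== VS $vsid ==="
        # Assumption: vsenv exits with 0 when the context switch succeeds.
        vsenv "$vsid" >/dev/null && cphaprob stat
    done
}

# Example from this thread: VS0 first, then the affected VS 3.
# check_vs_state 0 3
```

Note that the shell stays in the context of the last VS ID passed in, so switch back with vsenv 0 afterwards if needed.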

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

the_rock · ‎2025-03-18

I don't know if all of the commands below work on VSX, but they are worth comparing.

Andy

cphaprob roles

cphaprob state

cphaprob -i list

cphaprob -l list

cphaprob syncstat
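When a member stays DOWN, the output of "cphaprob -l list" is usually the quickest way to see which critical device is reporting a problem. The filter below is a rough sketch: the name problem_devices is made up, and it assumes the output contains "Device Name:" and "Current state:" lines, which may differ between versions.

```shell
#!/bin/bash
# problem_devices
# Hypothetical filter: reads "cphaprob -l list" output on stdin and prints
# every device whose state is not OK.
# Assumption: each device block has "Device Name: <name>" and
# "Current state: <state>" lines.
problem_devices() {
    awk -F': *' '
        /^Device Name:/   { name = $2 }
        /^Current state:/ {
            state = $2
            sub(/[ \t\r]+$/, "", state)      # drop trailing whitespace
            if (state != "OK") print name ": " state
        }
    '
}

# On the gateway, inside the affected VS context:
# cphaprob -l list | problem_devices
```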

Best,
Andy
"Have a great day and if its not, change it"

Lesley · ‎2025-03-18

Are you able to push policy to the problem VS, and to VS0?

Is the issue still there if you restart the VS itself?

Connect to the command line on the VSX Gateway.
Go to the context of the Virtual System:
- In Gaia Clish, run:
  set virtual-system <VSID>
- In the Expert mode, run:
  vsenv <VSID>
Stop the Virtual System:
cpstop
Start the Virtual System:
cpstart


the_rock · ‎2025-03-18

That sounds very logical to me.

Andy


Matlu · ‎2025-03-18

Hello,

Initially the problem affected only one Virtual System, VS ID 3. A few hours later, however, the whole box (VS0) rebooted unexpectedly, for no apparent reason.

The device is up again, but “vsenv 3” is still not available in the cluster.

Are there any files that would point to the root cause of why a VS instance crashes like this?

Regards.

genisis__ · ‎2025-03-18

What version are you running and what Jumbo is installed?
What files are in /var/log/crash and /var/log/dump?
If you see files there that match when the node rebooted, pull them off and get a cpinfo run as soon as possible.

Get a TAC case raised to investigate these files (if there are any).

How long has the VS been stable? If it has been good, what changed in the environment?
As the others have said, start with the cphaprob commands to determine the status (I suspect this will give you a clue).
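To see quickly whether anything was written around the time of the reboot, the two directories can be listed by modification time. A minimal sketch, assuming GNU find (for -printf); the helper name recent_files and the two-day window are just illustrative choices.

```shell
#!/bin/bash
# recent_files DAYS DIR...
# Hypothetical helper: list regular files under the given directories that
# were modified within the last DAYS days, with timestamp and size,
# oldest first. Directories that do not exist are skipped.
recent_files() {
    local days=$1 dir
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # GNU find: print "date time  size  path" for each recent file
        find "$dir" -type f -mtime "-$days" \
            -printf '%TY-%Tm-%Td %TH:%TM  %s  %p\n'
    done | sort
}

# On the gateway:
# recent_files 2 /var/log/crash /var/log/dump
```

The timestamps can then be compared with the boot times shown by "last reboot".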

Matlu · ‎2025-03-18

Hello.
I am running R81.20 with JHF Take 82.

[Expert@FWCP-AC:3]# cphaprob state

HA module not started.

[Expert@FWCP-AC:3]# last reboot
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:30 - 17:52 (02:22)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:22 - 17:52 (02:30)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:15 - 17:52 (02:36)
reboot system boot 3.10.0-1160.15.2 Tue Mar 18 15:06 - 17:52 (02:45)

wtmp begins Tue Mar 18 12:32:46 2025
[Expert@FWCP-AC:3]#

Greetings

the_rock · ‎2025-03-18

Can you make sure clustering is enabled via cpconfig? If it is, maybe try cphastop; cphastart

Andy


G_W_Albrecht · ‎2025-03-19

Do you already have an SR open with Check Point TAC? I fear getting help here is not so easy...

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Matlu · ‎2025-03-18

Hello,

Where can I review the logs for cluster events, for example when the cluster breaks, one of the members freezes, or there is an unexpected failover?
Is there a way to review these events for the last 24 hours?
Could you share the syntax, please?

Thanks for the help.

the_rock · ‎2025-03-18

Hey bro, it should be all in /var/log/messages

example:

grep -i DOWN /var/log/messages*

You can replace the word DOWN with anything else you wish to search for, e.g. clusterXL, freeze, etc.
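To narrow the search to roughly the last 24 hours, the matches can be filtered by date. This is only a sketch: the helper name recent_matches is made up, it assumes GNU date and that each log line starts with a syslog-style date such as "Mar 18", and it approximates "last 24 hours" as "yesterday and today".

```shell
#!/bin/bash
# recent_matches PATTERN FILE...
# Hypothetical helper: print lines matching PATTERN (extended regex,
# case-insensitive) whose timestamp starts with yesterday's or today's date.
# Assumption: lines begin with "Mon DD " as in classic syslog output.
recent_matches() {
    local pattern=$1 today yesterday
    shift
    today=$(date '+%b %e')
    yesterday=$(date -d yesterday '+%b %e')   # GNU date syntax
    # -h drops the file name prefix so the date is at the start of the line
    grep -ihE -- "$pattern" "$@" | grep -E "^($yesterday|$today) "
}

# On the gateway, searching for several keywords at once:
# recent_matches 'down|clusterxl|freeze' /var/log/messages*
```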


Matlu · ‎2025-03-18

Buddy,

The command you just passed me works very well, but is there a way to print only the “last 100 lines”?

Because the command prints the whole thing.

the_rock · ‎2025-03-18

grep -i DOWN /var/log/messages* > /var/log/clusterissue.txt

cd /var/log

tail -50 clusterissue.txt
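The intermediate file is optional: the grep output can be piped straight into tail, and the line count is just the number given to tail (100 for the last 100 lines). A minimal sketch; the helper name last_matches is made up for illustration.

```shell
#!/bin/bash
# last_matches PATTERN COUNT FILE...
# Hypothetical helper: print the last COUNT lines matching PATTERN
# (case-insensitive) across the given files, reading them in the order given.
last_matches() {
    local pattern=$1 count=$2
    shift 2
    # -h drops the file name prefix; tail keeps only the final COUNT matches
    grep -ih -- "$pattern" "$@" | tail -n "$count"
}

# On the gateway, last 100 matches, oldest rotated file first so that the
# newest entries come last:
# last_matches down 100 $(ls -tr /var/log/messages*)
```

The shell expands /var/log/messages* in name order (messages, messages.1, messages.10, messages.2, ...), so passing the files sorted by modification time keeps the "last" lines the most recent ones.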


the_rock · ‎2025-03-19

example in my lab.

Andy

***********************

[Expert@CP-FW-01:0]# grep -i DOWN /var/log/messages* > /var/log/clusterissue.txt; cd /var/log; tail -50 clusterissue.txt
/var/log/messages.10:Mar 18 14:27:50 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.10:Mar 18 14:27:51 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
/var/log/messages.3:Mar 19 02:30:10 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.3:Mar 19 02:30:12 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
/var/log/messages.6:Mar 18 20:29:01 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Contacting the Download Center
/var/log/messages.6:Mar 18 20:29:01 2025 CP-FW-01 xpand[6967]: admin localhost t +installer:update_status_message Received 148 results from the Download Center
[Expert@CP-FW-01:0]#


Alex- · ‎2025-03-19

Consider running an HCP check on your system as well.

AkosBakos · ‎2025-03-19

Before you restart your VS arrange a maintenance window for the outage!

----------------
\m/_(>_<)_\m/

Matlu · ‎2025-03-19

Hi.

I rebooted the standby box of my VSX cluster, but when it came back up, the cluster for VS ID 3 was still broken.

I entered VS ID 3 on the standby member and ran ‘cphastart’. The member immediately appeared in the cluster, but in the DOWN state.

I then ran ‘clusterXL_admin up’, but the member did not change state and is still DOWN.

It is very strange.

Is there any other way to recover this member so that it correctly joins the cluster for my VS 3?

Greetings.

the_rock · ‎2025-03-19

Hey bro,

I know this may sound weird, but the few times I had this problem with customers, we had to reboot BOTH boxes to get it working.

Andy

