Hi all!
We had a full crash on both VSX gateways of a 2 node VSX cluster.
Versions (SMS R81.20 / VSX gateways R80.40)
We managed to restore the first node using vsx_util reconfigure, ending up with a working cluster with a single working node.
We tried to restore the second one using the same method, but just after the vsx_util reconfigure command finished (so the gateway was set into VSX mode and received the configuration and virtual systems via push), many communications started to fail.
Checking the first node's status with "cphaprob state" showed that 2 out of 4 virtual devices were in Standby mode. So supposedly the 2nd node, which was still in the process of being restored (there were tasks still to be done: reboot, configure the license, configure local.arp, enable dynamic objects, install policies...), had 2 virtual devices in Active state.
We tried "cphaprob state" and "clusterxl_admin down" to force a failover, but these commands did not show any output, and nothing changed in the status of the virtual devices on the 1st node. Disconnecting interfaces on the 2nd node didn't change anything either.
Shutting down this 2nd node made the first node become active for all virtual systems.
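For reference, a minimal sketch of how a per-VS state check and forced failover is normally done on VSX: clusterXL_admin and cphaprob act on the current VS context, so you switch into that context first (VSID 1 below is only an illustrative value).
vsenv 1                # switch to the context of the affected virtual system
cphaprob state         # show this member's cluster state for that VS
clusterXL_admin down   # set this member administratively down for the VS, forcing a failover
clusterXL_admin up     # bring it back once the peer is stable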
- Why did the node become active for the virtual devices while it was still not fully restored?
- Is there any way to avoid this behaviour?
- What would be the correct procedure?
Thanks all!
Is the reason for the initial crash understood and resolved?
Which JHF is the cluster using? (R80.40 is EOL)
The VSX recovery procedure is outlined in sk101515.
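For context, a rough outline of how that recovery is typically driven; the exact steps and prompts are in sk101515 and depend on the environment, so this is only a sketch:
# On the Security Management Server (Expert mode); vsx_util is interactive
vsx_util reconfigure    # select the VSX cluster object and the member to rebuild;
                        # the VSX configuration and virtual systems are then pushed to it
# On the rebuilt member afterwards
cphaprob state          # check cluster membership per VS
fw vsx stat -v          # check that all virtual systems are loaded and have a policy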
Sounds like you did everything right. Other than checking the logs to see if there is anything obvious, I would definitely open a TAC case to see if they can provide a reason.
Andy
Is the reason for the initial crash understood and resolved? Which JHF is the cluster using?
It happened while trying to recover a deleted virtual system, so presumably it will not crash again unless we repeat the same actions.
The VSX recovery procedure is outlined in sk101515.
Thanks, that was just what I needed.
Just one doubt: step 10 shows how to prevent the cluster member from becoming active before the reconfigure ends, by using cphastop, cphaconf... After rebooting, does it require any command to make it become active again?
Thanks!
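As a rough sketch of the post-reboot check, assuming the member was frozen with clusterXL_admin's persistent flag rather than only cphastop (after a plain cphastop, cluster services normally come back on their own at boot):
cphaprob state          # see whether this member rejoined the cluster after the reboot
clusterXL_admin up -p   # only needed if the member was previously set down with the persistent (-p) flag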
Thanks Chris! Sorry for the late response.
The sk101515 procedure went flawlessly.
I would only add another step: after the vsx_util reconfigure and before the reboot, I would set the virtual devices down (with the persistence flag) using clusterXL_admin.
Then reboot, connect the cables, perform the pushes, install the required policies and do the remaining configuration (e.g. for certain configurations, internet access is a prerequisite).
After that, the different virtual devices can be brought back and checked one by one via clusterXL_admin, as sketched below.
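A minimal sketch of that extra step, assuming clusterXL_admin's persistent (-p) flag and using VSID 2 as an illustrative virtual system; repeat per VS on the member being restored:
vsenv 2                  # switch to the context of the virtual system being frozen
clusterXL_admin down -p  # set this member administratively down for the VS, persisting across the reboot
# ... reboot, cabling, policy installation, remaining configuration ...
vsenv 2
clusterXL_admin up -p    # clear the persistent admin-down state once the VS is ready
cphaprob state           # confirm the member rejoins as Standby (or Active) as expected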
Thanks again!
Do you have VSLS in use? I can imagine that as soon as the second member's VS goes to Standby, VSLS kicks in and makes sure the load is distributed 1:1 between the nodes.
Did you check the output of "cphaprob -a if"? Maybe one member had a different number of required interfaces, which in many cases triggers an unplanned failover.
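For reference, a quick set of standard checks to run on both members (per VS context) and compare:
cphaprob state    # member and per-VS cluster state
cphaprob -a if    # required interfaces and their status for the current VS context
fw vsx stat -v    # list of virtual systems and whether each has its policy installed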
My cluster is Active/Standby, not VSLS.
With the sk101515 procedure that Chris_Atkinson mentioned, everything went OK.
I would suggest opening an SR# with TAC to get the issue resolved! Are you aware that version R80.40 has been out of support since April 2024?
Sorry for the late response.
Yes, I was aware, but first of all I wanted to restore the cluster without adding extra factors.
Thanks
Always configure Active Up before an operation like this.
Your cluster config was probably Primary Up.
In the end, sk101515 gives the steps needed to avoid the problem of the second node becoming active unexpectedly during the reinstall.
Thanks