Problems restoring the 2nd node of a VSX cluster

Wipeout

Hi all!

We had a full crash on both VSX gateways of a 2-node VSX cluster.
Versions: SMS R81.20, VSX gateways R80.40.

We managed to restore the first node using vsx_util reconfigure, which gave us a working cluster with a single working node.

We tried to restore the second one using the same method, but just after the vsx_util reconfigure command finished (so the gateway had been set into VSX mode and had received the configuration and virtual systems via push), many communications started to fail.

Checking the first node's status with "cphaprob state" showed that 2 out of 4 virtual devices were in standby mode. So presumably the 2nd node, which was still in the process of being restored (there were tasks still to be done: reboot, configure the license, configure local.arp, enable dynamic objects, install policies...), had 2 virtual devices in Active state.

We tried "cphaprob state" and "clusterxl_admin down" to force a failover, but these commands did not show any output and nothing changed in the status of the virtual devices on the 1st node. Disconnecting interfaces on the 2nd node didn't change anything either.
Shutting down the 2nd node made the first node the active one for all virtual systems.

- Why did the node become active for the virtual devices while still not fully restored?
- Is there any way to avoid this behaviour?
- What would be the correct procedure?

Thanks all!

the_rock

Sounds like you did everything right. Other than checking the logs to see if there is anything obvious, I would definitely open a TAC case to see if they can provide a reason.

Andy

Chris_Atkinson

Is the reason for the initial crash understood and resolved?

Which JHF is the cluster using? (R80.40 is EOL)

The VSX recovery procedure is outlined in sk101515.

CCSM R77/R80/ELITE

Wipeout_

Is the reason for the initial crash understood and resolved? Which JHF is the cluster using?

It happened while trying to recover a deleted virtual system, so presumably it will not crash again unless the same actions are repeated.

The VSX recovery procedure is outlined in sk101515.

Thanks, that was just what I needed.
Just one question: step 10 shows how to prevent the cluster member from becoming active before the reconfigure ends by using cphastop, cphaconf... After rebooting, does it require any command to make it become active?

Thanks!

JozkoMrkvicka

Do you have VSLS in use? I can imagine that as soon as the second member's VS is in standby, VSLS will kick in and make sure the load is split 1:1 between the nodes.
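As a rough illustration of that idea, here is a toy model in Python. It is not Check Point's actual VSLS algorithm; the function `distribute` and the names `node1`, `node2`, `VS1`..`VS4` are hypothetical. It only shows why a second member joining a 4-VS cluster could end up active for 2 of the 4 virtual systems if the load is balanced evenly.

```python
def distribute(virtual_systems, members):
    """Toy round-robin: pick an active member for each virtual system in turn.

    Assumption: an even split across available members. The real VSLS
    logic (priorities, weights, member state) is not modelled here.
    """
    return {vs: members[i % len(members)] for i, vs in enumerate(virtual_systems)}


virtual_systems = ["VS1", "VS2", "VS3", "VS4"]

# Only the first node is restored: it is active for everything.
one_member = distribute(virtual_systems, ["node1"])

# The second node joins: half of the virtual systems move to it.
two_members = distribute(virtual_systems, ["node1", "node2"])

print(one_member)
print(two_members)
```

In this sketch the second member takes over two virtual systems as soon as it is counted as available, regardless of whether its license, local.arp or policy are in place yet.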

Did you check the output of "cphaprob -a if"? Maybe one member had a different number of required interfaces, which in many cases triggers an unplanned failover.

Kind regards,
Jozko Mrkvicka

G_W_Albrecht

I would suggest opening an SR with TAC to get the issue resolved! You are aware that version R80.40 has been out of support since April 2024?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist
