Re: VSX + VSLS Patching: Disaster Recovery when things go wrong?

StackCap43382

In a situation where there is a VSX cluster (VSLS) and one of the members fails, resulting in an RMA, what is the established process for introducing the replacement?

In a normal ClusterXL HA cluster you just ISO/Blink the RMA unit, restore from a backup or snapshot, and push policy. Even with no backup it's easy to rebuild with information from the other member and the cluster object.

For VSX I might be overthinking it, but recovering VSX from a snapshot seems overly simplistic.

VSX supports Gaia snapshots, and as per sk98068, snapshots can be restored onto the RMA unit as long as it's the same appliance model:
https://support.checkpoint.com/results/sk/sk98068

So my question is: for VSX + VSLS, if a member fails, can it just be recovered from the snapshot and have policy pushed to it like a normal firewall, or do we need to remove the old member and add the new member via vsx_util?
https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_VSX_AdminGuide/Topics-VSXG/W...

Given that the majority of the configuration is pushed from the manager, and only the bonds and management interface need to be configured prior to adding a new member, either recovery option seems valid.

Regards.

EDIT:

How to back up and restore VSX gateway
https://support.checkpoint.com/results/sk/sk100395

"The only exception is if the configuration of the VSX Gateway / VSX cluster object was not changed since the backup file was collected from the Management Server."

So if the Gaia snapshot was taken before the failure and the management configuration has not been changed since, the snapshot is a valid method.
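
The condition above can be written down as a small check. This is only a sketch: the function name and its three epoch-timestamp arguments are made up for illustration, and finding out when the VSX object was last changed on the management server is left to the operator.

```shell
#!/bin/bash
# Sketch only: decides whether a Gaia snapshot is still a restore candidate.
# All three arguments are Unix epoch seconds supplied by the operator;
# nothing here queries the gateway or the management server.
snapshot_usable() {
    local snapshot_epoch=$1      # when the snapshot was taken
    local failure_epoch=$2       # when the member failed
    local last_change_epoch=$3   # last change to the VSX object on management

    # The snapshot has to predate the failure...
    [ "$snapshot_epoch" -lt "$failure_epoch" ] || return 1
    # ...and the management-side VSX configuration must not have changed since.
    [ "$last_change_epoch" -le "$snapshot_epoch" ] || return 1
    return 0
}
```

A non-zero return means the vsx_util route is the one to plan for.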

CCSME, CCTE, CCME, CCVS

Magnus-Holmberg

Say the cluster has 4 members and one has failed.
I would remove the failed member and then add the new box when it arrives, because the cluster is easier to work with when it does not contain a dead member (vsx_util remove_member, then add_member).

https://www.youtube.com/c/MagnusHolmberg-NetSec

Bob_Zimmerman

For an RMA replacing a failed member, I would generally use vsx_util reconfigure rather than remove and add.

On the new member, you need to complete the first-time wizard or build the system with config_system, install the relevant jumbo, build all the bonds, and build the routing needed to reach the management server from VS0 (the routing is normally pretty easy, since VS0 is normally used only for management). Once the member is provisioned (either with reconfigure or add), you need to apply any dynamic routing configuration in the VSs. You can use this script to dump the clish configuration of every VS on one of your working members and on your replacement member, then use diff to find anything missing:

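# Start with an empty output file named after this member.
echo "" >/tmp/$(hostname).clish.txt
# Collect VS IDs from network namespaces where they exist, and from /proc/vrf otherwise.
vsids=$(ip netns list 2>/dev/null | cut -d" " -f3 | cut -d")" -f1 | sort -n;ls /proc/vrf/ 2>/dev/null | sort -n)
for vsid in $vsids;do
    # Build a clish batch file that switches to this VS and dumps its configuration.
    echo "set virtual-system $vsid" >/tmp/script.clish
    echo "show configuration" >>/tmp/script.clish

    # Run the batch file and strip clish status noise from the output.
    clish -if /tmp/script.clish \
    | sed -E "s/^Processing .+?\r//g" \
    | egrep -v "^NMINST0079" \
    | egrep -v "^Done\. *$" \
    >/tmp/clishConfig.txt

    # Generate a sed script that replaces each of this member's own IPv4
    # interface addresses with a "#<interface> IPv4#" label.
    grep "set interface" /tmp/clishConfig.txt \
    | grep ipv4-address \
    | sed -E "s/set interface ([^ ]+) ipv4-address ([^ ]+) mask.*$/s@ \2( .*|$)@ \#\1 IPv4\#\\\1@g/" \
    >/tmp/sedScript.txt

    # Same for IPv6 interface addresses.
    grep "set interface" /tmp/clishConfig.txt \
    | grep ipv6-address \
    | sed -E "s/set interface ([^ ]+) ipv6-address ([^ ]+) mask.*$/s@ \2( .*|$)@ \#\1 IPv6\#\\\1@g/" \
    >>/tmp/sedScript.txt

    # Same for the local address of numbered VTIs.
    grep "type numbered local" /tmp/clishConfig.txt \
    | sed -E "s/add vpn tunnel ([^ ]+) type numbered local ([^ ]+) .*$/s@ \2( .*|$)@ \#VTI \1 IPv4\#\\\1@g/" \
    >>/tmp/sedScript.txt

    # Apply the labels, drop lines that always differ between members,
    # sort so that diff is order-independent, and append to the output file.
    sed -Ef /tmp/sedScript.txt /tmp/clishConfig.txt \
    | grep -v "set hostname " \
    | grep -v "password-hash " \
    | grep -v " Configuration of " \
    | grep -v " Exported by admin on " \
    | grep -v "Config lock is owned by " \
    | grep -v "add ssh hba ipv4-address" \
    | sort \
    >>/tmp/$(hostname).clish.txt
    # Separator between VSs.
    echo "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-" \
    >>/tmp/$(hostname).clish.txt
done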
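
For the comparison step, a minimal sketch along these lines may help. It assumes both dump files have been copied to one host; the function name and the hostnames in the example are hypothetical.

```shell
#!/bin/bash
# Sketch: print lines present in the working member's dump but absent from
# the replacement's dump. File names are passed in; none are assumed.
missing_on_replacement() {
    local working_dump=$1
    local replacement_dump=$2
    # In diff's default output, lines starting with "< " exist only in the first file.
    diff "$working_dump" "$replacement_dump" | sed -n 's/^< //p'
}
# Example (hypothetical hostnames vsx2 = working member, vsx1 = replacement):
#   missing_on_replacement /tmp/vsx2.clish.txt /tmp/vsx1.clish.txt
```

Anything it prints is configuration the working member has and the replacement does not, which is where the dynamic routing gaps should show up.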

Chris_Atkinson

Please also review sk101515 for reference

CCSM R77/R80/ELITE

Magnus-Holmberg

Remember licenses if you have more than 25 VSs on the cluster, because the license a new install comes with by default covers 25 VSs.
If you have more than that, the new member needs local licenses before you add it to the cluster; otherwise it will fail during reconfigure / add member.
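
A quick way to see which side of that limit a cluster is on, as a sketch only: the function name is made up, the default of 25 is taken from the post above, and since it is not stated whether VS0 counts toward the limit, every ID passed in is counted.

```shell
#!/bin/bash
# Sketch: reads one VS ID per line on stdin and reports whether the count
# exceeds a limit (first argument, default 25 as quoted in the post above).
vs_license_check() {
    local limit=${1:-25}
    local count
    # Count lines that contain a digit; "|| true" keeps a zero count from
    # being treated as an error.
    count=$(grep -c '[0-9]' || true)
    if [ "$count" -gt "$limit" ]; then
        echo "license needed: $count VSs exceeds $limit"
        return 1
    fi
    echo "ok: $count VSs within $limit"
}
# Example: feed it the VS ID list gathered on a working member, one per line:
#   printf '%s\n' $vsids | vs_license_check
```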

https://www.youtube.com/c/MagnusHolmberg-NetSec

Wolfgang

@StackCap43382 if you changed anything in the VSX configuration (interfaces, switches, routing, etc.) after one of the members failed, you can't use the snapshot for the restore. This is because the VSX configuration is always a mix of configuration from the SMS and the gateway (with VSnext this crucial limitation will be gone). We did such a repair after a restore from an older snapshot, but it was a very long-running process with TAC and deep digging through a lot of scripts 😞

As @Magnus-Holmberg wrote, vsx_util does the job and is the recommended, fastest and safest way.

StackCap43382

It was my understanding that you can't make any changes to the VS gateway object or the VSX cluster object, otherwise the database on the MGMT and the recovered GW would differ, resulting in a provisioning error.

However, if a hard change freeze was put in place as soon as the issue with the upgrade/patching/general failure occurred, the snapshot would still be viable.

But it seems the general consensus here is that the best thing to do is use vsx_util to remove the old failed member and then add the new one as a new member.

A long time ago we had an issue where someone deleted an existing HA cluster from the manager, then re-created the cluster and members with the same names and details as before (but with a new version).

This caused all kinds of issues as there were still references to objects that didn't exist or now had the wrong UID.

When it comes to re-introducing the RMA unit into the cluster, assuming that the old member was removed first via vsx_util, is it best to choose a new name or can the old one be reused?

E.g.

Cluster of two VSX members:

VSX1 & VSX2

VSX1 fails; the replacement is named VSX01 to make sure there are no residual conflicting details.

CCSME, CCTE, CCME, CCVS

Martijn

Hi,

When you receive the new hardware (RMA), make sure it is identical to the current hardware: interface modules, memory, etc.
Run the First Time Wizard and configure only the management IP, hostname, DNS, NTP and default gateway. These must be the same as on the failed hardware. Do not forget to remove the alias interface with IP 192.168.1.1.

Install the same jumbo hotfix on the new appliance as is installed on the cluster.

Connect the management interface of the new hardware to the network and, on the management server, run the 'vsx_util reconfigure' tool. If you have bond interfaces, make sure they are configured before you run the 'vsx_util reconfigure' tool.
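
One way to double-check the bonds before running the tool is to compare the bonding groups in the saved clish configuration of a working member with those on the replacement. A minimal sketch, assuming bond definitions appear in "show configuration" output as lines of the form "add bonding group <id>"; the function name is hypothetical.

```shell
#!/bin/bash
# Sketch: list bonding group IDs found in a saved "show configuration" file.
# Assumes each bond is declared on its own line as "add bonding group <id>";
# lines that add slave interfaces or set bond options are ignored.
list_bond_groups() {
    local config_file=$1
    # [[:space:]]* tolerates trailing whitespace or a carriage return.
    sed -nE 's/^add bonding group ([0-9]+)[[:space:]]*$/\1/p' "$config_file" | sort -n
}
# Example comparison (file names are placeholders):
#   diff <(list_bond_groups working.clish.txt) <(list_bond_groups replacement.clish.txt)
```

An empty diff only says the same group IDs exist on both boxes; slave interfaces and bond mode still need a look by eye.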

The tool will reconfigure the cluster member and create the VSs. But you need to check some custom files on the hardware, for example local.arp, fwkern.conf, etc.

All of this is described in sk101515 - How to Reconfigure a VSX Cluster member
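
For the custom files, a checksum listing taken on a working member and on the replacement makes differences easy to spot. A minimal sketch: the function name is made up, and the file paths are deliberately left as arguments because they are not given in this thread.

```shell
#!/bin/bash
# Sketch: print "<path> <crc> <size>" for each file given, or "<path> MISSING"
# if it does not exist. Which files to check, and their full paths, are for
# the operator to supply; nothing is assumed here.
checksum_custom_files() {
    local path
    for path in "$@"; do
        if [ -f "$path" ]; then
            # cksum reading from stdin prints the CRC and size without a file name.
            echo "$path $(cksum <"$path")"
        else
            echo "$path MISSING"
        fi
    done
}
# Run it with the same path list on both members, save the output to a file
# on each, and diff the two files.
```

Some of these files are expected to differ between members, so a mismatch is a prompt to look at the file, not proof of a problem.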

Good luck!

Martijn
