scottikon
Contributor

Troubleshooting Azure HA cluster failover and the API call

We are deploying a new cluster for a customer and wanted to test failover. I tested this previously in a test Azure account and it worked.

I built another test environment today and I am seeing the same symptoms as the customer.

Everything seems to deploy fine: we can establish SIC with the management server, install policy, etc. However, when we fail over, either by running clusterXL_admin down or by powering off the active gateway, a failover is triggered within Check Point (cphaprob stat on the secondary gateway shows it is now active), but the cluster-vip IP still shows in Azure on the other gateway. It has not moved across to the second gateway.

This suggests to me that either the gateway isn't triggering the API call, or the API call is triggered but not actioned. How do we troubleshoot this?

I was hoping to get some help from the community before going through TAC, because you have to jump through the initial hoops before you reach someone who knows cloud.

Thanks

Scott

3 Replies
PhoneBoy
Admin

I'd start with running $FWDIR/scripts/azure_ha_test.py and see what it says.

scottikon
Contributor

So the output I get is:

Image version is: harry_main-294-801-GW
Reading configuration file...
Setting api versions for "ha" solution
ARM versions are: {
"resources": "?api-version=2019-07-01"
}
Error:
The hostname xxxxfw002 should be either 'xxxxfw01' or 'xxxxfw02'
[Expert@xxxxfw002:0]#

 

What is it comparing the hostname to: the name in SmartConsole or the name in Azure?

It must be Azure, as I have checked SmartConsole and the fw002 object name matches the fw002 hostname on GAIA.

JanVC
Collaborator

Yes, it is checking the name of the VM in the Azure portal.

If you deployed the ARM template and then manually changed the hostname, you're in for some fun changes to the azure_ha_test.py and azure_had.py scripts on the gateways.

This is the part of the script where it looks for the hardcoded cluster_name + '1' as the name of the first member:

    if conf['hostname'] not in {cluster_name + '1', cluster_name + '2'}:
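
To see why the hostname in the error output above is rejected, here is a minimal standalone sketch of that membership test. Only the set comparison comes from the script line quoted above; the function names are hypothetical, and the cluster name 'xxxxfw0' is inferred from the expected names printed in the error message, not read from any real configuration.

```python
def expected_member_names(cluster_name):
    """Return the two VM names the quoted check accepts for a cluster name."""
    return {cluster_name + '1', cluster_name + '2'}


def hostname_matches(hostname, cluster_name):
    """Mirror the quoted test: the hostname must be cluster name + '1' or '2'."""
    return hostname in expected_member_names(cluster_name)


if __name__ == '__main__':
    # Assumption: the cluster name is 'xxxxfw0', inferred from the error text
    # "should be either 'xxxxfw01' or 'xxxxfw02'".
    cluster_name = 'xxxxfw0'
    for hostname in ('xxxxfw01', 'xxxxfw02', 'xxxxfw002'):
        status = 'ok' if hostname_matches(hostname, cluster_name) else 'rejected'
        print(hostname, status)
```

With these values, 'xxxxfw002' is rejected because it is neither of the two accepted names, which is consistent with the error shown earlier in the thread.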

Please also check:

https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_CloudGuard_IaaS_HighAvailability_for_...

It explains manual testing without executing a failover.

And the important part about the naming convention (because of the hardcoded scripts):
Naming Constraints

Do not change the name of any resources.

Cluster Members VM names must match the Cluster name with a suffix of '1' and '2'.

Network Interface names must match the Cluster Member VM names with a suffix of '-eth0' and '-eth1'.

The IP address of the cluster has to match the configuration file.

By default it should match the cluster name.
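
The constraints above can be turned into a quick pre-flight check. This is only a sketch under stated assumptions: the function names are hypothetical, the flat list of names stands in for whatever you copy out of the Azure portal, and treating the cluster IP resource name as equal to the cluster name is my reading of the last two lines of the guide excerpt.

```python
def expected_resource_names(cluster_name):
    """Derive the resource names implied by the naming constraints."""
    # VM names: cluster name with a suffix of '1' and '2'.
    vms = [cluster_name + suffix for suffix in ('1', '2')]
    # NIC names: each VM name with a suffix of '-eth0' and '-eth1'.
    nics = [vm + eth for vm in vms for eth in ('-eth0', '-eth1')]
    return {
        'vms': vms,
        'nics': nics,
        # Assumption: by default the cluster IP resource is named after the cluster.
        'cluster_ip': cluster_name,
    }


def missing_resources(cluster_name, actual_names):
    """Return the expected names that are absent from actual_names, sorted."""
    expected = expected_resource_names(cluster_name)
    wanted = set(expected['vms']) | set(expected['nics']) | {expected['cluster_ip']}
    return sorted(wanted - set(actual_names))


if __name__ == '__main__':
    # Hypothetical names as they might appear after a manual rename.
    names_in_portal = ['xxxxfw001', 'xxxxfw002']
    for name in missing_resources('xxxxfw0', names_in_portal):
        print('missing:', name)
```

Anything it reports as missing is a name the hardcoded scripts would presumably expect but not find, so it is a reasonable place to start looking before editing the scripts themselves.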
