Attila_Bakos
Participant

Why doesn't Check Point fail over without manual intervention in Azure?

I have set up a Check Point cluster in Azure using the new template (with cluster-vip), and it seems that during failover Check Point is not able to move the VIP until I manually release it from the load balancer. Manual release means I disassociate it from the machine. As soon as I do, the failover continues and finishes successfully.

Without the load balancer everything works fine.

Please advise. Is this possibly a bug in the Python script?

17 Replies
Neil_ZInk
Collaborator

Have you installed the vSEC Controller on your management server?

What do you get when you run $FWDIR/scripts/azure_ha_test.py?

Please note: it takes 3-5 minutes for UDRs to get updated.

You can reference sk110194 for the step-by-step.

Attila_Bakos
Participant

Dear Neil,

We deployed the cluster from the Azure Check Point cluster template based on the reference you provided.

We do not use the vSEC Controller, and it wasn't a requirement in the reference you provided.

Failover modifies the internal and external routes as well. When it tries to move the cluster-vip from NODE1 to NODE2, it fails because the load balancer NAT rule locks the cluster-vip.

We could easily reproduce this with a fresh install.

Please advise.

Thank you.

Neil_ZInk
Collaborator

What do you get when you run

# $FWDIR/scripts/azure_ha_test.py

Attila_Bakos
Participant

Sorry, I missed answering that:

All tests were successful!

Neil_ZInk
Collaborator

Can you verify that the front-end load balancer is pointing to the public IP of each gateway rather than to the VIP?

Attila_Bakos
Participant

It is connected to the cluster-vip as stated in the guide. I cannot connect it to anything other than the private IPs on the public side.

Neil_ZInk
Collaborator

I have two clusters set up the same way. This is my understanding of how it works (I could be wrong):

The cluster VIP is only used for communication with the management server, not for actual traffic flow.

Front-end IP -> points to FW1's public IP and FW2's public IP (not the VIP)

Inbound NAT rules on the load balancer:

Load balancer IP -> service points to the active member's front-end private IP

In the cluster policy you need to create manual NAT rules for each of the front-end IPs to translate to the internal load balancer/server.

On failover, the management API:

1. changes the NAT rule to the new active member's front-end internal IP

2. changes the UDR default route to the new active member's back-end IP

Dan_Morris
Employee

Hi Neil,

Your cluster flow is correct for the old version. There are two versions of the Azure cloud deployment; the version you have described is the older template. To clarify, the main reason for the VIP in the old template is VPN. It can also be used for NATting traffic leaving the environment.

The two SKs referenced for this:

New Azure Cluster template:
Solution Title: Deploying a Check Point Cluster in Microsoft Azure
Solution ID: sk110194
Solution Link: https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

Old Azure Cluster template:

Solution Title: Deploying a Check Point Cluster in Microsoft Azure - for templates older than 20180301
Solution ID: sk122793
Solution Link: https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

Thank you,

Dan Morris, Technology Leader, Ottawa Technical Assistance Center

Randall_Norris
Explorer

Hi

You're not alone! I've got the same issue on a new-template gateway deployed around 6-Apr-2018. I'm working through support at both CP and MS on this. I've been able to reproduce it on US East with fresh deploys several times in my test subscriptions.

I followed sk110194 for the build/deploy.

All azure_ha_test.py elements pass.

As soon as an inbound NAT rule is added on the load balancer, failover hangs: the cluster-vip never moves off the NIC and the load balancer is never updated.

Delete the NAT rule, and failover completes.

Attila_Bakos
Participant

Hi,

Thank you for your reply. I assumed it's a bug on the CP side, but it puzzles me how this was not found by CP before release.

Please let me know the outcome of your support case, as I cannot go forward with my build and tests and my deadline is close.

I might have to choose a different vendor.

Dan_Morris
Employee

Hi Attila and Randall,

Can you confirm from $FWDIR/log/azure_had.elg whether you are getting any error such as "RequestException: HTTP/1.1 400 Bad Request"?

In the Azure portal, are you seeing an error such as "Microsoft.Network/networkInterfaces/write","Failed","Error"?

If so, this issue is currently under investigation. The problem is related to the API call being made to the Azure portal. We are unsure at this time what may have changed, but it is being investigated.

I would recommend opening a support ticket for this issue. If you have one, please e-mail the SR number to me. My e-mail address is dmorris@checkpoint.com.

Thank you,

Dan Morris, Technology Leader, Ottawa Technical Assistance Center

Jonathan_Lebowi
Employee Alumnus

Hi all, at least until the issue is resolved, here's a decent workaround for allowing automatic failover that also supports inbound traffic:

The steps are:
1) Create a "basic" external load balancer (the name and resource group are not relevant here; the failover script will not change anything on this LB. This also works if you use an App Gateway) with a static public IP address as its frontend
2) On the LB, create a backend pool with the private IP addresses of eth0 of the cluster members
3) On the LB, create a TCP health check on the port of the service you want to pass through the cluster (e.g., 8090)
4) On the LB, create a load-balancing rule (not a NAT rule!) for the service you want to allow inbound

5) In SmartConsole, create an access rule allowing traffic on this port to the private eth0 IP addresses of the members
6) In SmartConsole, for each cluster member, create a NAT rule from that port on eth0 to the native port on the application server
         a. At minimum, you should source-NAT the LB health probes (globally originating from 168.63.129.16) to the address on eth1 of the respective member
         b. If you don't mind stateless failover, you can source-NAT everything (hide NAT) to the address of eth1

With this setup I verified that the UDR and cluster IP address do move automatically (it takes 1-2 minutes), and at least new connections (inbound and outbound) succeed after the failover.
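The source-NAT decision in step 6 can be sketched as a small predicate. The function name and the member addresses are illustrative; only the probe source address 168.63.129.16 comes from the steps above:

```python
import ipaddress

# Azure load balancer health probes always originate from this
# well-known address (per step 6a above).
AZURE_PROBE_IP = ipaddress.ip_address("168.63.129.16")

def nat_source(src_ip, member_eth1_ip, hide_all=False):
    """Return the post-NAT source address per step 6: probe traffic
    (and, with hide_all, all traffic) is hidden behind the member's
    eth1 address; other sources are preserved."""
    if hide_all or ipaddress.ip_address(src_ip) == AZURE_PROBE_IP:
        return member_eth1_ip
    return src_ip

print(nat_source("168.63.129.16", "10.0.3.4"))  # 10.0.3.4 (probe hidden behind eth1)
print(nat_source("203.0.113.7", "10.0.3.4"))    # 203.0.113.7 (client IP preserved)
```

The trade-off in step 6b maps to the `hide_all` flag: hiding everything behind eth1 keeps replies symmetric through the active member, at the cost of stateless failover.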

HTH

Attila_Bakos
Participant

I have created this, and as a workaround it's fine, but I do not see this solution in any documentation.

Also, this only works with TCP.

I have opened the following:

Service Request # 3-0150691191

russell_perera
Participant

Hi Attila,

Have you heard anything back on your SR? I have the same problem in my dev environment too.

Yonatan_Philip
Employee Alumnus

Hello Russell,

It looks like sk125435 was created for this issue.

Cause

Microsoft Azure has changed the API permissions that are used for updating a public IP address. The API calls made from the azure_had.py script are no longer able to update the public IP address to point to the new active member.

Solution

This issue will be addressed in a new Microsoft Azure deployment template. The new template version will be released within 1-2 weeks as of April 23rd, 2018.

If an immediate fix is required, contact Check Point Support and a Support Engineer will assist with a workaround for this issue.

HTH

Yonatan 

Dan_Morris
Employee

Just updated the SK. This is resolved in template version 20180417.

Thank you,


Dan Morris

Thiago_Bujnowsk
Explorer

Hello Dan!

I've got a similar problem. Everything works until I create a NAT rule on the load balancer, but instead of getting permission errors in the logs, I get the same 400 Bad Request error, though the message is:

2018-11-26 02:53:33,777-AZURE-CP-HA-INFO- Traceback (most recent call last):
  File "/opt/CPsuite-R80/fw1/scripts/azure_had.py", line 557, in poll
    setLocalActive()
  File "/opt/CPsuite-R80/fw1/scripts/azure_had.py", line 535, in setLocalActive
    todo |= set_cluster_ips()
  File "/opt/CPsuite-R80/fw1/scripts/azure_had.py", line 391, in set_cluster_ips
    body=json.dumps(peer_nic))[1]
  File "/opt/CPsuite-R80/fw1/scripts/rest.py", line 503, in arm
    max_time=self.max_time)
  File "/opt/CPsuite-R80/fw1/scripts/rest.py", line 136, in request
    headers['proto'], headers['code'], headers['reason'], response)
RequestException: HTTP/1.1 400 Bad Request
{
  "error": {
    "code": "InvalidResourceReference",
    "message": "Resource /subscriptions/ddf46c4a-8920-403a-8e11-8561e1a7b7e9/resourceGroups/SECFW/providers/Microsoft.Network/networkInterfaces/SECFW1-eth0/ipConfigurations/cluster-vip referenced by resource /subscriptions/ddf46c4a-8920-403a-8e11-8561e1a7b7e9/resourceGroups/SECFW/providers/Microsoft.Network/virtualNetworks/Transit_VNET/subnets/WAN was not found. Please make sure that the referenced resource exists, and that both resources are in the same region.",
    "details": []
  }
}
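For what it's worth, an InvalidResourceReference body like the one above can be parsed to pinpoint exactly which resource Azure claims is missing. This is a quick diagnostic sketch, not part of any Check Point script; the error body below is shortened for illustration:

```python
import json

def missing_resource(body):
    """Pull the 'not found' resource path out of an
    InvalidResourceReference error body returned by Azure."""
    err = json.loads(body)["error"]
    if err.get("code") != "InvalidResourceReference":
        return None
    # Message format: "Resource <path> referenced by resource <other> was not found. ..."
    return err["message"].split()[1]

body = '''{"error": {"code": "InvalidResourceReference",
  "message": "Resource /sub/x/ipConfigurations/cluster-vip referenced by resource /sub/y was not found.",
  "details": []}}'''
print(missing_resource(body))  # /sub/x/ipConfigurations/cluster-vip
```

In the traceback above, the missing resource is the cluster-vip ipConfiguration on SECFW1-eth0, which suggests checking whether that ipConfiguration still exists on the NIC after the failed move.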

I've tried everything and I'm out of ideas.

Thanks in advance,

Thiago Bujnowski
