Alex_Fray
Participant

Azure-based vSEC R80.10 Cluster - Secondary node issue

Hi, I have deployed an R80.10 Check Point cluster into Microsoft Azure. ClusterXL is working (active/standby) and I can manage and push policies to both cluster nodes (inbound connectivity is OK).

However, when running the Azure test script that checks connectivity to Azure (needed to make UDR and cluster IP changes), the secondary node can't resolve DNS; the primary node works fine. If I ping 8.8.8.8, for example, I get no response, as if the node has no outbound Internet connectivity at all, not just a DNS issue. This is very odd, because I can manage the cluster nodes and ClusterXL is working. But since the secondary node has no outbound connectivity, failover is not working, and the node can't contact checkpoint.com to get its contract status, so it's complaining about licensing. Any ideas?

Output from the secondary node, where the test fails, is below.
[Expert@vsec-node-2]# $FWDIR/scripts/azure_ha_test.py
Image version is: ogu_GAR1-289
Reading configuration file...
Testing if DNS is configured...
 - Primary DNS server is: 8.8.8.8
Testing if DNS is working...
Error:
Failed to resolve login.windows.net

!

[Expert@vsec-node-2]# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2001ms

12 Replies
Alex_Fray
Participant

For anyone else who experiences this issue: the cause was that the secondary node was NATting its own traffic behind the cluster IP address. The cluster IP was assigned to the primary node, so asymmetric routing was occurring.
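
One way to see this from the node itself (a rough sketch; the interface name and target IP are illustrative) is to watch which source address the node's own traffic leaves with:

# From expert mode on the secondary node: ping 8.8.8.8 in one session and,
# in another, watch the source IP of the outgoing ICMP packets. If they
# leave with the cluster IP instead of the member's own address, the
# replies return to the primary node and the connection breaks.
tcpdump -ni eth0 icmp and host 8.8.8.8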

The solution was a "no NAT" rule on both vSEC nodes so that traffic originating from a node is hidden behind its own public IP address rather than the cluster IP (a sketch via the management API follows). I've not had to do this on my R77.30 vSECs, so it looks like a missing step in the R80.10 vSEC guide.
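
For reference, a manual no-NAT rule can also be added through the R80.x management API. This is only a sketch, run on the management server; the package name, member object name, rule name, and exact field combination are assumptions:

# Hedged sketch: keep each member's own traffic un-NATted ("Original"),
# so it leaves with the member's own IP rather than the cluster IP.
# Repeat with the second member's object, then install policy.
mgmt_cli login -r true > sid.txt
mgmt_cli -s sid.txt add nat-rule package "Standard" position top \
  name "No NAT for member's own traffic" \
  original-source "vsec-node-1" translated-source "Original"
mgmt_cli -s sid.txt publish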

Neville_Kuo
Advisor

Hi:

How much time does failover take? The SK says under 2 minutes, but I think that's too long.

Alex_Fray
Participant

Unfortunately, failover takes around 2 minutes.

ClusterXL itself fails over in seconds, but the API calls to Azure to change routes and move the cluster IP take their time. UDRs change pretty quickly, in fairness, but disassociating the cluster IP from the primary node and associating it to the secondary node is the main delay (roughly the operations sketched below).
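
To illustrate where the time goes, the failover script's work corresponds roughly to Azure CLI operations like these (a sketch only; the resource group, route table, route, NIC, and IP names are illustrative):

# Repoint a UDR at the new active member's internal IP.
az network route-table route update --resource-group CPClust_RG \
  --route-table-name backend-rt --name default-route \
  --next-hop-ip-address 10.0.1.5

# The slow part: detaching the cluster public IP from the old active
# member's ipconfig, which must complete before the IP can be attached
# to the new active member.
az network nic ip-config update --resource-group CPClust_RG \
  --nic-name CPClust1-eth0 --name cluster-vip --remove publicIpAddress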

Nikhil_Deshmukh
Contributor

I have observed that even after 30-40 minutes, the disassociation of the cluster IP from the primary node and its association to the secondary node has not happened after failover.

The cluster testing script, $FWDIR/scripts/azure_ha_test.py, reports "All tests were successful!"

Every check passes, as in:

1. DNS

2. login.windows.net:443

3. Interfaces

4. Credentials

5. Route Tables

6. Load Balancers

7. Azure interface configuration

Any suggestions?

Alex_Fray
Participant

What Check Point version are you running? Have you run the test script on both nodes? I've only had failover/failback issues on R77.30 Azure-based clusters; I've not had enough time with R80.10 yet. Typically this was because I had modified the inbound NAT rules on the Azure load balancer to include additional services such as SNMP (which bizarrely seemed to cause an issue), or because another Azure admin had restricted public IPs at the subscription level, so I couldn't associate the public IP with the secondary node.

You can manually associate the IP once it has been disassociated (roughly as sketched below). Not ideal, but if you've got a 30-40 minute outage and need to restore service, it is possible.
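
If you need to do it by hand, re-attaching the cluster's public IP to the secondary member looks roughly like this (a sketch; every resource name here is illustrative):

# Manually associate the cluster public IP with the standby member's
# NIC ipconfig to restore inbound service.
az network nic ip-config update --resource-group CPClust_RG \
  --nic-name CPClust2-eth0 --name cluster-vip \
  --public-ip-address cluster-pip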

Nikhil_Deshmukh
Contributor

Check Point R80.10.

The cluster testing script, $FWDIR/scripts/azure_ha_test.py, reports "All tests were successful!" on both nodes.

The inbound NAT rules are not getting updated, and the disassociation of the cluster IP from the primary node and its association to the secondary node is not happening after failover.

Nikhil_Deshmukh
Contributor

I have done some troubleshooting, as follows:

1. Removed all LB inbound NAT rules and performed a failover.

Result: Success! UDRs get updated and point at M2 as the next hop. The cluster VIP is disassociated from M1 and associated to M2 automatically.

2. Added one inbound NAT rule on the LB (of the kind sketched below) and performed a failover.

Result: Failure! UDRs get updated and point at M2 as the next hop, but the cluster VIP does not move to M2.
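
For reference, the single rule used in step 2 was of the kind sketched here (the LB name, rule name, ports, and target ipconfig are illustrative):

# Create one inbound NAT rule on the load balancer...
az network lb inbound-nat-rule create --resource-group CPClust_RG \
  --lb-name frontend-lb --name ssh-to-active \
  --protocol Tcp --frontend-port 2222 --backend-port 22

# ...and bind it to the active member's NIC ipconfig.
az network nic ip-config inbound-nat-rule add --resource-group CPClust_RG \
  --nic-name CPClust1-eth0 --ip-config-name ipconfig1 \
  --lb-name frontend-lb --inbound-nat-rule ssh-to-active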

In the Azure activity log we can see:

  • Operation name
    Write NetworkInterfaces FAILED
  • Time stamp
    Fri Apr 27 2018 13:08:43 GMT+0530 (India Standard Time)
  • Event initiated by
    check-point-cluster-ha-failover
  • Error code
    InvalidResourceReference
  • Message
    Resource /subscriptions/d3e8c785-de15-4ba5-8afb-953e277061a2/resourceGroups/CPClust_RG/providers/Microsoft.Network/networkInterfaces/CPClust1-eth0/ipConfigurations/cluster-vip referenced by resource /subscriptions/d3e8c785-de15-4ba5-8afb-953e277061a2/resourceGroups/CPClust_RG/providers/Microsoft.Network/virtualNetworks/VNET01/subnets/Frontend was not found. Please make sure that the referenced resource exists, and that both resources are in the same region.

Has anyone encountered this error? I'm facing the same issue in two deployments.
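
The message suggests the Frontend subnet still references an ipConfiguration that no longer exists on the NIC. One way to compare the two sides (a sketch; only the resource group, VNet, subnet, and NIC names from the error are real):

# What the subnet believes is attached to it...
az network vnet subnet show --resource-group CPClust_RG \
  --vnet-name VNET01 --name Frontend \
  --query "ipConfigurations[].id" --output tsv

# ...versus the ipConfigurations that actually exist on the NIC.
az network nic show --resource-group CPClust_RG --name CPClust1-eth0 \
  --query "ipConfigurations[].name" --output tsv
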
Arnfinn_Strand
Employee

There is a new SK (sk125435) for this problem. It says that a new template will be published in a week or so.

There is a workaround that can be applied today with a fixed azure_ha.py, which can be requested from TAC.

Arnfinn

Nikhil_Deshmukh
Contributor

Thanks Arnfinn Strand!

I'm checking with TAC, but there's a bit of a delay on their side.

Do you have any source for the fixed azure_ha.py script?

Nikhil_Deshmukh
Contributor

Got the fixed script. Due to changes in the API permissions on Azure, the new script needs to be loaded on both cluster members.

Also, the inbound NAT rules need to point at the active member's private IP and not at the cluster VIP.

Following the above, two NAT rules need to be implemented in the Dashboard that receive the traffic on the member IPs of both cluster members (whichever is active) rather than on the cluster VIP (a sketch follows).
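
For illustration, the two Dashboard rules could be created through the management API roughly like this (a sketch only; the package, member objects, service, and server names are all assumptions):

# Hedged sketch: match inbound traffic on each member's own IP (not the
# cluster VIP) and translate it to the protected server behind the cluster.
mgmt_cli login -r true > sid.txt
mgmt_cli -s sid.txt add nat-rule package "Standard" position top \
  original-destination "vsec-node-1" original-service "https" \
  translated-destination "web-server" method static
mgmt_cli -s sid.txt add nat-rule package "Standard" position top \
  original-destination "vsec-node-2" original-service "https" \
  translated-destination "web-server" method static
mgmt_cli -s sid.txt publish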

Arnfinn_Strand
Employee

The only source I know of is TAC, sorry.

Neville_Kuo
Advisor

I think that's too slow for mission-critical services; I would rather suggest our customers use a larger VM size for Check Point.

