Azure-based vSEC R80.10 Cluster - Secondary node issue
Hi, I have deployed an R80.10 Check Point cluster into Microsoft Azure. ClusterXL is working (active/standby) and I can manage and push policies to both cluster nodes, so inbound connectivity is OK.
However, when running the Azure test script, which checks connectivity to Azure for making UDR and cluster IP changes, the secondary node can't resolve DNS. The primary node works fine. If I try to ping 22.214.171.124, for example, I get no response, as if the node has no outbound Internet connectivity at all, not just a DNS issue. This is very odd because I can manage the cluster nodes and ClusterXL is working. But because the secondary node has no outbound connectivity, failover is not working, and the node also can't contact checkpoint.com to get its contract status, so it's complaining about licensing. Any ideas?
Output from the secondary node, where the test fails, is below.
Image version is: ogu_GAR1-289
Reading configuration file...
Testing if DNS is configured...
- Primary DNS server is: 126.96.36.199
Testing if DNS is working...
Failed to resolve login.windows.net
[Expert@vsec-node-2]# ping 188.8.131.52
PING 188.8.131.52 (188.8.131.52) 56(84) bytes of data.
--- 188.8.131.52 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2001ms
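The failing checks above come down to two things: name resolution and raw outbound reachability. A minimal sketch of that diagnostic logic, not the actual azure_ha_test.py code; the resolver is injectable so the logic can be exercised without live DNS, and login.windows.net is the name the script tests:

```python
import socket

def check_dns(hostname, resolver=socket.gethostbyname):
    """Resolve hostname; return the IP string, or None on failure.

    socket.gaierror (raised on resolution failure) subclasses OSError.
    """
    try:
        return resolver(hostname)
    except OSError:
        return None

def dns_report(hostname, resolver=socket.gethostbyname):
    """Produce a pass/fail message in the style of the test script's output."""
    ip = check_dns(hostname, resolver)
    if ip is None:
        return "Failed to resolve %s" % hostname
    return "%s resolved to %s" % (hostname, ip)
```

On a healthy member, `dns_report("login.windows.net")` returns a resolved address; on the broken secondary it would produce the same "Failed to resolve" message seen above.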
For anyone else who experiences this issue: the cause was that the secondary node was hiding (NATing) its own traffic behind the cluster IP address. The cluster IP was assigned to the primary node, so asymmetric routing was occurring.
The solution was a "no NAT" rule on both vSEC nodes, so that traffic originating from a node itself is hidden behind its own public IP address rather than the cluster IP. I've not had to do this on my R77.30 vSECs, so it looks like a missing step in the R80.10 vSEC guide.
Unfortunately, failover takes around 2 minutes.
ClusterXL itself fails over in seconds, but the API calls to Azure that change routes and move the cluster IP take their time. The UDRs change pretty quickly, in fairness; the main delay is disassociating the cluster IP from the primary node and associating it with the secondary node.
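A hedged sketch of why that delay is visible, not the actual failover script: the cluster-IP move is an asynchronous Azure operation, so the newly active member effectively polls until the VIP ipconfig lands on its NIC or a timeout expires. The `get_vip_owner` callable, NIC names, and timings here are hypothetical:

```python
import time

def wait_for_vip(get_vip_owner, expected_nic, timeout_s=300, interval_s=5,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until expected_nic owns the cluster VIP ipconfig, or time out.

    get_vip_owner: callable returning the name of the NIC that currently
    holds the VIP ipconfig (in a real deployment this would query Azure).
    Returns True once ownership has moved, False if the timeout expires.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_vip_owner() == expected_nic:
            return True
        sleep(interval_s)
    return False
```

Every polling interval spent waiting on the Azure side adds directly to the observed failover time, which is why ClusterXL converges in seconds while end-to-end traffic failover takes minutes.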
I have observed that even after 30-40 minutes, the cluster IP has not been disassociated from the primary node and associated with the secondary node post-failover.
The cluster testing script $FWDIR/scripts/azure_ha_test.py reports "All tests were successful!"
Everything passes, including:
5. Route Tables
6. Load Balancers
7. Azure interface configuration
What Check Point version are you running? Have you run the test script on both nodes? I've only had failover/failback issues on R77.30 Azure-based clusters; I haven't had enough time with R80.10 yet. Typically this was either because I had modified the inbound NAT rules on the Azure load balancer to include additional services such as SNMP, which bizarrely seemed to cause an issue, or because another Azure admin had restricted public IPs at the subscription level, so I couldn't associate the public IP with the secondary node.
You can manually associate the IP once it has disassociated. Not ideal, but if you've got a 30-40 minute outage and need to restore service, it is possible.
The cluster testing script $FWDIR/scripts/azure_ha_test.py reports "All tests were successful!" on both nodes.
However, the inbound NAT rules are not getting updated.
I have done some troubleshooting as follows:
1. Removed all LB inbound NAT rules and performed a failover.
Result: Success! The UDRs get updated and point at M2 as the next hop. The cluster VIP gets disassociated from M1 and associated with M2 automatically.
2. Added one inbound NAT rule on the LB and performed a failover.
Result: Failure! The UDRs get updated and point at M2 as the next hop, but the cluster VIP does not move to M2.
In the Azure activity log we can see:
- Operation name: Write NetworkInterfaces FAILED
- Time stamp: Fri Apr 27 2018 13:08:43 GMT+0530 (India Standard Time)
- Event initiated by: check-point-cluster-ha-failover
- Error code: InvalidResourceReference
- Message: Resource /subscriptions/d3e8c785-de15-4ba5-8afb-953e277061a2/resourceGroups/CPClust_RG/providers/Microsoft.Network/networkInterfaces/CPClust1-eth0/ipConfigurations/cluster-vip referenced by resource /subscriptions/d3e8c785-de15-4ba5-8afb-953e277061a2/resourceGroups/CPClust_RG/providers/Microsoft.Network/virtualNetworks/VNET01/subnets/Frontend was not found. Please make sure that the referenced resource exists, and that both resources are in the same region.
Has anyone encountered this error? I'm facing the same issue in two deployments.
There is a new SK for this problem, sk125435. It says that a new template will be published in a week or so.
There is a workaround that can be applied today with a fixed azure_had.py, which can be requested from TAC.
I got the fixed script. Due to changes in the API permissions on Azure, the new script needs to be loaded on both cluster members.
Also, the inbound NAT rules need to point at the active member's private IP, not the cluster VIP.
Following the above, two NAT rules need to be implemented in SmartDashboard that accept the traffic on the member IPs of both cluster members (whichever is active), rather than on the cluster VIP.