the_rock
MVP Diamond

Traffic not working when failover is initiated on Azure cluster

Hey guys,

Wondering if someone might be able to assist with this. I worked with a customer last week to build a brand new R82 cluster in Azure. We connected it to their brand new S1C and that was fine, and built a basic policy to test. The weird thing is, when they test from an Azure VM, everything works fine while fw1 is active, but as soon as we fail over to fw2, nothing works: no ping to Google DNS, no Internet.

cphaprob state shows fine on both members... we even rebooted member 2, no dice. Oddly enough, no matter what we change in the topology, it's the same issue, though anti-spoofing is disabled, as it should be. If we fail back to member 1, no issues.

Any idea why this happens? It's on R82 Jumbo Take 39.

Logs show that traffic is going through rule 3, as it should. Also, when fw2 is active, connections from the gateway itself are fine; it can access updates.checkpoint.com (as an example) and ping 8.8.8.8.

Appreciate the help, as always!


Best,
Andy
"Have a great day and if its not, change it"
the_rock
MVP Diamond

FWIW, I ran my query through MS Copilot AI and below is what it gave me. I verified the CP side and it looks good, but I sent the rest to the client to verify on the Azure side.

*******************

 

Hey — this symptom pattern in Azure HA is super telling:
  • Forwarded traffic from workload VMs fails only when member2 is Active
  • But member2 itself can browse/ping out just fine
That almost always means Azure is dropping transit/forwarded packets (or the ILB/UDR is still sending flows to the wrong place) rather than a pure policy issue.
Below are the top 3 root causes I’d chase, in the order I’d chase them, with exact checks and what “good” looks like.

1) Azure IP Forwarding is not enabled on all NICs of member2 (classic)
If Azure NIC “IP forwarding” is off, the VM can still originate traffic (so fw2 itself can ping 8.8.8.8), but Azure will not allow it to forward traffic for other VMs. This matches your symptoms perfectly.
Check Point’s own Azure HA tester explicitly validates: “IP forwarding is enabled on all network interfaces of the Cluster Member.” [sc1.checkpoint.com]
What to do (fast)
On each cluster member, run:
Shell
expert
$FWDIR/scripts/azure_ha_test.py
If you see something like:
  • IP forwarding is not enabled on Interface ...
...that's your smoking gun. [sc1.checkpoint.com]
In Azure Portal (member2)
For each NIC attached to fw2 (external + internal, and any additional):
  • Networking → NIC → IP forwarding → Enabled
If member1 was deployed/configured earlier (or via a different template pass), it’s not uncommon for member2 NIC settings to be missed.
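If you'd rather check this from the CLI than click through the portal, something like the below should work (the resource group and NIC names are placeholders for your deployment; the enableIPForwarding property name may vary slightly by CLI version):
Shell
# List the IP-forwarding state of every NIC in the resource group (RG name is a placeholder)
az network nic list -g my-cluster-rg --query "[].{nic:name, ipForwarding:enableIPForwarding}" -o table
# Enable it on any fw2 NIC that shows False (NIC name is a placeholder)
az network nic update -g my-cluster-rg -n fw2-eth0 --ip-forwarding true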

2) Internal LB health probe is wrong → ILB keeps sending flows to the Standby member
In Azure HA, load balancers are what steer inbound / east-west / outbound inspection, and they should forward only to the Active member.
If the probe is configured to “is the VM alive?” instead of “is the VM Active?”, the ILB can happily send workload traffic to the Standby gateway, which will drop it. 
[supportcen...kpoint.com][sc1.checkpoint.com]
Check Point documents that their Health Probe Agent (R81.10+) is designed so that only the Active member answers probes in an HA cluster.
For Azure, the probe source is 168.63.129.16, and CloudGuard commonly listens on TCP/8117 for health probing. 
[support.ch...kpoint.com] [support.ch...kpoint.com][learn.microsoft.com]
What to verify in Azure
In the Internal Load Balancer:
  • Health probe is TCP
  • Port is 8117
  • Backend pool contains both members’ relevant NICs
  • Your LB rule(s) reference that probe
Check Point’s Azure HA ATRG also calls out: ensure health probes using port 8117 are configured, otherwise gateways won’t respond correctly. [supportcen...kpoint.com]
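To eyeball the probe config from the CLI (the LB name and resource group below are placeholders):
Shell
# Show the health probes configured on the internal LB (expect TCP/8117)
az network lb probe list -g my-cluster-rg --lb-name my-internal-lb -o table
# Confirm the LB rules actually reference that probe
az network lb rule list -g my-cluster-rg --lb-name my-internal-lb --query "[].{rule:name, probe:probe.id}" -o table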
Quick on-gateway validation
On fw2 (when it is Active), confirm it is answering the probe:
Shell
expert
fw ctl zdebug -m cluster cloud
Check Point recommends this debug to validate the Health Probe handling. [support.ch...kpoint.com]
Also make sure nothing blocks probes from 168.63.129.16 (NSG or policy) — Azure marks instances down if probes can’t reach the expected listening port. [learn.microsoft.com][learn.microsoft.com]
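One more sanity check on the gateway itself, to confirm something is actually listening on the probe port and that probes are arriving (a rough sketch; netstat and tcpdump are available in expert mode):
Shell
# Confirm a listener on the health-probe port
netstat -anp | grep 8117
# Watch probe traffic arriving from Azure's probe source
tcpdump -ni any host 168.63.129.16 and port 8117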

3) Cluster VIP / egress IP attachment isn’t moving to member2 (AZURE_HAD / permissions)
In CloudGuard Azure HA, during failover the newly Active member associates the cluster private/public IPs to itself using Azure API calls.
If that mechanism is broken on member2 (permissions, AZURE_HAD down, API access blocked), you can end up with: [supportcen...kpoint.com]
  • fw2 becomes Active (ClusterXL shows fine)
  • fw2 can go out using its own interface IPs
  • but forwarded traffic that NATs behind the cluster egress IP/VIP gets dropped by Azure because that VIP is still attached to member1
The ATRG even notes expected outbound failover timing depends on LB probe + Azure API and can take longer than east-west.
But in your case it never recovers, so we’re looking for “VIP never moved”. 
[supportcen...kpoint.com]
What to check
On fw2:
Shell
expert
cpwd_admin getpid -name AZURE_HAD
You want a non-zero PID (daemon running). [supportcen...kpoint.com]
Then check logs:
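The path below is the Azure HA daemon's own log; exact messages vary, but permission/API errors from a failed failover attempt land here:
Shell
# Follow the Azure HA daemon log while reproducing a failover
tail -f $FWDIR/log/azure_had.log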
On the Azure side, during/after failover, verify:
  • The cluster private IP configuration (and public IP if used) is attached to fw2’s external NIC IP configurations
  • It is not still attached to fw1
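From the CLI, you can check which member's NIC the cluster IP configurations are attached to (NIC and resource group names are placeholders):
Shell
# After failover, the cluster IP config should show on fw2's external NIC, not fw1's
az network nic ip-config list -g my-cluster-rg --nic-name fw1-eth0 -o table
az network nic ip-config list -g my-cluster-rg --nic-name fw2-eth0 -o table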

A short, practical “do this now” runbook (10–15 minutes)
Step A — Prove whether traffic reaches fw2 and whether fw2 forwards it
When fw2 is Active, from a workload VM try:
  • ping 8.8.8.8
  • tracert 8.8.8.8 (Windows) / traceroute 8.8.8.8 (Linux)
On fw2:
Shell
expert
tcpdump -eni <inside_if> host <workload_vm_ip>
tcpdump -eni <outside_if> host 8.8.8.8
Interpretation:
  • See packets on inside but nothing on outside → forwarding/NAT path issue (often Azure IP forwarding or policy/NAT)
  • See packets on outside but no replies → SNAT/VIP attachment/return path issue
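If the inside capture shows packets but the outside one shows nothing, a quick way to see whether the Check Point kernel itself is dropping them (standard drop debug, filtered on your test VM's IP):
Shell
# Show kernel drops for the test VM's traffic while reproducing; Ctrl-C to stop
fw ctl zdebug + drop | grep <workload_vm_ip>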
Step B — Run the Check Point Azure HA tester on fw2
Shell
expert
$FWDIR/scripts/azure_ha_test.py
This script checks the exact Azure HA prerequisites, including IP forwarding, and is the fastest way to catch mismatches between members. [sc1.checkpoint.com]
Step C — Validate ILB probe correctness (see section 2 above)
Step D — Validate AZURE_HAD + VIP movement (see section 3 above)

Most likely answer (based on your exact symptoms)
If I had to bet: Azure NIC IP forwarding is disabled on fw2 (or on one of fw2’s NICs). It’s the cleanest explanation for “fw2 can reach the internet but doesn’t forward for other VMs,” and Check Point explicitly calls out verifying/enabling IP forwarding on all interfaces as part of Azure HA validation. [sc1.checkpoint.com]
Second most likely: ILB probe not using 8117 / not “Active-only” probing, so the ILB still sends flows to fw1 even when fw2 is Active. [support.ch...kpoint.com][supportcen...kpoint.com]

A couple quick questions (so I can pinpoint it faster)
  1. Are your workload subnets’ UDRs pointing to an Internal Load Balancer frontend IP as next hop, or directly to fw1’s internal IP?
  2. When fw2 is Active, does the cluster VIP / private cluster IP show attached to fw2’s NIC IP configurations in Azure?
If you paste the output of:
  • azure_ha_test.py (fw2)
  • cpwd_admin getpid -name AZURE_HAD
  • and a screenshot/text of the ILB probe config (protocol/port)
...I can tell you exactly which of the above it is.
Best,
Andy
"Have a great day and if its not, change it"
the_rock
MVP Diamond
Accepted Solution

Just a quick update. I opened a TAC case and the engineer told me that, based on the last few lines of the $FWDIR/log/azure_had.log file, the issue could be related to the resource group settings for the 2nd member, which makes sense to me, since that same error does not show up on the 1st member. The customer will verify and let me know.

Answer from TAC:

Based on the error I saw: this notification indicates that the Check Point solution requires the Azure "Contributor" role (or a more specific role such as "Network Contributor" or "Virtual Machine Contributor") to execute essential actions, including creating, updating, or deleting resources.

Why is this requirement in place?

The Contributor role enables the application to manage resources (rather than merely accessing them). For CloudGuard, this capability is crucial for activities like provisioning gateways, updating network security groups, and managing IP addresses. If the cluster lacks Contributor permissions, any failover actions necessary for maintaining customer traffic continuity will not succeed, leading to outages when a cluster failover occurs.

Missing permissions, such as not having Contributor rights on the VNet or relevant Azure resources, can directly cause customer traffic to fail during an Azure cluster failover with Check Point CloudGuard.
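For anyone checking the same thing from the CLI, the role assignment on the resource group can be verified like this (the identity name and resource group are placeholders for whatever your deployment uses):
Shell
# List role assignments for the cluster's identity at the resource-group scope
az role assignment list --assignee <cluster-identity> --resource-group my-cluster-rg -o table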

Best,
Andy
"Have a great day and if its not, change it"