the_rock
MVP Diamond

Traffic not working when failover is initiated on Azure cluster

Hey guys,

Wondering if someone might be able to assist with this. I worked with a customer last week to build a brand new R82 cluster in Azure. We connected it to their brand new S1C and that was fine, and built a basic policy to test. The weird thing is, when they test from an Azure VM, everything works fine while fw1 is active, but as soon as we fail over to fw2, nothing works: no ping to Google DNS, no Internet.

cphaprob state shows fine on both members... we even rebooted member 2, no dice. Oddly enough, no matter what we change in the topology, it's the same issue, though anti-spoofing is disabled, as it should be. If we fail back to member 1, no issues.

Any idea why this happens? It's on R82 Jumbo Take 39.

Logs show that traffic is going through rule 3, as it should. Also, when fw2 is active, connections from the gateway itself are fine; it can access updates.checkpoint.com (as an example) and ping 8.8.8.8.

Appreciate the help, as always!


Best,
Andy
"Have a great day and if its not, change it"
the_rock
MVP Diamond

FWIW, I ran my query through MS Copilot AI and below is what it gave me. I verified the CP side and it looks good, but I sent the rest to the client to verify on the Azure side.

*******************

 

Hey — this symptom pattern in Azure HA is super telling:
  • Forwarded traffic from workload VMs fails only when member2 is Active
  • But member2 itself can browse/ping out just fine
That almost always means Azure is dropping transit/forwarded packets (or the ILB/UDR is still sending flows to the wrong place) rather than a pure policy issue.
Below are the top 3 root causes I’d chase, in the order I’d chase them, with exact checks and what “good” looks like.

1) Azure IP Forwarding is not enabled on all NICs of member2 (classic)
If Azure NIC “IP forwarding” is off, the VM can still originate traffic (so fw2 itself can ping 8.8.8.8), but Azure will not allow it to forward traffic for other VMs. This matches your symptoms perfectly.
Check Point’s own Azure HA tester explicitly validates: “IP forwarding is enabled on all network interfaces of the Cluster Member.” [sc1.checkpoint.com]
What to do (fast)
On each cluster member, run:
Shell
expert
$FWDIR/scripts/azure_ha_test.py
If you see something like:
  • IP forwarding is not enabled on Interface ...
...that's your smoking gun. [sc1.checkpoint.com]
In Azure Portal (member2)
For each NIC attached to fw2 (external + internal, and any additional):
  • Networking → NIC → IP forwarding → Enabled
If member1 was deployed/configured earlier (or via a different template pass), it’s not uncommon for member2 NIC settings to be missed.
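If you'd rather check this from the CLI than click through the portal, something like the below should work (the resource group and NIC names are placeholders for your deployment; the enableIPForwarding property name may vary slightly by CLI version):
Shell
# List the IP-forwarding state of every NIC in the resource group (RG name is a placeholder)
az network nic list -g my-cluster-rg --query "[].{nic:name, ipForwarding:enableIPForwarding}" -o table
# Enable it on any fw2 NIC that shows False (NIC name is a placeholder)
az network nic update -g my-cluster-rg -n fw2-eth0 --ip-forwarding true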

2) Internal LB health probe is wrong → ILB keeps sending flows to the Standby member
In Azure HA, load balancers are what steer inbound / east-west / outbound inspection, and they should forward only to the Active member.
If the probe is configured to “is the VM alive?” instead of “is the VM Active?”, the ILB can happily send workload traffic to the Standby gateway, which will drop it. 
[supportcen...kpoint.com][sc1.checkpoint.com]
Check Point documents that their Health Probe Agent (R81.10+) is designed so that only the Active member answers probes in an HA cluster.
For Azure, the probe source is 168.63.129.16, and CloudGuard commonly listens on TCP/8117 for health probing. 
[support.ch...kpoint.com] [support.ch...kpoint.com][learn.microsoft.com]
What to verify in Azure
In the Internal Load Balancer:
  • Health probe is TCP
  • Port is 8117
  • Backend pool contains both members’ relevant NICs
  • Your LB rule(s) reference that probe
Check Point’s Azure HA ATRG also calls out: ensure health probes using port 8117 are configured, otherwise gateways won’t respond correctly. [supportcen...kpoint.com]
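To eyeball the probe config from the CLI (the LB name and resource group below are placeholders):
Shell
# Show the health probes configured on the internal LB (expect TCP/8117)
az network lb probe list -g my-cluster-rg --lb-name my-internal-lb -o table
# Confirm the LB rules actually reference that probe
az network lb rule list -g my-cluster-rg --lb-name my-internal-lb --query "[].{rule:name, probe:probe.id}" -o table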
Quick on-gateway validation
On fw2 (when it is Active), confirm it is answering the probe:
Shell
expert
fw ctl zdebug -m cluster cloud
Check Point recommends this debug to validate the Health Probe handling. [support.ch...kpoint.com]
Also make sure nothing blocks probes from 168.63.129.16 (NSG or policy) — Azure marks instances down if probes can’t reach the expected listening port. [learn.microsoft.com][learn.microsoft.com]
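One more sanity check on the gateway itself, to confirm something is actually listening on the probe port and that probes are arriving (a rough sketch; netstat and tcpdump are available in expert mode):
Shell
# Confirm a listener on the health-probe port
netstat -anp | grep 8117
# Watch probe traffic arriving from Azure's probe source
tcpdump -ni any host 168.63.129.16 and port 8117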

3) Cluster VIP / egress IP attachment isn’t moving to member2 (AZURE_HAD / permissions)
In CloudGuard Azure HA, during failover the newly Active member associates the cluster private/public IPs to itself using Azure API calls.
If that mechanism is broken on member2 (permissions, AZURE_HAD down, API access blocked), you can end up with: [supportcen...kpoint.com]
  • fw2 becomes Active (ClusterXL shows fine)
  • fw2 can go out using its own interface IPs
  • but forwarded traffic that NATs behind the cluster egress IP/VIP gets dropped by Azure because that VIP is still attached to member1
The ATRG even notes expected outbound failover timing depends on LB probe + Azure API and can take longer than east-west.
But in your case it never recovers, so we’re looking for “VIP never moved”. 
[supportcen...kpoint.com]
What to check
On fw2:
Shell
expert
cpwd_admin getpid -name AZURE_HAD
You want a non-zero PID (daemon running). [supportcen...kpoint.com]
Then check logs:
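The path below is the Azure HA daemon's own log; exact messages vary, but permission/API errors from a failed failover attempt land here:
Shell
# Follow the Azure HA daemon log while reproducing a failover
tail -f $FWDIR/log/azure_had.log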
On the Azure side, during/after failover, verify:
  • The cluster private IP configuration (and public IP if used) is attached to fw2’s external NIC IP configurations
  • It is not still attached to fw1
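From the CLI, you can check which member's NIC the cluster IP configurations are attached to (NIC and resource group names are placeholders):
Shell
# After failover, the cluster IP config should show on fw2's external NIC, not fw1's
az network nic ip-config list -g my-cluster-rg --nic-name fw1-eth0 -o table
az network nic ip-config list -g my-cluster-rg --nic-name fw2-eth0 -o table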

A short, practical “do this now” runbook (10–15 minutes)
Step A — Prove whether traffic reaches fw2 and whether fw2 forwards it
When fw2 is Active, from a workload VM try:
  • ping 8.8.8.8
  • tracert 8.8.8.8 (Windows) / traceroute 8.8.8.8 (Linux)
On fw2:
Shell
expert
tcpdump -eni <inside_if> host <workload_vm_ip>
tcpdump -eni <outside_if> host 8.8.8.8
Interpretation:
  • See packets on inside but nothing on outside → forwarding/NAT path issue (often Azure IP forwarding or policy/NAT)
  • See packets on outside but no replies → SNAT/VIP attachment/return path issue
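If the inside capture shows packets but the outside one shows nothing, a quick way to see whether the Check Point kernel itself is dropping them (standard drop debug, filtered on your test VM's IP):
Shell
# Show kernel drops for the test VM's traffic while reproducing; Ctrl-C to stop
fw ctl zdebug + drop | grep <workload_vm_ip>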
Step B — Run the Check Point Azure HA tester on fw2
Shell
expert
$FWDIR/scripts/azure_ha_test.py
This script checks the exact Azure HA prerequisites, including IP forwarding, and is the fastest way to catch mismatches between members. [sc1.checkpoint.com]
Step C — Validate ILB probe correctness (see section 2 above)
Step D — Validate AZURE_HAD + VIP movement (see section 3 above)

Most likely answer (based on your exact symptoms)
If I had to bet: Azure NIC IP forwarding is disabled on fw2 (or on one of fw2’s NICs). It’s the cleanest explanation for “fw2 can reach the internet but doesn’t forward for other VMs,” and Check Point explicitly calls out verifying/enabling IP forwarding on all interfaces as part of Azure HA validation. [sc1.checkpoint.com]
Second most likely: ILB probe not using 8117 / not “Active-only” probing, so the ILB still sends flows to fw1 even when fw2 is Active. [support.ch...kpoint.com][supportcen...kpoint.com]

A couple quick questions (so I can pinpoint it faster)
  1. Are your workload subnets’ UDRs pointing to an Internal Load Balancer frontend IP as next hop, or directly to fw1’s internal IP?
  2. When fw2 is Active, does the cluster VIP / private cluster IP show attached to fw2’s NIC IP configurations in Azure?
If you paste the output of:
  • azure_ha_test.py (fw2)
  • cpwd_admin getpid -name AZURE_HAD
  • and a screenshot/text of the ILB probe config (protocol/port)
...I can tell you exactly which of the above it is.
Best,
Andy
"Have a great day and if its not, change it"
the_rock
MVP Diamond
Accepted Solution

Just a quick update. I opened a TAC case and the engineer told me that, based on the last few lines of the $FWDIR/log/azure_had.log file, the issue could be related to the resource group settings for the 2nd member, which makes sense to me, since that same error does not show up on the 1st member. The customer will verify and let me know.

Answer from TAC:

Based on the error I saw: this notification indicates that the Check Point solution requires the Azure "Contributor" role (or a more specific role such as "Network Contributor" or "Virtual Machine Contributor") to execute essential actions, including creating, updating, or deleting resources.

Why is this requirement in place?

The Contributor role enables the application to manage resources (rather than merely accessing them). For CloudGuard, this capability is crucial for activities like provisioning gateways, updating network security groups, and managing IP addresses. If the cluster lacks Contributor permissions, any failover actions necessary for maintaining customer traffic continuity will not succeed, leading to outages when a cluster failover occurs.

Missing permissions, such as not having Contributor rights on the VNet or relevant Azure resources, can directly cause customer traffic to fail during an Azure cluster failover with Check Point CloudGuard.
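For anyone checking the same thing from the CLI, the role assignment on the resource group can be verified like this (the identity name and resource group are placeholders for whatever your deployment uses):
Shell
# List role assignments for the cluster's identity at the resource-group scope
az role assignment list --assignee <cluster-identity> --resource-group my-cluster-rg -o table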

Best,
Andy
"Have a great day and if its not, change it"