Timur135653
Explorer

Technical clarification: ClusterXL HA failover behavior and ICMP source IP

Hello CheckMates community,

I am currently labbing a Check Point ClusterXL High Availability setup and would like to get a deeper technical understanding of the observed failover behavior.

Environment:

  • Version: R82

  • Platform: Open Server (VMware)

  • Deployment: ClusterXL HA (Active/Standby)

Observed behavior:

  1. Packet Loss during Failover: When the Active node is powered off, I observe a loss of exactly one ICMP packet (ping 8.8.8.8) from a host behind the gateway.

  2. MAC Address Change: Looking at the Windows host ARP table, I see that the VIP (.100) updates its MAC address from Node A's physical MAC to Node B's physical MAC immediately after the failover.

  3. Traceroute Source IP: When performing a tracert, the first hop shows the physical IP of the active member (.10 or .20) instead of the Cluster VIP (.100).

My questions:

  1. Is the 1-packet drop considered expected behavior for a standard G-ARP based failover on Open Servers?

  2. Does Check Point's design require the VIP to use the members' physical MACs by default, and is VMAC the only way to achieve "zero-drop" failover in this environment?

  3. Why does the gateway respond with its physical IP for ICMP "Time Exceeded" messages instead of the VIP, and is there a technical reason for this default behavior?

I’ve attached screenshots of my SmartConsole topology, cphaprob stat output, and the host's CMD logs for reference.

Looking forward to your insights!

3 Replies
israelfds95
MVP Gold

1 - Yes, losing a single ICMP packet during failover is expected in standard ClusterXL HA using G-ARP.
2 - Yes, VMAC is the way to solve and improve this situation.
3 - In this scenario, the gateway member itself replies from its physical IP rather than the VIP when it generates ICMP messages such as "Time Exceeded" in a traceroute. This is the default behavior for most network devices, including Check Point.

Chris_Atkinson
MVP Platinum CHKP

Are you trying to replicate a physical environment or another virtual one?

Have you watched the cluster messages in detail and what is your ping interval/timeout?

VMAC has its benefits in some fault cases, but it does not solve this problem entirely. It overcomes an ARP learning challenge; it does not reduce the ClusterXL dead timeout in scenarios that trigger it.
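
To make the ARP learning point a bit more concrete, here is a rough sketch of what a gratuitous ARP announcement looks like on the wire, following the generic layout from RFC 826 / RFC 5227. The MAC and IP values are hypothetical placeholders (192.0.2.100 is a documentation address, not the VIP from this lab), and I am not claiming this is byte-for-byte what ClusterXL sends. It only shows why hosts relearn the VIP's MAC after a failover when VMAC is not in use.

```python
import socket
import struct


def build_garp(sender_mac: bytes, vip: str) -> bytes:
    """Build a generic gratuitous ARP announcement (Ethernet + ARP), 42 bytes."""
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth = broadcast + sender_mac + struct.pack("!H", 0x0806)
    ip = socket.inet_aton(vip)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, zeroed target MAC, and target IP equal to the sender IP
    arp += sender_mac + ip + b"\x00" * 6 + ip
    return eth + arp


# Hypothetical values: the new active member announces the VIP with its own MAC.
node_b_mac = bytes.fromhex("005056000002")
frame = build_garp(node_b_mac, "192.0.2.100")
print(len(frame), frame.hex())
```

Hosts that receive such a frame overwrite their ARP entry for the VIP with the new sender MAC. With VMAC the MAC stays the same across members, so that relearning step goes away, but the standby still has to detect that the active member is gone first.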

CCSM R77/R80/ELITE
Timothy_Hall
MVP Gold

In regard to point 1, see my CPX presentation, Be your own TAC Part Deux, which discusses expected traffic behavior after what I call a ClusterXL "catastrophic failover" vs. a "non-catastrophic failover". In your case, powering off the active member caused a catastrophic failover, and losing one ping packet is expected with the default ping response timeout of 4000 milliseconds.

However, the loss is not due to the use of G-ARP versus VMAC, and changing this setting will not have any effect on the issue. The loss is caused by what I call the "dead timer" (borrowing from OSPF here), which is usually about 2.5 seconds on the standby, but can be much longer (with significantly more packet loss) if the cluster is currently "under load" at the time of a catastrophic failover.  The dead timer can be shortened, but it may cause unnecessary failovers, which are quite a bit more impactful on certain inspected traffic than you might think. See the presentation as well for more details on this.
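
As a back-of-the-envelope illustration of why a roughly 2.5 second dead timer shows up as exactly one lost ping, here is a small simulation. It assumes a simplified model of a sequential pinger (the next request goes out only after a reply plus a 1 second gap, or after the 4000 ms timeout), which is my approximation of the Windows ping behavior and not something taken from the presentation. The fixed-interval variant is included only for contrast. All function and parameter names are made up for this sketch.

```python
def lost_sequential(outage_start, outage_len, timeout=4.0, gap=1.0,
                    rtt=0.01, count=20):
    """Count lost pings when each request waits for a reply or a timeout."""
    t, lost = 0.0, 0
    outage_end = outage_start + outage_len
    for _ in range(count):
        if outage_start <= t < outage_end:
            lost += 1          # request sent into the outage never gets a reply
            t += timeout       # pinger blocks until the timeout expires
        else:
            t += rtt + gap     # reply arrives, then the usual pause
    return lost


def lost_fixed_interval(outage_start, outage_len, interval=1.0, count=20):
    """Count lost pings when requests go out on a fixed schedule."""
    outage_end = outage_start + outage_len
    return sum(1 for i in range(count)
               if outage_start <= i * interval < outage_end)


# Assumed 2.5 s outage starting 5 s into the test
print(lost_sequential(5.0, 2.5))       # 1
print(lost_fixed_interval(5.0, 2.5))   # 3
# A much longer outage (for example, a cluster under load) loses more
print(lost_sequential(5.0, 10.0))      # 3
```

Under these assumptions, any outage shorter than the ping timeout costs a sequential pinger at most one packet, while a tool that sends on a fixed 1 second schedule would report more loss for the same outage. That is why the ping interval and timeout matter when comparing results.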

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization