Solved: Re: ClusterXL Failover issues

StevePearson · ‎2025-04-02

I'm looking into a strange failover issue with a pair of 5200 gateways. The config is as follows:

Gateways clustered using ClusterXL. A single port on each is configured as the WAN (so no bonding), and these are connected to a Cisco switch in layer 2 mode. The single internet router is also connected to this switch. A single port on each is configured for the LAN (so again no bonding), and these are connected to a Cisco switch in layer 3 mode (this is the network core switch). The gateways are running R81.20 Take 92, but this issue has persisted through several JHFs and was present even on the previous Gaia R80.30.

All internet traffic in and out flows through these gateways. There are several externally available services including several web servers and a third party remote access server. These are configured in what to me is an unusual way. There are 2 objects for each, one object has the internal address, the other has the external address, and there are a pair of manual NAT rules for each pair. Then, the external addresses are all added as alias addresses to the external interface in Gaia.

In normal operation with the primary up, everything works fine. However, when you fail over to the secondary, the web servers and the remote access service are not accessible from outside (although you can telnet to the RAS box!). Outbound traffic remains unaffected. Failing back resolves the issue. The issue appears and disappears almost instantly.

I suspect either NAT or ARP, however switching the cluster to virtual MAC and rebooting the internet router to clear the ARP table didn't have any effect.

I need to arrange some out-of-hours downtime to troubleshoot this, and to be on site, as failing over disconnects my remote session.

Any suggestions would be welcome 🙂

Thanks

Bob_Zimmerman · ‎2025-04-02

Instead of aliases, the additional addresses should be added to proxy ARP. In clish:

add arp proxy ipv4-address <IP address> interface <interface name> real-ipv4-address <interface address>

<IP address> is the address you want the firewall to claim to own. <interface name> is the interface you want to claim to own that address. <interface address> is the IP address of the interface. Remember to add the statement to both cluster members, and to change the value in <interface address> based on the real IP the interface has on both members.

Aliases don't interact with clustering, so things would get the real member MAC, which is why VMAC didn't help.
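
As a minimal sketch of the point about per-member values, the snippet below generates the clish lines for each cluster member from the command template given above. All addresses, the interface name `eth1`, and the member names `fw1`/`fw2` are hypothetical placeholders, not values from this thread. Only the `real-ipv4-address` changes between members.

```python
# Hypothetical values for illustration only; substitute your own.
PUBLISHED_IPS = ["203.0.113.10", "203.0.113.11"]  # addresses the cluster should answer ARP for
EXTERNAL_IF = "eth1"                               # assumed external interface name
MEMBER_REAL_IPS = {                                # real (non-cluster) external IP of each member
    "fw1": "203.0.113.2",
    "fw2": "203.0.113.3",
}


def proxy_arp_commands(published_ips, interface, real_ip):
    """Return the clish proxy ARP lines for a single cluster member."""
    return [
        f"add arp proxy ipv4-address {ip} interface {interface} "
        f"real-ipv4-address {real_ip}"
        for ip in published_ips
    ]


def commands_per_member(published_ips, interface, member_real_ips):
    """Map each member name to its own list of clish lines."""
    return {
        member: proxy_arp_commands(published_ips, interface, real_ip)
        for member, real_ip in member_real_ips.items()
    }


if __name__ == "__main__":
    per_member = commands_per_member(PUBLISHED_IPS, EXTERNAL_IF, MEMBER_REAL_IPS)
    for member, commands in per_member.items():
        print(f"# run on {member}")
        print("\n".join(commands))
```

The generated lines still have to be entered on each member by hand. On Gaia, clish changes are normally persisted with `save config`, and it is worth checking the vendor documentation for whether a policy install is also required before the entries take effect.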

StevePearson · ‎2025-04-02

Thanks for the detailed reply!

I was thinking of removing each pair of objects and their pair of manual NAT rules, then removing the alias, before creating a new object with the internal address and static NATing it to the required external IP. This is the way I would normally do it.

AkosBakos · ‎2025-04-02

Hi @StevePearson

Maybe you ran into a limitation:

The use of secondary IP addresses is not supported in ClusterXL or VRRP Cluster on Gaia OS

https://support.checkpoint.com/results/sk/sk89980

Use proxy ARP as @Bob_Zimmerman suggested; this is the widely accepted solution.

Akos

----------------
\m/_(>_<)_\m/

the_rock · ‎2025-04-02

I agree, sounds like proxy ARP to me as well, though I don't see many people having to do that in R81.20.

Andy

Best,
Andy
"Have a great day and if its not, change it"

StevePearson · ‎2025-04-02

Hi Andy,

I think it's because the policy dates back to at least R77, probably R65. It was originally a single box with on-box management; now it's a cluster with separate management. I think a full policy review is called for, and the 5200s need to be replaced soon, so clean rebuilds too.

the_rock · ‎2025-04-02

Ah, gotcha, makes sense then.

Andy

Bob_Zimmerman · ‎2025-04-02

That can also work, especially for environments without decades of legacy configuration cruft which needs to be maintained.

StevePearson · ‎2025-06-11

So I finally got back to site yesterday to have another look at this. I removed the 2 separate objects, manual NAT and the Alias entry from Gaia on both FW boxes. I added a new single object with Static NAT behind the required IP. Pushed the policy and tested to ensure the system worked as expected, which it did. I then failed over to FW2. This caused the service to fail immediately. Failed back and all ok again.

Checking the ARP table on the internet router, I see that the cluster IP is shown with the VMAC (as expected), but the IP for the service is showing the MAC of FW1, which is why this failed. What I don't understand is why this is not using the VMAC. Checking the ARP table again today, 10 hours or so later, I'm seeing the same entries, so it's not simply an ARP table refresh issue.
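
The comparison described above can be repeated quickly after each change with a small script. The following is a rough sketch, assuming the router's ARP output has been pasted in as text with one IP address and one MAC address per line. The sample table, the addresses, and the MAC values are hypothetical, and the parser only recognises colon, hyphen, or Cisco-style dotted MAC notation.

```python
import re

# One IPv4 address and one MAC address are expected per ARP line.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(
    r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"   # aa:bb:cc:dd:ee:ff or aa-bb-...
    r"|\b(?:[0-9A-Fa-f]{4}\.){2}[0-9A-Fa-f]{4}\b"    # aabb.ccdd.eeff
)

# Hypothetical pasted output; the MACs are made-up locally administered values.
SAMPLE = """\
Protocol  Address        Age (min)  Hardware Addr   Type  Interface
Internet  203.0.113.1    5          0200.0000.00aa  ARPA  Gi0/0
Internet  203.0.113.10   7          0200.0000.0001  ARPA  Gi0/0
Internet  203.0.113.11   7          0200.0000.00aa  ARPA  Gi0/0
"""


def normalize_mac(mac):
    """Lower-case the MAC and drop separators so notations compare equal."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def parse_arp_table(text):
    """Return {ip: normalized_mac} for every line holding both an IP and a MAC."""
    table = {}
    for line in text.splitlines():
        ip = IP_RE.search(line)
        mac = MAC_RE.search(line)
        if ip and mac:
            table[ip.group()] = normalize_mac(mac.group())
    return table


def not_on_cluster_mac(table, cluster_ip, service_ips):
    """List service IPs whose ARP entry differs from the cluster IP's MAC."""
    cluster_mac = table[cluster_ip]
    return [
        ip for ip in service_ips
        if ip in table and table[ip] != cluster_mac
    ]


if __name__ == "__main__":
    arp = parse_arp_table(SAMPLE)
    print(not_on_cluster_mac(arp, "203.0.113.1", ["203.0.113.10", "203.0.113.11"]))
```

Any address the script reports is one the router is still resolving to something other than the MAC it holds for the cluster IP, which matches the symptom seen here for the service address.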
