Solved: ClusterXL - standby cannot reach gateway

Richard_Scott1 · ‎2018-09-10

I've got an R77.30 cluster of two nodes (running on VMware).

The active node can ping the default gateway and onward to the rest of the network without any issue.

However, the standby node can't even ping the gateway, let alone anything beyond it. If I unload the policy from the node, it is able to ping it.

Logs suggest the traffic is being NATed to the cluster's address. The gateway can ping the active, standby and cluster addresses.

I've tried fw ctl set int fwha_forw_packet_to_not_active 1 on both nodes, but that didn't help.

The management interface is reachable via a different gateway (and static route).

Any suggestions greatly appreciated!

PhoneBoy · ‎2018-09-10

I'm pretty sure that's the wrong "fix" for the problem, reading this sk: ClusterXL drops traffic with 'dropped by fwha_forw_run Reason: Failed to send to another cluste...

But perhaps the debugging in this SK might be helpful in figuring out where the true problem is.

Richard_Scott1 · ‎2018-09-10

Thanks. I've tried the 'fw zdebug' command both with and without that "fix" (currently disabled).

Only error displayed is :-

;fw_log_drop_ex: Packet proto=-1 ?:0 -> ?:0 dropped by fwha_select_arp_packet Reason: CPHA replies to arp;
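When the zdebug output runs to many lines, it can help to tally drops by function and reason. The Python sketch below assumes every drop line has the same shape as the single line quoted above; the regex and the `parse_drop` / `tally` names are illustrative and not part of any Check Point tool.

```python
import re
from collections import Counter

# Pattern modelled only on the one drop line quoted in this thread;
# other zdebug line shapes are not handled (assumption).
DROP_RE = re.compile(
    r"fw_log_drop_ex: Packet proto=(?P<proto>-?\d+) "
    r"(?P<src>\S+) -> (?P<dst>\S+) "
    r"dropped by (?P<func>\w+) Reason: (?P<reason>[^;]+);"
)


def parse_drop(line):
    """Return the drop fields as a dict, or None if the line does not match."""
    m = DROP_RE.search(line)
    return m.groupdict() if m else None


def tally(lines):
    """Count drops per (function, reason) pair, skipping unmatched lines."""
    counts = Counter()
    for line in lines:
        d = parse_drop(line)
        if d:
            counts[(d["func"], d["reason"].strip())] += 1
    return counts


sample = (";fw_log_drop_ex: Packet proto=-1 ?:0 -> ?:0 dropped by "
          "fwha_select_arp_packet Reason: CPHA replies to arp;")
print(parse_drop(sample))
```

Feeding a saved capture through `tally` shows at a glance whether the ARP message is the only drop reason or whether other drops are hiding in the noise.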

PhoneBoy · ‎2018-09-10

There's at least one TAC case that mentions this error message, and the fix in that case was the kernel variable you already set.

Which suggests a TAC case might be in order for further troubleshooting.

Richard_Scott1 · ‎2018-09-10

ok - thanks. I'll get a case logged.

Maarten_Sjouw · ‎2018-09-10

One of the things with running clusters in VMware is that they run better with VRRP. Then you can also control this behaviour from the Dashboard.

Regards, Maarten

Richard_Scott1 · ‎2018-09-11

The decision not to use VRRP was taken by a Check Point engineer, not us. They migrated us from an old (R6x) physical cluster to the new R77.30 VMware-based one.

Support ticket has been responded to, asking for any error shown by zdebug drop... even though I'd included that in the original ticket.

I'm now waiting for them to set up a remote session..

Maarten_Sjouw · ‎2018-09-11

Yeah I know, when you open the ticket, you upload the cpinfo, first thing they ask for is? Yep a cpinfo.

Regards, Maarten

Richard_Scott1 · ‎2018-09-11

End of the working day for me here in the UK and no progress. Updated the ticket 8 hours ago with suitable timeslots for remote access, but not heard anything at all..

Maarten_Sjouw · ‎2018-09-11

If you have the opportunity, switch to VRRP and see what comes of it.

Be aware that in the VM switch you have to disable all security features of the ports connected to your FWs and make sure IGMP snooping is allowed.

If you do choose VRRP, do not use extended vMAC, just the standard vMAC mode.

As a last resort, check whether you need to change CCP from multicast to broadcast.

Regards, Maarten

Richard_Scott1 · ‎2018-09-12

It's a production firewall, so slightly hesitant to switch to VRRP yet..

Had a ticket update overnight, to say someone else is working it and asking me questions I've already answered 🙂

Richard_Scott1 · ‎2018-09-13

Not impressed with support at all.

Spent an hour on the phone and clearly explained that the problem affects all outgoing traffic from the standby node (specifically NTP, DNS and HTTPS) but the tech has focused solely on NTP and wants screenshots of the NTP configuration, diagnostics of NTP, etc.

Completely ignoring the basic fault that there's zero outbound connectivity to anything via the default gateway.

G_W_Albrecht · ‎2018-09-13

What I would want to know is the current business impact of this issue - are any TP updates not working on the standby node?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Richard_Scott1 · ‎2018-09-13

Immediate impact is that the standby can't get any Check Point updates or sync time. We're also looking at deploying the IPS blade onto it, but can't while the standby can get out of sync for updates, etc.

G_W_Albrecht · ‎2018-09-13

Do you already know sk43807 Anti-Virus / URL Filtering / IPS update fails on the Standby member of ClusterXL in High Ava...?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Networks_Winter · ‎2019-03-21

Hi,

We have the same issue where the secondary node is not able to reach the next hop gateway. Did you come to a resolution for your issue?

Daniel_Bourne · ‎2019-03-21

Hello! We are having the same issue, were you able to get this resolved?

Thanks,

Vincent_Van_Bru · ‎2019-06-26

Hello,
Was there any feedback from TAC on this?
Thanks.

Gregory_Link · ‎2019-09-25

We are having this exact same issue. Did anyone find a resolution? We have firewall appliances though and not VMs.

Ruan_Kotze · ‎2019-09-25

Hi,

Have a look at my thread over here: https://community.checkpoint.com/t5/Enterprise-Appliances-and-Gaia/Connectivity-issues-from-standby-...

My issue occurred after upgrading the environment from R80.30. We ended up having to follow step 4 in sk43807 (also mentioned by @G_W_Albrecht ).

Thanks,
Ruan

Gregory_Link · ‎2019-09-27

So, not exactly the same issue, as we are only running ClusterXL and not VRRP. Working with support and our Sales Engineer now, but will update this post with our results. Check Point currently thinks it's a network issue because it can see DNS requests going out but no replies on the standby.

Diego_dg · ‎2021-07-07

Hi! We are having the same issue (ARP dropped by CPHA), but on the active node, with the message:

;fw_log_drop_ex: Packet proto=-1 ?:0 -> ?:0 dropped by fwha_select_arp_packet Reason: CPHA replies to arp;

I have contacted TAC and have an SR, but this is an old installation on R77.30 (which we are going to upgrade soon, but in the meantime we need to fix this issue).

Did anyone find the cause of these drops? Best regards

Scott_Paisley · ‎2023-04-04

I know this thread is nearly 5 years old, but I don't see a solution, and we hit exactly the same issue

R81.10 machines running on ESXi VM hosts, secondary can't ping the gateway unless the policy is unloaded. Gateway management traffic works fine, probably because it doesn't pass through the policy.

The standby box actually tries to pass external traffic through the active box using the sync connection, which is designed behaviour I believe.

My colleague found a setting on the vSwitch in ESX that seems to be causing the problem. Under policies, there is a setting for 'Forged transmits'. The default is Reject. Setting it to Accept on the VLAN the Sync traffic uses seems to be working now.

The Check Point uses some kind of virtual MAC for that traffic that the vSwitch doesn't like, so it apparently drops it.
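For anyone who wants the vSwitch logic spelled out, here is a toy Python model of the check. As VMware documents it, Forged Transmits compares the source MAC of each frame a VM sends with the effective MAC of its virtual adapter, and drops mismatches when the policy is Reject. The function name and MAC addresses below are made up for illustration, and exactly which source MAC ClusterXL puts on the forwarded traffic is not confirmed in this thread.

```python
def vswitch_allows(frame_src_mac, vnic_effective_mac, forged_transmits="Reject"):
    """Simplified model of the vSwitch check on frames sent by a VM.

    Returns True if the frame is allowed out of the VM's port,
    False if the vSwitch would drop it.
    """
    if forged_transmits not in ("Accept", "Reject"):
        raise ValueError("forged_transmits must be 'Accept' or 'Reject'")
    if forged_transmits == "Accept":
        # Accept: the source MAC is not compared at all.
        return True
    # Reject: only frames sourced from the adapter's own MAC pass.
    return frame_src_mac.lower() == vnic_effective_mac.lower()


# Hypothetical addresses, not taken from any real cluster.
VNIC_MAC = "00:50:56:aa:bb:01"
OTHER_MAC = "00:1c:7f:00:00:fe"

print(vswitch_allows(VNIC_MAC, VNIC_MAC))             # adapter's own MAC
print(vswitch_allows(OTHER_MAC, VNIC_MAC))            # different source MAC, Reject
print(vswitch_allows(OTHER_MAC, VNIC_MAC, "Accept"))  # different source MAC, Accept
```

In this model the second call is the failing case described above: the frame carries a source MAC that is not the adapter's own, so a Reject policy drops it before it ever reaches the other cluster member.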

Chris_Atkinson · ‎2023-04-04

Note these settings for VMware are documented in sk101214.

CCSM R77/R80/ELITE

Scott_Paisley · ‎2023-09-07

Hi. Is there an equivalent for NSX-T? We just hit exactly the same issue, but we can't resolve it the same way as there is no access to the ESX vSwitch in this environment, only the NSX-T overlay settings.

Diego_dg · ‎2023-04-04

Hi, I remember that the message about fwha_select_arp_packet was the expected behaviour; in our case it seems that the issue was not related to Check Point.

JH_Ranger · ‎2023-11-05

Thanks Scott. I spent 2 weeks troubleshooting this issue.

I found it strange that when running tcpdump on the Sync interface, the ClusterXL control traffic (8116/UDP) was visible on both the standby and active firewalls, but DNS queries, HTTPS requests and other traffic were only seen on the standby (blocked by the VMware switch because they came from a different source MAC).

After setting the "Forged Transmits" to "Accept", everything works as expected.
