Solved: Re: Connectivity issues from standby gateway after...

Ruan_Kotze · ‎2019-09-11

Good day,

I have recently completed an upgrade from R80.10 to R80.30 (Management + 2 gateways in HA cluster). The upgrade itself was successful but I have noticed one issue on the standby gateway. We cannot ping or do NSlookups etc from the standby node. License checks also fails on this node.

What I have attempted thus far:

Set the "fw ctl set int fwha_forw_packet_to_not_active 1" on both gateways
Followed the guidance in sk147093 (fw ctl zdebug output matched that in the SK, as per below, IP sanitised)

121670435;[cpu_1];[SIM-207375815];update_tcp_state: invalid state detected (current state: 0x10000, th_flags=0x10, cdir=0) -> dropping packet, conn: [<1.1.1.1,2022,2.2.2.2,88,6>][PPK0];
@;121670435;[cpu_1];[SIM-207375815];sim_pkt_send_drop_notification: (0,0) received drop, reason: general reason, conn:

It is important to note that all connectivity is restored when I do a fw unloadlocal. There has also been no changes to either NAT or firewall policies.

I've found a couple of posts on Checkmates describing similiar issue, but unfortunately no resolution apart from the steps above.

I will also log a TAC case, but hoping to hear if anyone has experienced similiar issues after an upgrade?

Thanks,

Ruan

Ruan_Kotze · ‎2019-09-22

Hi Everyone,

We worked with TAC and manage to resolve the issue. In the end we had to follow step 4 in sk43807. All updates etc are working and all warnings in Smartconsole have been cleared.

Cheers,

Ruan

View solution in original post

Maarten_Sjouw · ‎2019-09-11

All R80.30 gateway clusters we run are using VRRP and I can set this NAT function on the cluster object and still do not understand why this option is not available for ClusterXL.
I really don't.

Regards, Maarten

G_W_Albrecht · ‎2019-09-11

Looks like sk147493 - seems no R80.30 Jumbo has this fix yet...

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Ruan_Kotze · ‎2019-09-22

I understood from TAC that there is a hotfix available, but they prefer not to deploy as it might be overwritten by the next Jumbo, causing behaviour regression.

Ruan_Kotze · ‎2019-09-12

I have opened a case with TAC. They seemed surprised that the kernel parameter did not fix the issue, I will update this thread once we have a resolution.

G_W_Albrecht · ‎2019-09-12

What i do wonder is why this is regarded as an issue ? Usually, i do not issue ping nor nslookup from the CLI of standby cluster members - or is there a very good reason for that ?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Maarten_Sjouw · ‎2019-09-12

The nslookup prevents that the gateway has access to the Check Point cloud, so when there is a failover, many things need to get their updates at that moment...
Meaning there is no updated URL/APCL database, IPS (when set to the gateway gets it by itself), Dynamic objects will fail during the first minute.
Also Cpuse will not be able to show you the list of available downloads, so when you want to update the cluster with the latest Jumbo, you need to make the member master first wait for it to get the update list etc etc.

Regards, Maarten

Ruan_Kotze · ‎2019-09-22

Hi Everyone,

We worked with TAC and manage to resolve the issue. In the end we had to follow step 4 in sk43807. All updates etc are working and all warnings in Smartconsole have been cleared.

Cheers,

Ruan

Jon_Moss · ‎2020-05-12

I'm seeing this issue still in R80.40, is the 'solution' listed here by Ruan still considered the accepted approach to remediate?

thanks

Adam276 · ‎2020-10-16

Same issue for me with a new cluster R80.40 on OpenPlatform (Dell R640). Whichever member is in standby state cannot establish outbound connections (UDP, TCP, ICMP, etc) . The odd thing is that only traffic generated from the firewall going outbound is having the issue. Traffic that establishes a connection to the firewall (pings, ssh, etc) is working just fine from external IPs on the Internet.

I checked the SYNC interface and the local STANDBY member's generated traffic to the Internet is being forwarded to the ACTIVE member over the SYNC interface and ignoring the default gateway. I am seeing the source external interface IP of the STANDBY member and destination of the external google server I am pinging and the destination MAC address of the ACTIVE member's sync interface IP. This is with a simple ping to google. DNS to an external DNS server doesn't work either as that is also forwarded to the ACTIVE member.

This seems like pretty basic stuff that needs to work. I don't have this problem on clusters running R80.10 or earlier (I have never used R80.10 or R80.20 on gateways so I don't know about them). We really shouldn't need any workarounds for the STANDBY to be able to send local outbound traffic to check for updates. The default route is not being used so something in the checkpoint firewall stuff is intercepting the local traffic and forwarding it to the ACTIVE member's SYNC interface IP. An 'ip route get <www.google.com IP>' shows that it should be forwarded to the default gateway and not over the SYNC interface.

Adam276 · ‎2020-10-19

I fixed the issue with 'fw ctl set int fwha_forw_packet_to_not_active 1'. I tried that once (had the same setting on other sites) and it didn't work so I assumed something else was problematic. Not sure why it didn't work the first time I tried it but it is working now.

It appears the forwarding of cluster member originating traffic to the active member is expected and normal behavior. The gateways hide behind the cluster IP. STANDBY sends STANDBY originating traffic through the SYNC network with it's own external IP as the source. The ACTIVE member NATs the source IP to look like it came from the external cluster IP. Return traffic goes back to the ACTIVE member and then gets forwarded to the STANDBY.

Jon_Moss · ‎2020-10-20

It does work, but I guess the question still remains as to why a cli hack is needed for what is basically a required function. Ah well, perhaps in R80.50 and beyond it will become a simple tick button..

Maarten_Sjouw · ‎2020-10-20

A simple tick button that has always been there for a VRRP cluster....

Regards, Maarten

Adam276 · ‎2020-10-31

If you found a tick box in the GUI for 'fw ctl set int fwha_forw_packet_to_not_active 1', please do share.

Adam276 · ‎2020-10-31

Interesingly, I just found sk93204 article that was last updated in May 2020 but showing it applies to R80.40 saying that you should not use the 'fwha_forw_packet_to_not_active 1' setting on R80.20+ for a 'Standyby drops' problem and not to do anything for that for the newer versions. It claims you should use 'fw ctl set int fwha_silent_standby_mode 1' for R80.20+ to solve a 'Peer drops' problem. I never set that and only just found that article. I am pretty sure that I had the 'Standby drops' problem though because the standby couldn't connect outbound to checkpoint. It seems like the 'fwha_forw_packet_to_not_active 1' is still needed from my experience.

Scott_Paisley · ‎2020-10-31

This behaviour changed in R80.40. In R80.30 the traffic is still sent between the gateways, but they both use the same interface. In R80.40 the standby uses the sync interface instead. This broke stuff for us (I have a ticket open). The standby gateway needs to access a website through a VPN tunnel. Works fine pre-R80.40

Since the upgrade, the standby forwards the packet over the sync interface, the active encrypts it, send it through the tunnel, gets the reply, decrypts it, then drops it because it says the interface is illegal...

dropped by vpn_encrypt_chain Reason: illegal interface group

rogo · ‎2020-11-01

The "illegal interface group" issue has been recently discovered and the fix is being integrated into JHF.
For now please ask the TAC to obtain the hotfix for PMTR-60663. As a temporary workaround, set fwha_cluster_hide_active_only to 0.

Ruan_Kotze · ‎2020-10-20

Hi Jon,

Something else that worked well for me in a couple of other instances (and is perhaps a bit easier to do) is to do NAT rules which forces a "no hide":

rogo · ‎2020-11-01

The challenges (and solutions) related to Standby's connections are described in sk169154, section "Standby member's connections"

A few points relevant to this thread:

Standby's connections are asymmetric since the replies are handled by Active. In other words, each member only sees the packets in a single direction.
In R80.20 and later versions, deep inspection and security are applied more aggressively by default, even on local connections. This causes the active member drop some standby's connections (because it doesn't see all packets).
The solution is making Standby's connection to be handled by the active member even on the outbound direction. Controlled by fwha_cluster_hide_active_only (enabled by default in R80.40)
Using manual NAT rules to prevent NAT on private member's IPs is not supported. If such rules are defined, it will cause connectivity issues in the new mode above. If it's required for some reason to disable cluster NAT, use one of the supported methods (sk31832, sk34180)
The parameter fwha_forw_packet_to_not_active has now no bearing in this scenario and should not be set to 1.

In general, connections from standby should work seamlessly in the latest versions. If there are issues:

- If the version is R80.40 or above, check that you don't have manual no NAT rules enabled.

- For earlier versions, enable the parameter fwha_cluster_hide_active_only (you may need to add an access policy rule if using management pre R80.40, as described in sk169154)

If still any problems, please contact me directly at mrogovin@checkpoint.com

Thanks,

Michael

Are you a member of CheckMates?

Connectivity issues from standby gateway after R80.10 -> R80.30 upgrade