After installing the Policy in the replaced Cluster Member, all traffic was dropped

zsigmondrichar · ‎2024-05-23

Dear Community!

We have a task where we have to replace a Cluster formed by 15600 gateways with 16200 gateways. One week ago, we already tried to replace the gateways in a maintenance window, but we had to roll back due to the following issue:

During the replacement, we followed this guide: https://community.checkpoint.com/t5/Security-Gateways/Replace-Upgrade-Cluster/m-p/69216

So we preconfigured the new gateways with the old gateways' configuration. The only change was in the interface names:

Old GW --> New GW
eth2-01 --> eth1-01
eth2-08 --> eth1-08 (it is forming a bond with the Sync interface)
eth3-01 --> eth2-01 (it is forming a bond with eth2-02)
eth3-02 --> eth2-02 (it is forming a bond with eth2-01)
Mgmt --> Mgmt (remained the same)
Sync --> Sync (remained the same, forming a bond with eth1-08)
LOM --> LOM
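
One trap if this renaming is ever scripted: eth2-01 is both an old name (it becomes eth1-01) and a new name (for the old eth3-01), so applying the replacements one after another can rename the same port twice. Below is a minimal Python sketch that does all renames in a single pass. It assumes the old configuration is available as plain text; `rename_interfaces` is a hypothetical helper (not a Check Point tool), and the sample lines are only illustrative input.

```python
import re

# Old -> new interface names, taken from the mapping above.
INTERFACE_MAP = {
    "eth2-01": "eth1-01",
    "eth2-08": "eth1-08",
    "eth3-01": "eth2-01",
    "eth3-02": "eth2-02",
}


def rename_interfaces(config_text, mapping=INTERFACE_MAP):
    """Rename interfaces in one pass so a new name is never renamed again."""
    # Longest names first; the lookarounds stop a match inside a longer
    # token but still allow a VLAN suffix such as ".100".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], config_text)


if __name__ == "__main__":
    # Illustrative sample input only (hypothetical config excerpt).
    sample = "\n".join([
        "set interface eth2-01 state on",
        "add bonding group 1 interface eth3-01",
        "add bonding group 1 interface eth3-02",
    ])
    print(rename_interfaces(sample))
```

A naive chain of string replacements (eth3-01 to eth2-01 first, then eth2-01 to eth1-01) would move the old eth3-01 all the way to eth1-01, which is why the sketch matches every old name in one regular expression.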

We disconnected all ports of the secondary site's old GW and connected the secondary site's new GW. We reset SIC and changed the cluster member topology from (old) eth2-01 to eth1-01; the rest of the interfaces are in bonds / VLANs, so we can't change the actual interface behind them in SmartConsole. We also changed the Device Platform from 15000 to 16000 appliances. Up to this step, everything worked fine: we could ping from the new GW to the Mgmt server and to the primary site's GW's Mgmt address.

As the next step, we installed the access control policy, but it timed out for the new GW after about 10 minutes. By checking "cpstat fw" or "fw stat" in the new GW's CLI, we saw that the policy had been installed successfully. After the policy install, however, every connection stopped working on the new GW. We could no longer ping from the new GW to the Mgmt server or the primary site's GW, even though it had worked before the policy install. In addition, the SIC status changed to "unknown" (trust was still established, but the connection did not work).

As the 15600 has 28 firewall instances in CoreXL and the 16200 has 43, we reduced the FW instances on the 16200 to 28 in case the mismatch was the issue, but unfortunately this didn't solve the main problem.

When we ran "fw unloadlocal" on the new GW, all connections started to work again, the SIC status was "Communicating", etc. When we installed the policy again, the same issue came back: no SIC connection, and no ping to the Mgmt server or the primary site (it is connected at L2, with no routing in between). When we checked the zdebug, all we saw were "First packet isn't SYN" messages. Anti-spoofing is set to detect only, not prevent. We tried rebooting the new GW, with no success.

Unfortunately, we had to roll back at this step.

Does anybody have an idea what we should check next time, or what the issue could be? We also have a TAC case open.

Thanks in advance!

Richard

Duane_Toler · ‎2024-05-23

This almost smells like a bad ARP cache for your cluster VIP. The ARP cache on your neighbors is usually held for about 4 hours on most L3 switches. When you did "fw unloadlocal", this reverted the gateways to sending packets via their native interfaces versus the cluster VIP. I would start there.

If "fw ctl zdebug -m fw + drop" didn't show packets dropped due to rulebase or anti-spoofing, then I would also check "fw monitor" to see if the gateways can emit packets to the interface (Big O). Check "tcpdump -nni <interface>" as well.

Double-check "cphaprob -a if" to make sure the cluster interfaces match the gateway interfaces; with Bond and VLANs, I would expect this to be ok.

If you see packets being emitted, then check your L3 neighbors' ARP caches to see what bindings they have for your VIP. I would first check the bindings, then do "clear ip arp", including any VRF if required.

Do your gateways and management have local LAN access between them, or does your management go through a VPN or NAT before reaching the gateways? If your previous gateways had a static $FWDIR/conf/masters file, then be sure you replicate that on the new gateways. This is needed if your management is in public cloud, especially.

I wouldn't fret about "first packet isn't SYN" right now. That's likely a red herring at this stage.

zsigmondrichar · ‎2024-05-24

Thanks for the answer!

During the gateway replacement, we had a TAC session and the engineer checked "fw monitor". We saw the traffic reach the replaced FW on bond0 (the bond for the Sync connection) at the "i", "I" and "o" inspection points, but it was dropped at the last one, "O", with these messages: "fw_first_packet_state_checks Reason: ICMP reply does not match a previous request." and "First Packet Isn't SYN". There is only one Layer 2 switch between the Sync interfaces. But I think this could be misleading because, as far as I know, when a Sync connection is in a bond, only one of the bond members is used at a time, and I think we checked the interface that was not in use at that moment.

The output of "cphaprob -a if" was correct on the new GW.

The gateways and the management have local LAN access between them, so it doesn't go through any VPN or NAT before reaching the gateways.

During the replacement we only changed the standby member (as the first step); the active member remained the old 15600. So the cluster VIP should remain the same, is that correct? We didn't touch the primary site in this step.

What I forgot to mention is that, on rare occasions, a ping did work between the new GW and the old GW / Mgmt. Let's say that out of 200 ping probes, 5 got through. Since it is a directly connected interface without routing in between, I don't think that asymmetric routing can be the issue here.

G_W_Albrecht · ‎2024-05-23

I doubt that a cluster with two rather different GWs works at all! Is ClusterXL working? I would assume that when a 15600 and a 16200 are clustered, it will not.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

zsigmondrichar · ‎2024-05-24

Thanks for the answer!

Well, I can accept that a 15600 and a 16200 cannot form a fully working cluster, and that we cannot achieve a full sync between them, but I think L3 connectivity between them should still work.

Are you saying that we should try to disconnect the primary site's GW even though the replaced GW is not reachable after the policy install?

Duane_Toler · ‎2024-05-25

Reading this again, a little more closely, I would schedule a new maintenance window. This time, disconnect both of the old 15600 gateways and use either one, or both, of the new gateways. After you disconnect the gateway interfaces, check your management server ARP cache, too, just to make sure that it has properly expired. Run "arp -d <ip>" to delete each of the host IPs and the VIP IP.
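
Before deleting entries, it can help to see which ones are actually stale. Here is a rough Python sketch that compares the cached bindings against the MACs you expect. It assumes the Linux "arp -an" output format, the IPs and MAC addresses are made-up documentation values, and `parse_arp` and `stale_entries` are hypothetical helper names.

```python
import re

# Matches lines like: ? (192.0.2.10) at 00:1c:7f:00:00:01 [ether] on eth0
ARP_LINE = re.compile(r"\((?P<ip>[\d.]+)\) at (?P<mac>[0-9A-Fa-f:]{17})\b")


def parse_arp(arp_output):
    """Return {ip: mac} for every resolved entry in `arp -an` style output."""
    table = {}
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if match:  # "<incomplete>" entries do not match and are skipped
            table[match.group("ip")] = match.group("mac").lower()
    return table


def stale_entries(arp_output, expected):
    """List (ip, cached_mac, expected_mac) where the cache disagrees."""
    cached = parse_arp(arp_output)
    return [
        (ip, cached[ip], mac.lower())
        for ip, mac in expected.items()
        if ip in cached and cached[ip] != mac.lower()
    ]


if __name__ == "__main__":
    # Made-up sample output and expected bindings, for illustration only.
    sample = "\n".join([
        "? (192.0.2.10) at 00:1c:7f:00:00:01 [ether] on eth0",
        "? (192.0.2.11) at <incomplete> on eth0",
    ])
    expected = {"192.0.2.10": "00:1c:7f:00:00:99"}
    for ip, cached_mac, expected_mac in stale_entries(sample, expected):
        print(f"{ip}: cached {cached_mac}, expected {expected_mac}")
```

Any IP the sketch reports is a candidate for "arp -d <ip>"; entries that are missing or incomplete are ignored, since there is nothing stale to clear.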

I presume you've checked, but make sure "cphaprob stat" shows the cluster forming. For the Sync interface, in "fw monitor", you shouldn't be seeing any TCP-related messages; CCP is UDP 8116. If you see something TCP-related, then some cable is connected to the wrong port on the switch. I presume the ICMP message is because you were trying to ping the Sync peer IP?

You can check "cphaprob syncstat" to see if synchronization is taking place correctly. TAC may have checked this, tho.

If at all possible, can you do a lab test of these gateways? Move them, or their switch port connections, to new VLANs and test them in a "clean room" environment to make sure they function as you intend before taking down your network for trial and error. If you were able to install policy (as you saw in "fw stat"), then they should have a working policy (theoretically) and form a cluster. Then you can do some basic sanity tests. If possible, move your management to the new VLAN as well to test that connectivity.

Good luck on your next attempt!

G_W_Albrecht · ‎2024-05-27

Do you have an open SR# with CP TAC for this issue?

zsigmondrichar · ‎2024-05-27

Yes, we have an open SR. Do you need the number?
