Dear Community!
We need to replace a cluster of 15600 gateways with 16200 gateways. A week ago we attempted the replacement in a maintenance window, but we had to roll back because of the issue described below.
During the replacement, we followed this guide: https://community.checkpoint.com/t5/Security-Gateways/Replace-Upgrade-Cluster/m-p/69216
We preconfigured the new gateways with the old gateways' configuration. The only change was in the interface names:
Old GW --> New GW
eth2-01 --> eth1-01
eth2-08 --> eth1-08 (it is forming a bond with the Sync interface)
eth3-01 --> eth2-01 (it is forming a bond with eth2-02)
eth3-02 --> eth2-02 (it is forming a bond with eth2-01)
Mgmt --> Mgmt (remained the same)
Sync --> Sync (remained the same, forming a bond with eth1-08)
LOM --> LOM
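One thing worth noting about the mapping above: eth2-01 is both an old name (it becomes eth1-01) and a new name (it replaces eth3-01). If the old configuration is translated with sequential search-and-replace, an interface can be renamed twice. Below is a minimal Python sketch of a single-pass rename that avoids this. The sample configuration lines and the `translate` helper are only illustrative assumptions, not taken from our actual gateways:

```python
import re

# Old 15600 name -> new 16200 name, as listed in the table above.
# Mgmt, Sync and LOM keep their names, so they are not included.
INTERFACE_MAP = {
    "eth2-01": "eth1-01",
    "eth2-08": "eth1-08",
    "eth3-01": "eth2-01",
    "eth3-02": "eth2-02",
}

# One alternation matched in a single pass, so a name that was just
# produced (e.g. eth2-01 from eth3-01) is never rewritten a second time.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in INTERFACE_MAP) + r")\b"
)


def translate(line: str) -> str:
    """Return the line with every old interface name replaced by the new one."""
    return _PATTERN.sub(lambda m: INTERFACE_MAP[m.group(1)], line)


if __name__ == "__main__":
    # Hypothetical configuration lines, for illustration only.
    sample = [
        "set interface eth2-01 state on",
        "add bonding group 1 interface eth3-01",
        "add bonding group 1 interface eth3-02",
    ]
    for line in sample:
        print(translate(line))
```

Because the pattern ends on a word boundary, a VLAN sub-interface such as eth3-01.100 would also be renamed (to eth2-01.100), while unrelated names are left alone.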
We disconnected all ports of the secondary site's old GW and connected the secondary site's new GW. We reset the SIC and changed the cluster member topology from (old) eth2-01 to eth1-01. The rest of the interfaces are in bonds / VLANs, so we can't change the actual interface behind them in SmartConsole. We also changed the Device Platform from 15000 to 16000 appliances. Up to this step, everything worked fine: we could ping from the new GW to the Mgmt Server and to the Mgmt address of the primary site's GW.
As the next step, we installed the access control policy, but it timed out for the new GW after about 10 minutes. Checking "cpstat fw" or "fw stat" in the new GW's CLI showed that the policy had been installed successfully.
After the policy install, however, every connection stopped working on the new GW. We could no longer ping from the new GW to the Mgmt Server or the primary site's GW, even though this had worked before the policy install. In addition, the SIC connection changed to an "unknown" state (trust was still established, but the connection did not work).
The 15600 has 28 firewall instances in CoreXL and the 16200 has 43, so we reduced the 16200 to 28 FW instances in case the mismatch was the cause. Unfortunately, this did not solve the main problem.
When we ran "fw unloadlocal" on the new GW, all connections started to work again and the SIC connection status returned to "Communicating". When we installed the policy again, the same issue occurred: no SIC connection, and no ping to the Mgmt Server or the primary site (they are connected at L2, with no routing in between). In the zdebug output, all we saw were "First packet isn't SYN" messages. Anti-spoofing is set to detect only, not prevent. We also tried rebooting the new GW, without success.
Unfortunately, we had to roll back at this step.
Does anybody have an idea what we should check next time, or what the issue could be? We also have a TAC case open.
Thanks in advance!
Richard