Customer scenario:
1 central cluster (2 members) and 7 remote gateways connected by Site-to-site VPN over MPLS lines.
The management server is on the same network as the central cluster.
Changes done:
Upgraded the cluster hardware from 5000 to 7000 appliances and the version from R80.20 to R80.40.
The customer also changed from a centrally managed license to local licenses for the new hardware.
Step by step:
- Replaced the passive node hardware (preconfigured, ClusterXL activated and SIC key set).
- In SmartConsole, reset SIC for the passive node and changed the object's version to R80.40 and hardware to 7000 (gateway-side commands sketched after this list).
- Installed the access policy on the central cluster only (unchecked the option to fail the installation if it cannot be installed on all members).
- Repeated the first three steps for the active node, but this time installed the policy on all gateways.
- The new cluster was now OK, but installation on all the other gateways failed because the management server could not connect to them.
- Installed the threat prevention policy on the central cluster.
- vpn tu showed all VPNs as down, and we had no contact with the remote gateways.
- The logs showed successful VPN traffic between the remote sites, where the traffic seemingly passes through the central cluster (origin and interfaces listed in the logs). Does it broker the connections between the other sites even though it has no active VPNs to them itself?
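For reference, the gateway-side part of the SIC reset was roughly this (a minimal sketch; the one-time key "MySicKey" is just a placeholder, and exact cpconfig menus vary by version):

  # On the replacement member, set the one-time SIC key (placeholder key)
  cp_conf sic init MySicKey

  # Once SIC is re-established from SmartConsole, sanity-check the cluster
  cphaprob state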
After this I tried loads of different things, but nothing would bring up communication with the other sites. Meanwhile, the logs kept suggesting that traffic originating from the networks behind the central cluster successfully passed over the site-to-site VPN to the remote networks, which cannot be right since there were still no tunnels up on the central cluster. Just confusing.
I did some checks with fw ctl zdebug drop, grepping for some of the nodes that normally connect over the VPN most actively, and found drops like this:
dropped by fw_ipsec_encrypt_on_tunnel_instance Reason: No error - tunnel is not yet established;
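For completeness, the check was roughly this (203.0.113.10 is a placeholder for one of the remote peers):

  # Kernel drop debug on the active member, filtered for one remote peer (placeholder IP)
  fw ctl zdebug drop | grep 203.0.113.10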
I also collected full VPN debugs and a cpinfo before we gave up and rolled back to the old hardware. We will open a case with Check Point unless you guys can point out something we obviously did wrong.
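The collection was roughly this (a sketch; the output file name is a placeholder, and the debug output lands in $FWDIR/log):

  # Truncate and restart the VPN daemon debug logs (ike.elg / vpnd.elg)
  vpn debug trunc
  # ... reproduce the tunnel failures, then stop the debug
  vpn debug off
  # Collect diagnostics for the TAC case (file name is a placeholder)
  cpinfo -o /var/log/cpinfo_central_cluster.info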
Now, rolling back was interesting! As soon as we connected the old appliances, all the other gateways showed up in SmartConsole (CPU info etc.), and after we changed back the versions and hardware, reset SIC and so on, all the tunnels were back up. WTF! Did the old gateways still carry IKE SAs from before they were disconnected, or why else would they reconnect everything so instantly even though SIC wasn't even completely reset at that point?
What am I missing?! We had no way of resetting the SAs on the remote gateways during the troubleshooting, so I don't know whether that would have solved anything. Next time we will have staff at the remote sites as well.
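If anyone else ends up in the same spot: with console access on the remote side, I would have tried deleting the SAs for the central cluster's peer IP from the interactive tunnel utility (a sketch; exact menu wording varies between versions, and 203.0.113.1 is a placeholder for the cluster's VPN peer IP):

  # On a remote gateway, open the interactive tunnel utility
  vpn tu
  # Choose the menu option to delete all IPsec+IKE SAs for a given peer (GW)
  # and enter the central cluster's peer IP, e.g. 203.0.113.1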
I haven't had a problem like this when upgrading clusters before. I recently upgraded another Check Point cluster with some third-party VPNs, such as Azure VPN and some Cisco ASAs, which caused no trouble at all.
Here, with all peers being Check Point gateways and the remote gateways already running R80.40, shouldn't this just have worked?
Would anything have gone differently if I had installed the access policy on all gateways when replacing the passive node?
Just occurred to me while proofreading my rant: could it be the license? The customer changed from a centrally managed license to local licenses between the old and the new hardware, but kept the same cluster object.
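For next time, a quick sanity check of what is actually licensed on each member would be something like this (nothing version-specific, just the standard license listing):

  # On each new cluster member: list the locally installed licenses
  cplic print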
Pointers welcome.
/ Ilmo