Customer scenario:
1 central cluster (2 members) and 7 remote gateways connected by Site-to-site VPN over MPLS lines.
The management server is on the same network as the central cluster.
Changes done:
Upgraded the cluster hardware from 5000 to 7000 appliances and the version from R80.20 to R80.40.
The customer also changed from a centrally managed license to local licenses for the new hardware.
Step by step:
- Replaced the passive node hardware (preconfigured, ClusterXL activated and SIC key set).
- In SmartConsole, reset SIC for the passive node and changed the object's version to R80.40 and hardware to 7000 (gateway-side commands sketched after this list).
- Installed the access policy on the central cluster only (unchecked the option to fail the installation if it cannot be installed on all members).
- Repeated the first three steps for the active node, but this time installed the policy on all gateways.
- The new cluster was now OK, but installation on all the other gateways failed because the management server could not connect to them.
- Installed the threat prevention policy on the central cluster.
- vpn tu showed all VPNs as down, and we had no contact with the remote gateways.
- The logs showed successful VPN traffic between the remote sites, where the traffic seemingly passes through the central cluster (origin and interfaces listed in the logs). Does it broker the connections between the other sites even though it has no active VPNs to them itself?
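For reference, the gateway-side part of the SIC reset was roughly this (a minimal sketch; the one-time key "MySicKey" is just a placeholder, and exact cpconfig menus vary by version):

  # On the replacement member, set the one-time SIC key (placeholder key)
  cp_conf sic init MySicKey

  # Once SIC is re-established from SmartConsole, sanity-check the cluster
  cphaprob state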
After this I tried loads of different things, but nothing would bring up communication with the other sites. Meanwhile, the logs kept suggesting that traffic originating from the networks behind the central cluster successfully passed over the site-to-site VPN to the remote networks, which cannot be right since there were still no tunnels up on the central cluster. Just confusing.
I did some checks with fw ctl zdebug drop, grepping for some of the nodes that normally connect over the VPN most actively, and found drops like this:
dropped by fw_ipsec_encrypt_on_tunnel_instance Reason: No error - tunnel is not yet established;
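For completeness, the check was roughly this (203.0.113.10 is a placeholder for one of the remote peers):

  # Kernel drop debug on the active member, filtered for one remote peer (placeholder IP)
  fw ctl zdebug drop | grep 203.0.113.10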
I also collected full VPN debugs and a cpinfo before we gave up and rolled back to the old hardware. We will open a case with Check Point unless you guys can point out something we obviously did wrong.
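The collection was roughly this (a sketch; the output file name is a placeholder, and the debug output lands in $FWDIR/log):

  # Truncate and restart the VPN daemon debug logs (ike.elg / vpnd.elg)
  vpn debug trunc
  # ... reproduce the tunnel failures, then stop the debug
  vpn debug off
  # Collect diagnostics for the TAC case (file name is a placeholder)
  cpinfo -o /var/log/cpinfo_central_cluster.info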
Now, rolling back was interesting! As soon as we connected the old appliances, all the other gateways showed up in SmartConsole (CPU info etc.), and after we changed back the versions and hardware, reset SIC and so on, all the tunnels were back up. WTF! Did the old gateways still carry IKE SAs from before they were disconnected, or why else would they reconnect everything so instantly even though SIC wasn't even completely reset at that point?
What am I missing?! We had no way of resetting the SAs on the remote gateways during the troubleshooting, so I don't know whether that would have solved anything. Next time we will have staff at the remote sites as well.
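If anyone else ends up in the same spot: with console access on the remote side, I would have tried deleting the SAs for the central cluster's peer IP from the interactive tunnel utility (a sketch; exact menu wording varies between versions, and 203.0.113.1 is a placeholder for the cluster's VPN peer IP):

  # On a remote gateway, open the interactive tunnel utility
  vpn tu
  # Choose the menu option to delete all IPsec+IKE SAs for a given peer (GW)
  # and enter the central cluster's peer IP, e.g. 203.0.113.1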
I haven't had a problem like this when upgrading clusters before. I recently upgraded another Check Point cluster with some third-party VPNs, such as Azure VPN and some Cisco ASAs, which caused no trouble at all.
Here, with all peers being Check Point gateways and the remote gateways already running R80.40, shouldn't this just have worked?
Would anything have gone differently if I had installed the access policy on all gateways when replacing the passive node?
Just occurred to me while proofreading my rant: could it be the license? The customer changed from a centrally managed license to local licenses between the old and the new hardware, but kept the same cluster object.
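For next time, a quick sanity check of what is actually licensed on each member would be something like this (nothing version-specific, just the standard license listing):

  # On each new cluster member: list the locally installed licenses
  cplic print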
Pointers welcome.
/ Ilmo