VPN behaviour at Cluster failover (VPNs not operational after failover - all other traffic fine)

nzmatto · ‎2021-09-01

I have an R80.10 cluster in operation that I want to upgrade; however, when we fail traffic over to the secondary node, three business-critical VPNs stop receiving traffic. The issue does not appear on my R80.40 cluster, but I can't find any notes suggesting a change in behaviour between the versions. I would like to understand what happens to site-to-site VPNs at a cluster failover.

Specifically, on the R80.40 cluster I noted that the VPNs dropped and re-established when the cluster failed over. On the R80.10 cluster there were no such logs. From the firewall's point of view it looked as if the traffic was simply handed over from one member to the other, but ultimately traffic was not making it to the destination host, and I was unable to determine whether it was in fact making it to the firewall, as I had a very limited outage window. All other traffic on all interfaces failed over correctly, including other (non-VPN) traffic on the same interface the VPN exits from. Only the VPN traffic was impacted.

When a cluster fails over, should a VPN drop and re-establishment be expected?

If this does not happen, should manually forcing the tunnels to drop through VPN TU option 9 (Delete all IPsec SAs for ALL peers and users) work?

Is there anything specific I can look at in the config to determine how the VPNs may behave at failover?

Are there any recommended commands for monitoring this during the failover?

My plan is to schedule another test / outage window, but once again there will be strict limits on the time I have available for testing / roll-back, so I need to be sure I have everything I may need prepared in advance.

Thanks Matt

Vladimir · ‎2021-09-01

Please expand on what kind of cluster you are working with that is experiencing this issue.

In a ClusterXL HA, VPN states are maintained and the failover is a transparent event.

If you are not using VMAC and are relying on G-ARP for failover, or are using VRRP, your mileage may vary. Also, check whether State Synchronization is enabled for the cluster.

As per documentation:

"A High Availability Security Cluster ensures Security Gateway and VPN connection redundancy by providing transparent failover to a backup Security Gateway in the event of failure."

nzmatto · ‎2021-09-01

It is ClusterXL HA, and we are relying on G-ARP to notify the network. We attempted to use VMAC, but for whatever reason this caused a major issue on the internal side of the cluster. G-ARP seems to be working OK as evidenced by the fact all other traffic moves across.
State Synchronization is enabled for the cluster.
To initiate the failover I simply change the priority of the cluster members and push a policy update. The VPNs were previously failing on both the R80.40 and R80.10 clusters; however, I diagnosed this as a G-ARP issue on the external switches. Once I resolved that, the R80.40 cluster was fine: when I initiated the failover the logs showed a VPN re-establishment, and everything was perfect. On the R80.10 cluster it looked like everything had handed over seamlessly, as expected with a transparent failover, but the end application stopped receiving the data that was previously arriving via the VPN.

Vladimir · ‎2021-09-01

If the VPN integrity is maintained after the failover, but the application is not functioning, I'd check the routing on the standby cluster member to make sure that its configuration matches that of the active one.

nzmatto · ‎2021-09-01

Agreed, however all other traffic (non VPN) to the same internal device works perfectly. I have done a line by line comparison on the config of both devices and there are no unexpected differences.

Vladimir · ‎2021-09-01

What do you see happening with that traffic after the unsuccessful failover in the logs?

BTW: you have mentioned that your failover process was change of priority with policy load.

That's not really a cluster failover process. You really should be using "clusterXL_admin down" for it.
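A controlled failover along those lines could be sketched as the short script below, run on the currently active member. `clusterXL_admin down` is the command named above; `cphaprob state` and `clusterXL_admin up` are the usual companions for checking member state and restoring the member. The `run` wrapper, the `DRY_RUN` variable and the function names are hypothetical helpers added for illustration, and the script only prints the commands unless `DRY_RUN=0` is set. Treat it as a sketch to adapt, not a verified procedure for any particular version.

```shell
#!/bin/sh
# Sketch of a controlled ClusterXL failover, intended for the active member.
# DRY_RUN defaults to 1: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# run: echo the command, and execute it only when DRY_RUN=0 (hypothetical helper)
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

failover_sequence() {
    run cphaprob state          # record member states before the change
    run clusterXL_admin down    # mark this member as down so the peer takes over
    run cphaprob state          # confirm the peer is now active
}

failback_sequence() {
    run clusterXL_admin up      # make this member available again
    run cphaprob state          # check the resulting member states
}

# Optional dispatch: "down" starts the failover, "up" restores the member.
case "${1:-}" in
    down) failover_sequence ;;
    up)   failback_sequence ;;
esac
```

Saved under a name of your choice, running it with `down` prints the planned steps, and `DRY_RUN=0` with the same argument executes them. Whether the restored member becomes active again after `up` depends on how the cluster's recovery behaviour is configured.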

nzmatto · ‎2021-09-01

That's one of my issues: when I conducted the test I had a very limited change window, and I was expecting it to just work after fixing the G-ARP problem, as it had on the other cluster. I did not have time to run through troubleshooting, but there were no traffic logs relating to traffic exiting the VPN while we were on the alternate node.
I have not used clusterXL_admin down before. I will look into this and use it next time.
I've just been talking to the client and it seems it's going to be a few weeks before I'll be able to get another change window to test this. The plan is to ask for a longer window so I have time to do some troubleshooting.
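For a short window, it may help to prepare the checks in advance. The sketch below only prints a checklist of commands to run on the newly active member: cluster and sync state (`cphaprob`), kernel drops (`fw ctl zdebug drop`), and a `tcpdump` filter for IKE (UDP 500), NAT-T (UDP 4500) and ESP (IP protocol 50) to see whether tunnel traffic reaches the firewall at all. `PEER_IP`, `IFACE` and the function names are placeholders and assumptions, not values from this environment.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with the real VPN peer and external interface.
PEER_IP="${PEER_IP:-203.0.113.10}"
IFACE="${IFACE:-eth0}"

# Build a capture filter for IKE (UDP 500), NAT-T (UDP 4500) and ESP (IP protocol 50).
ipsec_filter() {
    echo "host $1 and (udp port 500 or udp port 4500 or ip proto 50)"
}

# Print the checklist rather than running it, so it can be reviewed before the window.
print_checks() {
    echo "cphaprob state"        # which member is active
    echo "cphaprob -a if"        # state of the monitored cluster interfaces
    echo "cphaprob syncstat"     # state synchronization statistics
    echo "fw ctl zdebug drop"    # live kernel drop messages
    echo "tcpdump -nni $IFACE '$(ipsec_filter "$PEER_IP")'"
}
```

Calling `print_checks` gives a copy-and-paste list. If the capture shows ESP arriving but nothing in the traffic logs, the problem is likely on the gateway; if nothing arrives, it points back towards the upstream network or the peer.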

PhoneBoy · ‎2021-09-01

This is a multi-version cluster failover, which I suspect will act a bit differently than a regular cluster failover.
There are also some pretty significant differences between R80.10 and R80.40.
That said, assuming you're not talking about Traditional Mode VPN, it should work per: https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

In which case, it might be worth a TAC case before your next attempt.

