Advice Needed on ISP Redundancy and VPN Stability in Check Point HA Setup

samir-brkic · ‎2024-09-04

Hello,

We are currently managing a Check Point cluster configured for High Availability (HA) with two members and are encountering an issue related to VPN stability. I would appreciate your advice on best practices to ensure optimal operation for our setup.

Current Configuration:

Data Centers: The Check Point nodes are deployed across two different data centers.
Sync and Internal Ports: These ports are connected through two separate switches, which are interlinked to ensure connectivity between the data centers.
External Ports: Each Check Point node has an external port connected directly to redundant ISP routers provided by a single ISP. The ISP manages failover on their end, and the ISP routers in each data center are interconnected to maintain redundancy.

Issue Description:

We are experiencing issues with Site-to-Site VPN connections dropping after a standby node reboot. Specifically, the Site-to-Site connections become non-functional, and we need to manually reset them using the command vpn tu with option "0" to re-establish the connections. This command serves as a workaround, but we are looking for a more permanent solution.

During our analysis, we considered that the issue might be related to the physical connection to the ISP routers. However, we could not find best practices for ISP redundancy in setups where multiple ISP routers are used within a single ISP's network. The official documentation primarily covers redundancy with two separate ISPs.

Any insights or recommendations you could provide regarding this issue would be greatly appreciated.

Thank you for your assistance!

Best regards,

Samir

Duane_Toler · ‎2024-09-04

Are you using the new "Active Active" mode with one node in each geographically-separated location and different subnets on the interfaces? If so, VPN blade isn't supported in this fashion.

If not, what mode of ClusterXL are you using?

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

samir-brkic · ‎2024-09-04

@Duane_Toler

Thank you for your feedback. As mentioned in the description, we are currently managing a Check Point cluster configured for High Availability (HA) with two members, so we are using Active-Standby mode, not Active-Active.

Thanks again for your assistance!

Duane_Toler · ‎2024-09-04

Can you make a quick diagram of your topology? Depending on how you have interfaces configured and connected, this may (or may not) be part of your issue. My suspicion is that you are losing path reachability, or peer adjacency, during a failover and you need to use something like BGP between your gateways and ISP routers. A quick diagram will help answer that with more certainty.

samir-brkic · ‎2024-09-04

This is a quick diagram of the cluster.

Duane_Toler · ‎2024-09-04

This indicates that each firewall's external interface is directly connected to its own ISP router. This will break ClusterXL across the external interfaces. Do you have a switch that is connecting the eth0 interfaces together, along with the router interior interfaces?

samir-brkic · ‎2024-09-04

That is correct: the firewalls are directly connected to the ISP routers. That is also what I wanted to know: is there a best practice for this kind of setup, where we have no switch between the ISP routers and the firewalls?

Duane_Toler · ‎2024-09-04

So this means you did not configure your eth0 as a "cluster" interface type? This is why you are having issues. Add a switch between the firewalls and routers so everything is on the same layer2 segment. One of your interface IPs will have to change as well as one of the ISP router interior interfaces. If you have 2 /30 segments, then your ISP needs to assign you a single /29 for all of them. Define eth0 as cluster type and give it a VIP.
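To make the addressing point concrete, here is a small sketch using Python's standard `ipaddress` module. The prefixes are documentation ranges picked purely for illustration, not addresses from this thread, and the count of five addresses (two member IPs, one cluster VIP, two ISP router interior interfaces) is an assumption based on the layout described above.

```python
import ipaddress

# Hypothetical documentation prefixes, not the poster's real addressing.
p2p_a = ipaddress.ip_network("203.0.113.0/30")   # firewall A <-> ISP router A
p2p_b = ipaddress.ip_network("203.0.113.4/30")   # firewall B <-> ISP router B
shared = ipaddress.ip_network("203.0.113.0/29")  # one segment for everything


def usable_hosts(net):
    """Return the usable host addresses of an IPv4 network."""
    return list(net.hosts())


# Assumed address count on the shared layer-2 segment:
# 2 cluster members + 1 cluster VIP + 2 ISP router interior interfaces.
needed = 2 + 1 + 2

for net in (p2p_a, p2p_b, shared):
    hosts = usable_hosts(net)
    print(net, "usable:", len(hosts), "fits cluster segment:", len(hosts) >= needed)
```

A /30 only has room for the two ends of a point-to-point link, so there is nowhere to put a VIP. A /29 has six usable addresses, which covers the five assumed here and leaves one spare (for example, if the ISP also presents a virtual gateway address).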

If you're in need of BGP, then have the ISP routers peer with the cluster VIP, not each individual member. However, each of your cluster members will need to peer with each ISP router.

Important items to include with BGP:

1. BOTH cluster members MUST be the same router-id (the cluster is presenting as the RID)

2. BOTH cluster members MUST be the same BGP ASN (the cluster is presenting as the ASN)

3. BOTH cluster members MUST have the same routemap configuration (same routing policies)
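The three items above amount to a consistency check between the two members. Below is a minimal sketch of that check in Python. The dictionary keys, values, and route map names are hypothetical placeholders typed in by hand; they are not output from any Check Point command or API.

```python
# Hypothetical per-member BGP settings, entered by hand for illustration.
member_a = {
    "router_id": "192.0.2.1",
    "asn": 64512,
    "routemaps": ("export-to-isp", "import-from-isp"),
}
member_b = {
    "router_id": "192.0.2.1",
    "asn": 64512,
    "routemaps": ("export-to-isp", "import-from-isp"),
}

# Settings that must be identical on both members so the cluster
# presents itself to the ISP routers as a single BGP speaker.
MUST_MATCH = ("router_id", "asn", "routemaps")


def bgp_mismatches(a, b, keys=MUST_MATCH):
    """Return the names of settings that differ between two cluster members."""
    return [k for k in keys if a.get(k) != b.get(k)]


print(bgp_mismatches(member_a, member_b) or "members are consistent")
```

An empty result means the two members agree on all three settings; any name in the list is a setting to fix before expecting clean failover.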

Timothy_Hall · ‎2024-09-04

Can you please be more specific about exactly when the VPN tunnel drops relative to the timing of the reboot of the standby? In other words, does the tunnel stop working on the active as soon as the standby is dropped and loses link to the switch (less likely), or does the tunnel die at about the time the standby fetches the latest policy and attempts to rejoin the cluster (more likely)?

If the former, that would suggest some kind of network issue, perhaps with STP on the switch or possibly routing table re-convergence. If the latter, I know at one point that a gateway joining a cluster after a reboot would fetch the policy directly from the active member. There was also some extra logic added recently to a cluster policy installation so I'm wondering if the "policy sync" operation when the rebooted member attempts to join may be disrupting your VPN. Questions:

0) IKEv1 or IKEv2? Is this an interoperable VPN between Check Point and some other third-party device, or a homogeneous VPN?

1) I assume that reinstalling the policy (a full install, not an accelerated one; see sk169096) to the cluster when both members are working does not cause a VPN disruption?

2) Is the checkbox keep_IKE_SAs set under Global Properties...Advanced...Configure...VPN Advanced Properties...VPN IKE Properties? If not, all IKE SAs will be cleared upon policy installation, and the early termination of these IKE SAs has been known to hang tunnels, especially in an interoperable scenario. However, the IPsec SAs should be maintained, so existing tunnel connectivity should continue to work for up to 60 minutes.

3) Any logs about the VPN failing? Invalid SA? No response from peer? Invalid ID?
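The timing question at the top of this post (does the tunnel die when the standby goes down, or when it rejoins?) can be answered from three timestamps pulled out of the logs. A rough sketch in Python follows; the function name and the timestamps are made up for illustration, and "closest event" is only a heuristic, not a diagnosis.

```python
from datetime import datetime


def closest_event(tunnel_down, standby_down, standby_rejoin):
    """Return which standby event the tunnel failure is nearer to in time."""
    gap_down = abs((tunnel_down - standby_down).total_seconds())
    gap_rejoin = abs((tunnel_down - standby_rejoin).total_seconds())
    return "standby_down" if gap_down <= gap_rejoin else "standby_rejoin"


# Hypothetical timestamps for illustration only.
standby_down = datetime(2024, 9, 4, 10, 0, 0)
standby_rejoin = datetime(2024, 9, 4, 10, 6, 0)
tunnel_down = datetime(2024, 9, 4, 10, 0, 4)

print(closest_event(tunnel_down, standby_down, standby_rejoin))
```

A result of `standby_down` points toward the network-side explanation (STP or routing re-convergence), while `standby_rejoin` points toward the policy fetch when the member rejoins the cluster.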

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

samir-brkic · ‎2024-09-04

Thank you for your detailed response and questions. Here are the specifics based on your queries:

The Site-to-Site VPN stops working immediately, even if I run cpstop on the standby member, so it does not appear to be directly related to the cluster joining after a reboot.

To answer your questions:

0) We are using IKEv1. This is a Site-to-Site VPN between two Check Point clusters, both managed by the same management server.
1) No, installing the policy (even a full install) does not affect the VPN; the connections remain stable during policy installations.
2) Yes, the checkbox "keep_IKE_SAs" is checked under Global Properties > Advanced > Configure > VPN Advanced Properties > VPN IKE Properties.
3) Collecting logs has been challenging because these are both production environments. When the issue occurs, I usually need to apply the workaround immediately (vpn tu) to restore the connection, so I haven’t been able to capture logs during the failure. I will try to arrange a maintenance window, run VPN debug, and replicate the issue to gather more detailed logs.
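Since the difficulty is that the workaround has to be applied before anything can be collected, one option is a small wrapper that snapshots command output to timestamped files in a few seconds. The sketch below is generic Python and not a Check Point tool. The `capture` helper and the output directory are hypothetical, the commands to run are supplied by the caller, and whether a suitable Python 3 interpreter is available on the gateway is an assumption.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def capture(commands, outdir):
    """Run each command and save its output to a timestamped file in outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    saved = []
    for i, cmd in enumerate(commands):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            body = f"exit={result.returncode}\n{result.stdout}{result.stderr}"
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the remaining captures.
            body = f"error={exc!r}\n"
        path = out / f"{stamp}_{i:02d}_{Path(cmd[0]).name}.txt"
        path.write_text(f"$ {' '.join(cmd)}\n{body}")
        saved.append(path)
    return saved
```

For example, `capture([["cphaprob", "state"]], "/var/log/vpn_capture")` would save the cluster state just before the tunnels are reset; the directory name is a placeholder, and the command list should be whatever diagnostics support asks for.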

I appreciate your assistance and will look into arranging the debug session to provide more insights.

Thanks again for your help!

D_TK · ‎2024-09-04

I don't have an answer; I just want to say that I have the same issue, and have had it for years across multiple versions. Our environment is 8 locations, each with an active/standby cluster, and all locations belong to the same "meshed" VPN community. All are currently on R81.20 with the latest HFA. We've just built the vpn tu option 0 reset into all maintenance operations that require either a transfer of active/standby roles or a reboot of the standby.

A perfect example of this issue: Let's say I'm going to apply an HFA to the standby member - start the HFA, when that HFA install gets to ~16% complete, something triggers in the install process, probably a cpstop, which causes all the tunnels on the ACTIVE member to go down. An immediate vpn tu\0 brings those tunnels immediately back up.

It also happens often when we do simple role changes with the clusterXL_admin command.

So...if you ever get a resolution to this issue, please post back here.

thanks

samir-brkic · ‎2024-09-04

Do you have a topology similar to the one I posted earlier in the comments? Are your cluster members also directly connected to the ISP routers?

D_TK · ‎2024-09-04

Not really. In our case, both gateways are directly connected to the same L2 chassis switch. Our ISP delivers their fiber circuit via the LEC's fiber, so we have a Ciena Ethernet delivery switch on-prem that is directly connected to the same L2 switch as the gateways. The ISP routers could be as far as 20-30 miles away from our facility.

Duane_Toler · ‎2024-09-04

These issues often are due to underlying layer2 configurations (usually STP issues) and/or lack of layer3 routing or peering with upstream devices. I've seen delayed routing convergence when using VRRP for clustering.

For layer2, you'll need to use stacked switches (shared control planes) to avoid STP blocked ports. Interconnecting two standalone switches is inviting trouble (STP). I never suggest this scenario (unless you have LAN-based stacking such as Cisco StackWise Virtual or Arista's virtual MLAG).

For layer3, the 3rd party BGP peer should be peering with the cluster's VIP, not individual members. Peering with individual members WILL cause convergence delay. Peering with the VIP will ensure ClusterXL is synchronizing the FIB between members (CXL runs a FIBMGR process for this purpose).

During failover and routing re-convergence, the standby member first converts all FIB routes to "Kernel Remnant" routes (routing code K) until convergence is complete. Once convergence completes, the routes return to their original routing code for the appropriate routing protocol.

If you're using multicast mode, then you may run into issues with the peer layer3 devices that don't work with a unicast IP mapped to a multicast MAC. The default and preferred cluster mode is Unicast anyway for this reason.

Hopefully one of these items helps with your issues.

Timothy_Hall · ‎2024-09-04

Sounds a lot like the following, although this was supposed to be fixed a long time ago:

sk170055: Site to Site VPN outage on ClusterXL Active member when running "cpstop" on the Standby cl...

The SK has precious few details about the cause, but does mention sending of Tunnel Test packets. @D_TK do you have Permanent Tunnels set? If so it might be interesting to disable it and see if that changes the behavior.

D_TK · ‎2024-09-04

Thanks Tim - that SK looks right on point regarding the instances when it happens during patching or rebooting the standby. I'll disable permanent tunnels and push the next time we enter a patch window, which will hopefully be soon, as I'd like to get Take 79 installed before we hit lockdown. Thanks.

Timothy_Hall · ‎2024-09-04

Let us know how it goes @D_TK. You may want your Check Point SE to take a look at that SK's hidden notes to ensure Permanent Tunnels is actually the culprit; what I posted was just an educated guess.

samir-brkic · ‎2024-09-04

Thank you for pointing that out. Our cluster is currently running R81.20 with JHF Take 70, which is later than the Take containing the fix mentioned in sk170055. According to the SK, this issue was resolved in R81.20 starting from Take 43, so our environment should already include the fix.

We have had the "Permanent Tunnel" setting enabled in our configuration. I disabled it and tested again, but unfortunately, the VPN still goes down when running cpstop on the standby member. This behavior persists, so it doesn’t seem directly related to the issue described in the SK.

Thanks for your suggestion. This does seem like a similar issue, but based on the information and testing, it doesn’t appear to be the exact cause of our problem.

Appreciate your help

Timothy_Hall · ‎2024-09-04

As I said in my other reply, there may be hidden notes for that SK that may be helpful since this issue looks so similar. Your Check Point SE should be able to get access to those notes.
