Gateway Cluster Failover procedure

Scott_Paisley · ‎2021-10-17

Hi

Years ago we had a third-party support vendor managing our Check Point firewalls, and they used to fail over traffic in a cluster by downing one of the members.

On a support call with Check Point about another issue, the engineer advised us not to do that, but instead to change the member priority in the cluster object on the management server and install policy. We have been using that process ever since.

On another support call recently, a new set of Check Point engineers told us that this process is not recommended for cluster failover.

What method do you all use, and is there a documented recommended process we should be following?

Thanks

Chris_Atkinson · ‎2021-10-17

Each has its merits depending on the scenario you're working in. What's the context of the event warranting the failover?

CCSM R77/R80/ELITE

Scott_Paisley · ‎2021-10-17

We use it in all cases where we want traffic to use the other box in a cluster, either for maintenance or troubleshooting

Tal_Paz-Fridman · ‎2021-10-17

Hi

I don't know if this is what you are looking for, but the official documentation has a section on How to Initiate Cluster Failover:

https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_ClusterXL_AdminGuide/Topics-...

It also leads to a Best Practice SK:

Best Practices - Manual fail-over in ClusterXL

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

Scott_Paisley · ‎2021-10-17

Thanks

That matches what we heard from support engineers this week

I guess my other question now is: were we the only people who changed priority and pushed policy as a means of moving traffic between cluster members?

Chris_Atkinson · ‎2021-10-17

It can depend on the resolver groups involved; in some organisations not everyone will have direct CLI access to a firewall cluster member.

The other method also needs an extra step to return the node to a standby state as you would be aware.

CCSM R77/R80/ELITE

rrbranco · ‎2021-10-19

I guess it would also be worth taking the cluster type into consideration, for instance VRRP or ClusterXL.

the_rock · ‎2021-10-17

I would agree 100% with what @Tal_Paz-Fridman said in this thread. I have been doing it the same way for years and never had an issue. Essentially, if you run clusterXL_admin down on the current active member, it will become standby, and when you run clusterXL_admin up, it will still stay standby, so in my opinion this is definitely the safest way of doing a failover. I'm not sure what the different engineers told you, but I am positive this is the recommended Check Point process.
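
For reference, the sequence described above could be scripted roughly as follows. This is a minimal sketch, not the official procedure: the run wrapper, the DRY_RUN variable and the two function names are hypothetical helpers added for illustration, and only clusterXL_admin and cphaprob state are actual Check Point commands. With DRY_RUN=1 (the default here) it only prints what it would do.

```shell
#!/bin/bash
# Sketch of a manual ClusterXL failover, run on the member that is currently active.
# DRY_RUN, run, fail_over_from_this_member and return_this_member are
# illustrative names, not part of any Check Point tooling.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Move traffic away from this member.
fail_over_from_this_member() {
    run cphaprob state        # check which member is active before starting
    run clusterXL_admin down  # mark this member as administratively down
    run cphaprob state        # this member should no longer be active
}

# Bring this member back into the cluster after the work is done.
return_this_member() {
    run clusterXL_admin up    # clear the administrative down state
    run cphaprob state        # confirm the resulting member states
}
```

On a real member you would call fail_over_from_this_member with DRY_RUN=0 from Expert mode, and return_this_member once the work is finished. Whether the member takes the active role back at that point may depend on the recovery setting configured in the cluster object.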

Best,
Andy

Magnus-Holmberg · ‎2021-10-17

How does it behave if you are using "switch to higher priority cluster member" on member recovery? When you run clusterXL_admin up, I think it switches over at that time (not 100% sure).

https://www.youtube.com/c/MagnusHolmberg-NetSec

Scott_Paisley · ‎2021-10-17

Thanks for all the replies. They prompted another question.

If I admin down one box in the cluster, what happens if the other box suffers a failure?

Perhaps I should have specified circumstances when traffic might be on the other member for some period of time?

the_rock · ‎2021-10-17

That is a very unlikely scenario... I mean, what are the chances, if you were using, say, ISP redundancy and you downed one link to test the second one, that the second one would go down right after? That's literally less than a 1% possibility... highly unlikely.

Best,
Andy

Arne_Boettger · ‎2021-10-19

Well, it depends... clusterXL_admin down creates a failed cluster resource, making the cluster member "less healthy" than the other.

Small failures like interfaces going down on the other member should not make it active again, but if the other member fails completely (crashes or reboots), it still becomes active, in the "Active Attention" state.

the_rock · ‎2021-10-19

I know some people do cpstop as well, but personally, I'm not a big fan of doing it that way, since it removes the currently installed policy. I would definitely follow what @Tal_Paz-Fridman suggested. Besides, it is vendor recommended.

Best,
Andy

Tal_Paz-Fridman · ‎2021-10-19

You can run cpstop with specific flags that maintain the policy and connection table:

sk113045 https://supportcenter.checkpoint.com/supportcenter/portal?action=portlets.SearchResultMainAction&eve...

[Expert@HostName:0]# cpstop -fwflag -proc

Running this command will stop Check Point daemons and Security Servers, while maintaining the active Security Policy running in the Check Point kernel. Rules with generic Allow/Reject/Drop actions, based on services, will continue to work.

Or if you want to load the Default Filter:

[Expert@HostName:0]# cpstop -fwflag -default

Running this command will stop Check Point daemons and Security Servers. The active Security Policy running in the Check Point kernel will be replaced with the Default Filter policy.

Also see:

https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_CLI_ReferenceGuide/Topics-CL...

cpstop -fwflag -proc

Shuts down Check Point processes
Keeps the currently loaded kernel policy
Maintains the Connections table, so that after you run the cpstart command, you do not experience dropped packets because they are "out of state"
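
As a rough illustration of how these flags might be used around a maintenance task on one member, here is a sketch. It assumes traffic has already been moved to the peer member. The run wrapper, DRY_RUN and maintenance_window are hypothetical names added for illustration; cpstop -fwflag -proc and cpstart are the commands quoted above. With DRY_RUN=1 (the default here) it only prints what it would do.

```shell
#!/bin/bash
# Sketch: stop Check Point processes while keeping the kernel policy and the
# connections table, do some work, then start the processes again.
# DRY_RUN, run and maintenance_window are illustrative names only.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Usage: maintenance_window [command [args...]]
# The optional command stands in for whatever maintenance step is needed.
maintenance_window() {
    run cpstop -fwflag -proc    # stop processes, keep the loaded kernel policy
    if [ "$#" -gt 0 ]; then
        run "$@"                # the maintenance step itself
    fi
    run cpstart                 # start the Check Point processes again
}
```

For example, maintenance_window echo hello in dry-run mode prints the three steps in order without touching the system.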

the_rock · ‎2021-10-19

Thanks for that, I was not aware of the SK, but that's helpful!

Best,
Andy
