Solved: Using HA MGMT in a virtual environment

Ari_who · ‎2025-10-05

Hi all,

OS: Gaia R81.20

Environment: Maestro + VSX

We have two management servers running as an active-passive HA, and both are VMs, running on a vCenter.

The vCenter servers are in two physically separate locations, one is a DataCenter, and one, a DR.

In case of failover to the DR, the entire vCenter will be available there, including the Check Point Management server, as it's being replicated all the time in a hot backup.

Since there's plenty of redundancy through the vCenter, is there any point in also having a secondary Management server in this case?

Or did I miss something...

Thanks in advance!

Chris_Atkinson · ‎2025-10-06

It entirely depends on the types of failures that you are attempting to guard against and what interdependencies / risks you choose to accept.

With VSX there is greater importance on the Management than other deployment scenarios.
VMotion is not supported by the Check Point Management platform.
Would different/additional teams be involved in any recovery efforts?
Are the machines currently in different IP subnets from a routing perspective?

CCSM R77/R80/ELITE

Ari_who · ‎2025-10-20

Thank you Chris!

1. Could I please trouble you to elaborate about why with VSX there's a greater importance for redundancy than in other environments?

2. We're using SRM, not vMotion, so I don't think that should be a problem.

3. There are a couple of other teams, such as DevOps and SysOps, who will be involved in a Disaster Recovery.

4. The machines are all in the same dedicated subnet.

Thanks again!

Bob_Zimmerman · ‎2025-10-20

Many types of changes on VSX systems must be done from the management server and pushed down to the firewall. This includes most changes to logical interfaces (building a new one, removing one, changing the VLAN, changing the IP or mask, etc.), and all changes to static routing. If your disaster involves things being unable to reach each other, fixing it could require interface or routing changes.

I've personally seen way too many situations where the VM environment broke catastrophically, and we needed to make extensive firewall changes to fix it. As a result, I don't trust stuff all under one hypervisor management system, including vCenter. Since admins have to go through my firewalls to get to the hypervisor management system, I run management HA in VMs managed by two totally separate hypervisor managements, and I personally consider anything less an existential threat to the environment.

the_rock · ‎2025-10-06

To me personally, but again it's just my honest opinion, I would never bother with mgmt HA in such a scenario, because if there is constant replication on the vCenter side, you don't really have a need for another server.

Just my 2 cents.

Andy

Best,
Andy

Vincent_Bacher · ‎2025-10-06

Out of curiosity: What is MVP ?

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite

the_rock · ‎2025-10-06

I know in sports it stands for most valuable player, but I believe in community context it means most valuable professional...I THINK : - )

Andy

Best,
Andy

Ari_who · ‎2025-10-20

Thank you!

Just wanted to see if I'm missing anything...

Vincent_Bacher · ‎2025-10-06

If I understand the question correctly, are you asking whether a second management server is required for each data center?
Then my answer is: No.
We also only have one MDS per data center.

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite

Ari_who · ‎2025-10-20

Not quite..

It's one domain. One Production datacenter, and one Backup datacenter.

There's full and constant replication between both sites.

The question is whether we need one virtual MGMT server in each site, or whether one at the production site is enough since it is fully backed up at the DR site.

Lesley · ‎2025-10-06

One thing to keep in mind is the CRL check. Default is 24 hours. This applies only to VPN tunnels between Check Point gateways managed by the same mgmt! If mgmt is down too long, the firewalls cannot do the CRL check. (The CRL check can be disabled, but that is not secure.)

HA mgmt could be handy if you have frequent changes on the system. If the system is allowed to be down for a couple of hours, I would not invest in HA mgmt.

-------
Please press "Accept as Solution" if my post solved it 🙂

Ari_who · ‎2025-10-20

This is very interesting, and indeed I did not think about it.

Won't all CRL info be replicated to the backup machine?

If there's a loss of connectivity between the tunnel peers for 24 hours, then we're in pretty bad shape as it is...

Duane_Toler · ‎2025-10-20

Agree with @Bob_Zimmerman. You need to review your RPO/RTO policies. How long is Site A "down" before you declare "DR Event"? Then when "DR Event" is declared, how long will you need to get basic networking services online? How much is currently offline, and how long before basic network services are online? During the state change are any firewall/VPN/routing changes required?

Plus, having the HA mgmt in Site B allows you to do general maintenance on Site A without worries, or do your dry(-ish) run exercises. Just because a CRL check is "every 24 hours", keep in mind that the last CRL check was not "24 hours ago from this moment in time." The last CRL check was (for example) 14 hours ago. You only have 10 hours remaining before that next CRL check! Don't go with "ok, we got 24 hours; good enough". This is a common fallacy. Nope, you're already 14 or 19 or 23 hours into that last retry.
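To put numbers on that, here is a minimal Python sketch. It assumes the 24-hour default mentioned in this thread and uses made-up timestamps; the function name `crl_time_remaining` is purely illustrative and not anything Check Point exposes.

```python
from datetime import datetime, timedelta

def crl_time_remaining(last_fetch, outage_start, validity=timedelta(hours=24)):
    """Time left between the start of the outage and the next CRL check.

    Assumption: the next check is due exactly `validity` after the last
    successful fetch. Never returns a negative duration.
    """
    next_check = last_fetch + validity
    return max(next_check - outage_start, timedelta(0))

# Hypothetical timeline: last successful fetch at midnight,
# management server lost 14 hours later.
last_fetch = datetime(2025, 10, 20, 0, 0)
outage_start = datetime(2025, 10, 20, 14, 0)

print(crl_time_remaining(last_fetch, outage_start))  # 10:00:00
```

The budget that matters for DR planning is the remainder at the moment of failure, not the full interval.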

All of these tiny details are always overlooked when people do "DR planning". I see it ALL. THE. TIME. No one ever understands what the "D" in "DR" is.... until it happens. You need to plan on this with the expectation that your management server has vanished and is unrecoverable. Thanos just snapped it out of existence. Now what are you going to do? Has your SAN or SRM been Thanos-snapped, too? You need to plan as if an asymmetric 50% of your infrastructure disappeared.

Tabletop exercises are great, and each time you do, you need to use different Choose Your Own Adventure paths. I absolutely positively would not rely on vCenter to be your DR plan for the things that are responsible for your network and perimeter OAM services. vCenter requires ESX, and SAN, and iSCSI connectors via fabric connectors (be it Ethernet or FibreChannel or whatever). If you have an entire vCenter/ESX/SAN stack in Site B, that's fine. Just don't plan to "move Site A to Site B during DR event" (had a customer try that... they underestimated).

You said you had a "hot backup" and that is FANTASTIC! 👏👏 I always recommend having OAM things be hot in Site B. Even if you don't need a policy change, you will have your logs! You will have your visibility, and when everyone comes screaming at you about "The Firewall", you have logs to prove "nope, it isn't me." Even better, you will have a jump host to SSH to your firewalls in Site B. You have the BGP routes to the ISP and local LANs. You'll have your backup local VPN user, too. You have access, you have your things at the ready. Everyone is coming to you for those logs or to troubleshoot routing, VPN, etc. But you can play it cool, because you had a hot management server at the ready. 8) Let the server team scramble and fall over themselves trying to figure out why they had a misconfigured VLAN on vCenter. 🫣

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

the_rock · ‎2025-10-20

What a great explanation, @Duane_Toler

Best,
Andy

Bob_Zimmerman · ‎2025-10-20

For that matter, what even gets detected as an outage? I had one a while ago in which a SAN filled up and all the VMs acted like their drives had been pulled. They were still up and on the network! They could respond to ping. If you sent them a SYN, they would reply with a SYN-ACK. Even stuff in the RAMdisk image worked! Nothing else did, though.

Remote access to the environment depended on VMs which were stored on the SAN. The DR VPN boxes were pointed at the primary DC's authentication servers and wouldn't switch to its local authentication while they responded to ping. We had to get somebody physically into the datacenter to plug into some routers and add blackhole routes before any of the server or VM admins could log in at all to even start figuring out what was happening.
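A rough Python sketch of why "responds to ping" is a poor failover trigger. Everything here is hypothetical (the probe names and the `should_fail_over` helper are made up for illustration); it only models the idea that an application-level check should decide, not a network-level one.

```python
def classify(probes):
    """probes: dict of probe name -> bool. Returns 'up', 'degraded', or 'down'."""
    if all(probes.values()):
        return "up"
    if any(probes.values()):
        return "degraded"
    return "down"

def should_fail_over(probes, required=("auth_login",)):
    """Fail over when any required application-level probe fails,
    even if network-level probes such as ping still succeed."""
    return not all(probes.get(name, False) for name in required)

# The "zombie VM" case: network stack alive, storage gone.
zombie = {"ping": True, "tcp_syn_ack": True, "auth_login": False}

print(classify(zombie))          # degraded
print(should_fail_over(zombie))  # True
```

A box that only watches the `ping` probe would see this host as healthy and never switch to its local authentication.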

If the DR environment had been a full copy of production, it would have had a similar SAN which would have filled up at the same time (or very shortly after). Trying to bring up a copy of the management in the DR environment would have failed due to the nature of what had gone wrong.

Duane_Toler · ‎2025-10-20

Yes! Exactly! "What classifies as a DR event?". Lots of monitoring needed for all the little things, too. Sadly, we always miss something (sigh, "humans"). Indeed, what is DR-worthy and what is just an "inconvenience".

I can't quite recall the exact circumstances, (yes it involved "firewall" in some way) but I had a customer having some issue at one point and they started getting twitchy and asked me [just a vendor/consultant!] "do we need to declare DR?" Uh, as an "outsider" I never expected that I had to make that call on their behalf! I recall that we didn't do that, but it was certainly an experience. The issue got resolved and all was well. It certainly made me think that exact question... "what is indeed DR worthy?"

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

Duane_Toler · ‎2025-10-21

Here's an SK that gives more reasons why you want management HA if you're doing DR-type things:

https://support.checkpoint.com/results/sk/sk100731

Gateway tries to fetch the CRL from the first Security Management server that responds. By default only the IP address of the primary Security Management server is written in that file.

CRL fetching fails because the gateway tries to fetch CRL from the primary Security Management server that is down.
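As a toy model of the behaviour quoted above (not the actual gateway logic, and the addresses are made-up documentation IPs), the difference between listing only the primary and listing both servers looks roughly like this:

```python
def fetch_crl(candidates, reachable):
    """Return the first listed management server that responds, or None.

    candidates: ordered list of server IPs the gateway knows about.
    reachable:  set of server IPs that currently respond.
    """
    for server in candidates:
        if server in reachable:
            return server
    return None

primary, secondary = "192.0.2.10", "198.51.100.10"  # hypothetical addresses
reachable = {secondary}  # primary site is down

print(fetch_crl([primary], reachable))             # None
print(fetch_crl([primary, secondary], reachable))  # 198.51.100.10
```

With only the primary listed, the fetch has nowhere to fall back to, which is the failure the SK describes.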

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack
