Ari_who
Explorer

Using HA MGMT in a virtual environment

Hi all,

 

OS: Gaia R81.20

Environment: Maestro + VSX 

 

We have two management servers running as an active-passive HA pair. Both are VMs running on vCenter.

The vCenter servers are in two physically separate locations: one is the data center and the other is the DR site.

In case of failover to DR, the entire vCenter environment will be available there, including the Check Point Management server, as it is replicated continuously as a hot backup.

Since there's plenty of redundancy through vCenter, is there any point in also having a secondary Management server in this case?

Or did I miss something...

 

Thanks in advance!

16 Replies
Chris_Atkinson
MVP Gold CHKP

It entirely depends on the types of failures that you are attempting to guard against and what interdependencies / risks you choose to accept.

  • With VSX there is greater importance on the Management than other deployment scenarios.
  • VMotion is not supported by the Check Point Management platform.
  • Would different/additional teams be involved in any recovery efforts?
  • Are the machines currently in different IP subnets from a routing perspective?   

 

 

 

CCSM R77/R80/ELITE
Ari_who
Explorer

Thank you Chris!

1. Could I please trouble you to elaborate about why with VSX there's a greater importance for redundancy than in other environments? 

2. We're using SRM, not vMotion, so I don't think that should be a problem.

3. There are a couple of other teams, such as DevOps and SysOps, who will be involved in a disaster recovery.

4. The machines are all in the same dedicated subnet.

 

Thanks again!

Bob_Zimmerman
MVP Gold

Many types of changes on VSX systems must be done from the management server and pushed down to the firewall. This includes most changes to logical interfaces (building a new one, removing one, changing the VLAN, changing the IP or mask, etc.), and all changes to static routing. If your disaster involves things being unable to reach each other, fixing it could require interface or routing changes.

I've personally seen way too many situations where the VM environment broke catastrophically, and we needed to make extensive firewall changes to fix it. As a result, I don't trust having everything under one hypervisor management system, including vCenter. Since admins have to go through my firewalls to get to the hypervisor management system, I run management HA in VMs managed by two totally separate hypervisor management systems, and I personally consider anything less an existential threat to the environment.

the_rock
MVP Gold

Personally, and again this is just my honest opinion, I would never bother with mgmt HA in such a scenario, because if there is constant replication on the vCenter side, you don't really need another server.

Just my 2 cents.

Best,
Andy
Vincent_Bacher
Advisor

Out of curiosity: What is MVP ?

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite
the_rock
MVP Gold

I know in sports it stands for Most Valuable Player, but I believe in the community context it means Most Valuable Professional... I THINK :-)

Best,
Andy
Ari_who
Explorer

Thank you!

Just wanted to see if I'm missing anything...

Vincent_Bacher
Advisor

If I understand the question correctly, you are asking whether a second management server is required for each data center?
Then my answer is: No.
We also have only one MDS per data center.

Ari_who
Explorer

Not quite..

It's one domain. One Production datacenter, and one Backup datacenter.

There's full and constant replication between both sites.

The question is whether we need one virtual MGMT server at each site, or whether one at the production site is enough, since it is fully backed up at the DR site.

Lesley
MVP Gold

One thing to keep in mind is the CRL check. The default is 24 hours. This applies only to VPN tunnels from Check Point gateways towards other Check Point gateways on the same mgmt! If mgmt is down too long, the firewalls cannot do the CRL check. (The CRL check can be disabled, but that is not secure.)

HA mgmt could be handy if you have frequent changes on the system. If the system is allowed to be down for a couple of hours, I would not invest in HA mgmt.

-------
Please press "Accept as Solution" if my post solved it 🙂
Ari_who
Explorer

This is very interesting, and indeed I did not think about it.

Won't all CRL info be replicated to the backup machine?

If there's a loss of connectivity between the tunnel peers for 24 hours, then we're in pretty bad shape as it is...

Duane_Toler
MVP Silver

Agree with @Bob_Zimmerman.  You need to review your RPO/RTO policies.  How long is Site A "down" before you declare "DR Event"?  Then when "DR Event" is declared, how long will you need to get basic networking services online?  How much is currently offline, and how long before basic network services are online?  During the state change are any firewall/VPN/routing changes required?

Plus, having the HA mgmt in Site B allows you to do general maintenance on Site A without worries, or do your dry(-ish) run exercises. Just because a CRL check is "every 24 hours", keep in mind that the last CRL check was not "24 hours ago from this moment in time."  The last CRL check was (for example) 14 hours ago.  You only have 10 hours remaining before that next CRL check!  Don't go with "ok, we got 24 hours; good enough".  This is a common fallacy.  Nope, you're already 14 or 19 or 23 hours into that last retry.
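To put numbers on that, here is a minimal Python sketch of the arithmetic. The 24-hour interval is the default mentioned earlier in this thread; the function name and the timestamps are made up for illustration and are not anything the gateway exposes.

```python
from datetime import datetime, timedelta


def time_until_next_crl_check(last_check: datetime, now: datetime,
                              interval: timedelta = timedelta(hours=24)) -> timedelta:
    """Return how long remains before the next scheduled CRL check.

    A result of zero means the check is already due.
    """
    elapsed = now - last_check
    remaining = interval - elapsed
    # Never report a negative window: once the interval has passed, we are at zero.
    return max(remaining, timedelta(0))


# Illustrative timestamps: management fails 14 hours after the last check.
last_check = datetime(2024, 1, 1, 0, 0)
outage_start = last_check + timedelta(hours=14)

print(time_until_next_crl_check(last_check, outage_start))  # 10:00:00
```

The point is that the window is measured from the last successful check, not from the moment the management server went away.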

All of these tiny details are always overlooked when people do "DR planning".  I see it ALL. THE. TIME.  No one ever understands what the "D" in "DR" is.... until it happens.  You need to plan on this with the expectation that your management server has vanished and is unrecoverable.  Thanos just snapped it out of existence.  Now what are you going to do?  Has your SAN or SRM been Thanos-snapped, too?  You need to plan as if an asymmetric 50% of your infrastructure disappeared.

Tabletop exercises are great, and each time you do, you need to use different Choose Your Own Adventure paths.  I absolutely positively would not rely on vCenter to be your DR plan for the things that are responsible for your network and perimeter OAM services.  vCenter requires ESX, and SAN, and iSCSI connectors via fabric connectors (be it Ethernet or FibreChannel or whatever).  If you have an entire vCenter/ESX/SAN stack in Site B, that's fine.   Just don't plan to "move Site A to Site B during DR event" (had a customer try that... they underestimated).

You said you had a "hot backup" and that is FANTASTIC! 👏👏  I always recommend having OAM things be hot in Site B.  Even if you don't need a policy change, you will have your logs!  You will have your visibility, and when everyone comes screaming at you about "The Firewall", you have logs to prove "nope, it isn't me."  Even better, you will have a jump host to SSH to your firewalls in Site B. You have the BGP routes to the ISP and local LANs.  You'll have your backup local VPN user, too. You have access, you have your things at the ready.  Everyone is coming to you for those logs or to troubleshoot routing, VPN, etc.  But you can play it cool, because you had a hot management server at the ready.  8)  Let the server team scramble and fall over themselves trying to figure out why they had a misconfigured VLAN on vCenter. 🫣

 

 

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack
the_rock
MVP Gold

What a great explanation, @Duane_Toler!

Best,
Andy
Bob_Zimmerman
MVP Gold

For that matter, what even gets detected as an outage? I had one a while ago in which a SAN filled up and all the VMs acted like their drives had been pulled. They were still up and on the network! They could respond to ping. If you sent them a SYN, they would reply with a SYN-ACK. Even stuff in the RAMdisk image worked! Nothing else did, though.

Remote access to the environment depended on VMs which were stored on the SAN. The DR VPN boxes were pointed at the primary DC's authentication servers and wouldn't switch to their local authentication while those servers still responded to ping. We had to get somebody physically into the datacenter to plug into some routers and add blackhole routes before any of the server or VM admins could log in at all to even start figuring out what was happening.

If the DR environment had been a full copy of production, it would have had a similar SAN which would have filled up at the same time (or very shortly after). Trying to bring up a copy of the management in the DR environment would have failed due to the nature of what had gone wrong.
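A rough Python sketch of why ping-only monitoring misses this kind of failure. Everything here is hypothetical: the probe, the state names, and the idea that the probe result is exported to the monitoring system are assumptions for illustration, not a feature of any product mentioned in this thread.

```python
import os
import tempfile


def storage_probe(directory: str) -> bool:
    """Write, sync, and read back a small file; return False on any I/O error."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())  # force the write down to the backing storage
            f.seek(0)
            return f.read() == b"probe"
    except OSError:
        return False


def classify(ping_ok: bool, tcp_ok: bool, app_ok: bool) -> str:
    """Combine probe results into a coarse health state."""
    if app_ok:
        return "healthy"
    if ping_ok or tcp_ok:
        # Answers on the network but cannot do real work: the case described above.
        return "degraded"
    return "down"


print(classify(ping_ok=True, tcp_ok=True, app_ok=False))  # degraded
```

A failover decision keyed only on `ping_ok` would have treated those VMs as alive; keying it on an application-level result such as `storage_probe` is one way to catch the difference.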

Duane_Toler
MVP Silver

Yes! Exactly!  "What classifies as a DR event?"  Lots of monitoring is needed for all the little things, too.  Sadly, we always miss something (sigh, "humans").  Indeed, what is DR-worthy and what is just an "inconvenience"?

I can't quite recall the exact circumstances, (yes it involved "firewall" in some way) but I had a customer having some issue at one point and they started getting twitchy and asked me [just a vendor/consultant!] "do we need to declare DR?"   Uh, as an "outsider" I never expected that I had to make that call on their behalf!   I recall that we didn't do that, but it was certainly an experience.   The issue got resolved and all was well. It certainly made me think that exact question... "what is indeed DR worthy?"

 

Duane_Toler
MVP Silver

Here's an SK that gives more reasons to why you want management HA if you're doing DR-type things:

https://support.checkpoint.com/results/sk/sk100731

 
Gateway tries to fetch the CRL from the first Security Management server that responds. By default only the IP address of the primary Security Management server is written in that file.

CRL fetching fails because the gateway tries to fetch CRL from the primary Security Management server that is down.
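The behaviour quoted above can be pictured with a small Python sketch. This is only a model of "try the listed servers in order": the function, the fake fetcher, and the addresses (taken from the RFC 5737 documentation range) are all hypothetical and say nothing about how the gateway implements it or where the list is stored.

```python
from typing import Callable, Optional, Sequence, Tuple


def fetch_crl(servers: Sequence[str],
              fetch: Callable[[str], bytes]) -> Optional[Tuple[str, bytes]]:
    """Return (server, crl) from the first server that answers, else None."""
    for server in servers:
        try:
            return server, fetch(server)
        except OSError:
            continue  # unreachable or refused: try the next server in the list
    return None


# Hypothetical addresses and a stand-in fetcher for the demonstration.
PRIMARY, SECONDARY = "192.0.2.10", "192.0.2.20"


def fake_fetch(server: str) -> bytes:
    if server == PRIMARY:
        raise ConnectionError("primary management is down")
    return b"crl-data"


print(fetch_crl([PRIMARY], fake_fetch))             # None
print(fetch_crl([PRIMARY, SECONDARY], fake_fetch))  # ('192.0.2.20', b'crl-data')
```

With only the primary in the list, a dead primary means no CRL at all; a second entry only helps if a second management server actually exists to answer.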

 

 

