Deki
Explorer

Check my Logic - clustering when virtualized

It seems simple, but I am curious to see what the veterans here prefer and why.

In the past we have ALWAYS deployed clusters when serving critical infrastructure. Clustering provides fault tolerance against hardware failure as well as seamless patching, which is also the reason why we only deploy in HA and not LS (Load Sharing) mode.

We have a large private cloud environment that will have a section segmented off, a network within a network: 30+ servers and approximately 2,000 users accessing services. Nothing bandwidth intensive, but availability is critical. This will be duplicated across two DCs for production and hot-standby. In the past this would have meant two clusters and four gateways total, but since we are virtualizing everything on top of VMware ESXi, the hardware redundancy aspect becomes moot. At that point I am buying licensing for additional cores just to be able to patch without a maintenance window.

What am I missing? Are there other considerations or issues you've run up against with vSEC? 

Thank you

 

6 Replies
Jeff_Engel
Employee

Apologies for the slow reply!  Hardware redundancy is moot because you plan to install both cluster members on the same ESXi host?

Bob_Zimmerman
Authority

I wouldn't say clustering is just for patching. It's also for immediate fault tolerance. With a single firewall VM, if the host running it tanks for whatever reason, vCenter can restart it on another host, but it generally takes about 90 seconds before the VM is able to pass traffic again, and potentially much longer if vCenter is trying to start other VMs at the same time (storage contention is painful).

If you can tolerate an outage like that, cool. If you can't, you would need either VMware Fault Tolerance (which limits you to two vCPUs or an unbelievably expensive license) or a cluster of VMs with DRS anti-affinity rules to ensure the members run on separate hosts.
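
For anyone wanting to script that separation, here is a minimal pyVmomi sketch of such a DRS anti-affinity rule; the vCenter address, credentials, cluster name, and VM names are placeholder assumptions, not values from this thread:

    # Sketch: pin two firewall cluster members to separate ESXi hosts via a DRS
    # anti-affinity rule. All names and credentials below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def find_by_name(content, vim_type, name):
        """Return the first inventory object of the given type with the given name."""
        view = content.viewManager.CreateContainerView(content.rootFolder, [vim_type], True)
        return next(obj for obj in view.view if obj.name == name)

    ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    content = si.RetrieveContent()

    cluster = find_by_name(content, vim.ClusterComputeResource, "DC1-Compute")
    fw_a = find_by_name(content, vim.VirtualMachine, "fw-member-a")
    fw_b = find_by_name(content, vim.VirtualMachine, "fw-member-b")

    # Anti-affinity rule: DRS keeps the listed VMs on different hosts.
    rule = vim.cluster.AntiAffinityRuleSpec(name="fw-members-separate-hosts",
                                            enabled=True,
                                            mandatory=True,  # hard rule, not a preference
                                            vm=[fw_a, fw_b])
    spec = vim.cluster.ConfigSpecEx(rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
    cluster.ReconfigureComputeResource_Task(spec, modify=True)

    Disconnect(si)

Marking the rule mandatory makes DRS treat it as a hard constraint rather than a preference. It only addresses host failure, not the shared-storage scenario discussed further down.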

Jeff_Engel
Employee

Agreed. We also support the cluster members being on separate ESXi hosts, which provides the traditional hardware redundancy.

I asked the somewhat leading question in case I was missing something in the design/architecture.

Bob_Zimmerman
Authority

It's also worth noting that even things like VMware Fault Tolerance can't protect you against certain faults. One time, a coworker removed a LUN from the SAN which he was sure was not being used, and pink-screened 19 of the 20 ESXi hosts in the cluster. VMware Fault Tolerance has certain requirements (namely, all the hosts must be able to access the same storage) which mean it can't defend against that kind of problem. Only two totally separate VMs on totally separate storage systems would protect against that.

Incidentally, vCenter tried to start all of the VMs from the other 19 hosts on the one remaining host, all at the same time. Something like 600 VMs were all fighting for RAM and storage access. Even SSDs grind to a halt under that kind of contention.

Jeff_Engel
Employee

Amen

the_rock
Legend

Bob described it perfectly.

Andy

