Re: HA with VSX cluster

DR_74 · ‎2024-11-08

Hello,

We have a VSX gateway cluster. Let's say we run in HA mode (not VSLS), so all VSs are active on FW1.

I would like to know if in this architecture we can face a Split Brain scenario?

If we lose sw1 or sw2 => we lose the SYNC interface ==> does it mean all FWs will become active? ==> SPLIT BRAIN

If this is the expected behaviour when we lose switch 1 or 2, what can be done to avoid the split brain?

- Is VSLS an option?

- Moving SYNC on sw3 and sw4?

- A bond for SYNC (if possible) (linked to sw1/sw3 and sw2/sw4)?

Thank you

Chris_Atkinson · ‎2024-11-08

HA is now considered a subset of VSLS

On the surface simply moving Sync to sw3 & sw4 isn't helpful.

Sync bond is the most resilient option where available.

CCSM R77/R80/ELITE

DR_74 · ‎2024-11-08

Hello Chris,

So the design as it is, is susceptible to Split Brain when we lose the Sync interface? Correct?

Bob_Zimmerman · ‎2024-11-08

You would also have to lose the interconnectivity for all of the monitored interfaces (by default, the highest and lowest VLAN on each physical interface in each VS). The cluster heartbeats include member status and interface information on all interfaces. Essentially, the cluster members would need to be unable to reach each other on any interface to reliably cause both members to go active.
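To make that condition concrete, here is a minimal Python sketch of the rule as described above. It is a toy model, not ClusterXL's actual logic: the function name and the interface names are hypothetical, and the real protocol weighs more than heartbeat reachability.

```python
def both_members_go_active(heartbeat_ok_by_interface):
    """Toy model: True only if heartbeats fail on EVERY monitored interface.

    heartbeat_ok_by_interface maps a (hypothetical) monitored interface
    name to whether the peer's heartbeats are still seen on it.
    """
    # One surviving path is enough for the members to see each other.
    return not any(heartbeat_ok_by_interface.values())


# Sync is down, but a monitored VLAN still carries heartbeats.
sync_only_lost = {"Sync": False, "bond1.10": True, "bond1.20": True}

# Every monitored interface has lost its peer.
everything_lost = {"Sync": False, "bond1.10": False, "bond1.20": False}
```

In this model, `both_members_go_active(sync_only_lost)` is false and `both_members_go_active(everything_lost)` is true, which matches the point that losing Sync alone is not sufficient.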

DR_74 · ‎2024-11-08

So in case sw1 is down, this means that the cluster members should be able to see each other via their other interfaces, and so no split brain?

Duane_Toler · ‎2024-11-11

A few points:

1) As PhoneBoy said, VSX R81.10 and up only use VSLS

2) You can still assign all VSes to a single gateway if you really want it that way

3) Keep in mind which networking layer you're asking about to predict what state will occur.

If you lose layer1 on FW1, via outage of SW1, that cluster member will know it is dead and unable to function, so it will fail itself. The cluster protocol will monitor all of these interfaces. FW2 will go to "ACTIVE Attention" (meaning it's active, but its peer is not). Likewise for SW2 and FW2.

If you lose layer2 in some fashion on SW1 (because link between SW1 and SW2 died, AND link between SW1 and SW3 died, AND link between SW3 and SW4 died, ... OR someone makes a misconfiguration and breaks spanning-tree), then you might have a split-brain function, but probably not. The cluster protocol monitors more than just interface status and its peer, so each cluster member will be able to make a reasonable determination if it can or can't pass traffic. If not, it will fail itself (also because RouteD will lose routes in the FIB; RouteD is a monitored operation). One member will always remain active if it loses its peer. The standby peer won't go active if it also can't monitor other hosts on the interfaces (this is the Interface Active Check operation).

The only way to really end up with a split-brain function is to disconnect both the SW1-SW2 link and the SW3-SW4 link, but depending on where your next-hop gateways are for each network, or any directly-attached hosts, this still might not happen.
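As a rough illustration of that reasoning, the sketch below checks whether FW1 and FW2 still have any path to each other after a failure. The topology is an assumption, since the original diagram is not reproduced here: FW1 is taken to attach to SW1 and SW3, FW2 to SW2 and SW4, with inter-switch links SW1-SW2, SW3-SW4 and SW1-SW3. It also treats the network as one flat graph and ignores VLANs, spanning-tree and bonding, so read it as a thought aid only.

```python
from collections import deque

# ASSUMED topology; adjust to match the real cabling.
ASSUMED_LINKS = [
    ("FW1", "SW1"), ("FW1", "SW3"),
    ("FW2", "SW2"), ("FW2", "SW4"),
    ("SW1", "SW2"), ("SW3", "SW4"), ("SW1", "SW3"),
]


def members_can_reach(links, failed_links=(), failed_nodes=(),
                      src="FW1", dst="FW2"):
    """Breadth-first search for any surviving path between two members."""
    failed = {frozenset(link) for link in failed_links}
    down = set(failed_nodes)

    # Build an undirected adjacency map from the links that are still up.
    adjacency = {}
    for a, b in links:
        if frozenset((a, b)) in failed or a in down or b in down:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False
```

Under these assumptions, `members_can_reach(ASSUMED_LINKS, failed_nodes=["SW1"])` still finds a path via SW3 and SW4, while `members_can_reach(ASSUMED_LINKS, failed_links=[("SW1", "SW2"), ("SW3", "SW4")])` does not, which is the double-failure case described above.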

Although not directly configurable in SmartConsole, VSX clusters also use a priority-based member status to control which one becomes active. If you have a path from FW1 to FW2 in any sort of way, such that FW1 and FW2 can see at least one interface of its peer in some manner, they will know if one or the other is in a workable state. A split-brain function is incredibly hard to encounter unless you're trying to do something intentionally in a lab.

Separately, you have a port-channel spread across 2 distinct switches, but you haven't indicated if these are using a shared control plane (virtual stacking, or whatever). If they don't have a shared control plane, then by definition you can't use LACP. If you're using active-backup as your bond, then this will work but sub-optimally (especially if SW1 is the active member interface). If you are using bond mode active-backup, this will also feed into the cluster state to determine if it's workable or not.

Not sure if this diagram is an academic exercise or what, but it's a very bad installation to have. If you're just giving us all a big proficiency test, then kudos! 😁 Otherwise, you should make some plans to revise this to be more suitable for a cluster configuration and increase your resiliency.

PhoneBoy · ‎2024-11-08

From R81.10, new VSX clusters must use VSLS.
I assume VSX is similar to traditional ClusterXL where bonded sync links are generally considered best practice.

AkosBakos · ‎2024-11-09

Hi @DR_74

Sync Redundancy
The use of more than one physical synchronization interface (1st sync, 2nd sync, 3rd sync) for synchronization redundancy is not supported. For synchronization redundancy, you can use bond interfaces.

Here is the guide:

https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_ClusterXL_AdminGuide/Topics-...

One physical link for SYNC is not enough from my point of view nowadays.

Akos

----------------
\m/_(>_<)_\m/
