Nandhakumar_N
Participant

DR Scenario - ClusterXL Split Brain Mode

When we configure two firewalls as a ClusterXL cluster in Active/Standby mode, which scenarios can cause split brain?

Also, is there any case in which both nodes enter the Standby state?

 

0 Kudos
10 Replies
PhoneBoy
Admin

ClusterXL relies on Layer 2 connectivity between the cluster members on multiple interfaces.
If cluster members cannot reach each other on any interface, you can see "split brain" behavior.

I've never heard of both members entering standby state. 

0 Kudos
the_rock
MVP Diamond

If you send us the output of the commands below, it may help more.

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat
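
If it is easier to gather everything in one go, here is a minimal Python sketch that runs the commands listed above (with the fourth read as `cphaprob -l list`) and prints one report you can paste into a reply. It assumes Python 3.7 or later is available where you run it; the names `COMMANDS`, `run_command` and `collect` are just illustrative, not part of any Check Point tooling.

```python
import subprocess

# The commands from the post above; nothing else is assumed about cphaprob.
COMMANDS = [
    ["cphaprob", "state"],
    ["cphaprob", "-a", "if"],
    ["cphaprob", "-i", "list"],
    ["cphaprob", "-l", "list"],
    ["cphaprob", "syncstat"],
]


def run_command(argv):
    """Run one command and return its stdout and stderr, or an error note."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return f"[could not run: {exc}]"
    return result.stdout + result.stderr


def collect(commands=COMMANDS, runner=run_command):
    """Build one text report with a header line before each command's output."""
    sections = []
    for argv in commands:
        sections.append("### " + " ".join(argv))
        sections.append(runner(argv).rstrip())
    return "\n".join(sections) + "\n"


if __name__ == "__main__":
    print(collect(), end="")
```

Run it on each cluster member separately, since the members can report different views of the cluster.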

Best,
Andy
"Have a great day and if it's not, change it"
0 Kudos
Bob_Zimmerman
MVP Gold

ClusterXL member state is a bit more complicated than just Active/Standby. I discuss the more common ClusterXL member states in a post earlier this year.

It's not possible for all members of a cluster to be Standby. That state only happens if the cluster member is healthy, and there is at least one Active member.

It is possible for all members to be Down if they all think they have a critical problem. I mostly see this when people use a crossover cable for sync. Sending your sync traffic through a switch fixes that particular problem.

It's possible for multiple members to be Active if they lose all communications with the other members of the cluster. For example, this can happen if you connect them FW1---Switch1---Switch2---FW2 and you lose the link between Switch1 and Switch2. Since no member sees any other member of the cluster, they may all try to go Active. Connecting your firewalls to multiple switches (e.g., bond0 to one switch set and bond1 to a different switch set) with physically separate inter-switch links reduces the chance of this happening.
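
To make the topology argument concrete, here is a small Python sketch that treats the network as a graph and asks which members can no longer reach any other member. It is only a reachability model, not how ClusterXL itself decides anything; `Switch3` and `Switch4` are hypothetical names for the second, physically separate switch path.

```python
from collections import deque


def reachable(links, start):
    """Return every node reachable from start over undirected links (BFS)."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated_members(links, members):
    """Members that cannot reach any other member (candidates to all go Active)."""
    return [m for m in members
            if not any(o in reachable(links, m) for o in members if o != m)]


members = ["FW1", "FW2"]
single_path = [("FW1", "Switch1"), ("Switch1", "Switch2"), ("Switch2", "FW2")]
# Same topology after the inter-switch link fails.
broken = [link for link in single_path if link != ("Switch1", "Switch2")]
# Add a second path through a separate (hypothetical) switch pair.
dual_path = broken + [("FW1", "Switch3"), ("Switch3", "Switch4"), ("Switch4", "FW2")]

print(isolated_members(single_path, members))  # []
print(isolated_members(broken, members))       # ['FW1', 'FW2']
print(isolated_members(dual_path, members))    # []
```

With a single path, one failed inter-switch link isolates both members at once. With the second path in place, the same failure leaves them able to see each other.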

0 Kudos
Nandhakumar_N
Participant

When both members enter the Active state, how will traffic inspection behave?

We are using a crossover cable for sync. If the sync interface goes down or someone accidentally removes the sync cable, will that take production down completely? Why doesn't the Standby node take the Active role instead of going Down?

0 Kudos
Martijn
MVP Silver

Hi,

Check Point has several checks to see whether an appliance is capable of becoming a cluster member.

- Are the processes up?
- Are the interfaces up?
- Is there a policy installed?
- Is ClusterXL started?
- Is Sync interface up?

If you use a crossover cable for sync and it is removed, nothing will happen, in my experience. Both cluster members fail the Sync check, but all other checks are OK. Both members are equally degraded, and the cluster state (Active/Standby) is unchanged.

You could argue a fail-over can occur if the Standby member detects the Sync failure before the Active member does. But this will not result in a split-brain.

Martijn

0 Kudos
Bob_Zimmerman
MVP Gold

If both members go Active because inter-switch links have failed, each one will work for the part of the network that can still reach it but can't reach the other member. For example, if you have web server 1, database server 1, and firewall 1 connected to switch 1, and WS2, DB2, and FW2 connected to switch 2, and the link between switch 1 and switch 2 fails, both firewalls could conceivably pass their respective web-to-database traffic. Any network which can see both firewalls will see the VIP flap back and forth between the member MACs, so traffic through such interfaces won't work. Situations like this have very narrow requirements, so they are extremely rare outside of labs. In all real failures I've seen, one member goes Down and the other goes either Active or Active Attention.

 

Don't use a crossover for sync. While it mostly works, it's not one of the Supported Topologies for Synchronization Network in the ClusterXL Admin Guide.

When using a crossover cable, if the cable fails or the interface on either side fails, both members now have a problem. Each one has to figure out if its problem is fatal, so they start probing on all their monitored interfaces (by default, this is any non-VLAN interface with IP addresses, and the highest and lowest VLAN IDs on any interface with VLANs). If a member finds a monitored interface where it doesn't get any responses, it may go down. If both members find such an interface, they may both go down.
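
As a rough illustration of the default monitoring rule described above, here is a Python sketch that picks the monitored interfaces from a list of names. It assumes the input already contains only interfaces that have IP addresses, and that VLAN interfaces are named `parent.vlan_id` (for example `eth2.10`); the function name `default_monitored` is made up for this example and is not a ClusterXL command or API.

```python
def default_monitored(interfaces_with_ip):
    """Pick interfaces per the rule above: every non-VLAN interface, plus the
    lowest and highest VLAN ID on each parent interface that carries VLANs."""
    plain = [name for name in interfaces_with_ip if "." not in name]

    # Group VLAN IDs by parent interface (assumed naming: parent.vlan_id).
    vlan_ids = {}
    for name in interfaces_with_ip:
        if "." in name:
            parent, vlan = name.rsplit(".", 1)
            vlan_ids.setdefault(parent, []).append(int(vlan))

    edges = []
    for parent in sorted(vlan_ids):
        ids = vlan_ids[parent]
        # A set avoids listing the same VLAN twice when a parent has only one.
        for vid in sorted({min(ids), max(ids)}):
            edges.append(f"{parent}.{vid}")

    return sorted(plain) + edges


example = ["eth0", "eth1", "eth2.10", "eth2.20", "eth2.30", "bond1.5"]
print(default_monitored(example))
# ['eth0', 'eth1', 'bond1.5', 'eth2.10', 'eth2.30']
```

In this example `eth2.20` is not probed by default, so a failure that affects only that VLAN would not be noticed by this mechanism.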

It's fairly rare for an interface or cable to fail when both members are up, but rebooting one member while using a crossover for sync means the member you're not rebooting sees a problem. The remaining member will start probing, and it can hit the problem I described. It doesn't always happen, but rebooting one member can cause the whole cluster to stop passing traffic.

0 Kudos
Martijn
MVP Silver

Hi,

There is also a mechanism called local probing. When a cluster member does not receive CCP packets on an interface, it will start this mechanism by sending ARP packets on that interface to see if the problem is local. See:

sk171844 - How to troubleshoot the Critical Device "Local Probing" in ClusterXL

If it detects a local problem, it will 'leave' the cluster. If the problem is not local it will remain in the cluster, but the state depends on a lot of other factors.

The ClusterXL technology has a lot of features to check the state of a cluster, and I cannot remember seeing a split-brain situation in all my years of working with ClusterXL.

Martijn 


0 Kudos
PhoneBoy
Admin

I know of at least one customer that encountered split-brain in the FW1---Switch1---Switch2---FW2 configuration @Bob_Zimmerman mentioned.
It was also on Nokia appliances, which should give you an idea of how long ago it was. 

0 Kudos
Martijn
MVP Silver

Wasn't that VRRP, in which the IPSO OS was responsible for the Master/Slave status of the cluster member?

When running 'cphaprob stat', both members were Active, confusing a lot of TAC engineers 😉

0 Kudos
emmap
MVP Gold CHKP

In an Active/Standby CXL cluster, the only possible way to get split brain is when there is no connectivity between cluster members at all, that is, when all interfaces are isolated. In that case, you have bigger problems than a split-brain gateway cluster.

The cluster members don't just communicate over Sync; they are constantly broadcasting their cluster state to each other over all cluster interfaces using CCP packets. This means that even if the Sync connection is completely lost, the cluster will still not go split brain, because the members can see each other over the data interfaces. A cluster without sync will be Active Attention/Down. Both members will have a Sync failure problem note, but one member will remain Active so that traffic continues to flow.

There is no scenario where both members are in a Standby state. A cluster member will only enter a Standby state when it is fully healthy and it can see an Active cluster member. If a member is fully healthy and there is no Active member detected, it will set itself Active.
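
The rule in the paragraph above can be written down as a tiny Python sketch. This is a deliberately simplified model of the reasoning in this thread, not ClusterXL's actual state machine: it ignores states such as Active Attention, and it lumps every unhealthy case into "Down". The function name `next_state` is invented for illustration.

```python
def next_state(healthy, sees_active_peer):
    """Simplified model: Standby needs a healthy member AND a visible Active peer."""
    if not healthy:
        return "Down"      # simplification: any critical problem
    if sees_active_peer:
        return "Standby"   # healthy, and someone else is already Active
    return "Active"        # healthy, and no Active member detected


for healthy in (True, False):
    for sees_active_peer in (True, False):
        print(healthy, sees_active_peer, next_state(healthy, sees_active_peer))
```

In this model "Standby" is only ever returned when `sees_active_peer` is true, which is why a cluster where every member is Standby cannot be a stable outcome: a Standby member implies an Active one somewhere.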

0 Kudos
