Nandhakumar_N
Participant

DR Scenario - ClusterXL Split Brain Mode

When we configure two firewalls as an Active/Standby cluster using ClusterXL, which scenarios can cause split brain?

Also, is there any case in which both nodes enter the Standby state?

 

0 Kudos
8 Replies
PhoneBoy
Admin

ClusterXL relies on Layer 2 connectivity between the cluster members on multiple interfaces.
If cluster members cannot reach each other on any interface, you can see "split brain" behavior.

I've never heard of both members entering standby state. 

0 Kudos
the_rock
MVP Diamond

If you send us the output of the commands below, we may be able to help more.

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat
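
For anyone who wants to gather all of that output in one go, here is a minimal sketch in Python. It assumes it is run on a cluster member where `cphaprob` is on the PATH and a Python 3 interpreter is available, which may not be true on every system. The helper names (`run_command`, `collect`, `format_report`) are made up for this example; only the `cphaprob` command lines come from the list above.

```python
import subprocess
from typing import Callable, Dict, List

# Command lines from the list above, split into argv form.
COMMANDS: List[List[str]] = [
    ["cphaprob", "state"],
    ["cphaprob", "-a", "if"],
    ["cphaprob", "-i", "list"],
    ["cphaprob", "-l", "list"],
    ["cphaprob", "syncstat"],
]


def run_command(argv: List[str]) -> str:
    """Run one command and return its stdout and stderr as a single string."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=False,
        )
    except FileNotFoundError:
        # Assumption: on a non-cluster host the binary is simply missing.
        return "(command not found: {})\n".format(argv[0])
    return result.stdout


def collect(runner: Callable[[List[str]], str] = run_command) -> Dict[str, str]:
    """Map each command line to its output, in the order listed."""
    return {" ".join(argv): runner(argv) for argv in COMMANDS}


def format_report(outputs: Dict[str, str]) -> str:
    """Join the outputs into one text report with a header per command."""
    sections = []
    for command, output in outputs.items():
        sections.append("### {}\n{}".format(command, output.rstrip()))
    return "\n\n".join(sections) + "\n"
```

Calling `print(format_report(collect()))` on each member produces one block of text per command that can be pasted into a reply.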

Best,
Andy
"Have a great day and if its not, change it"
0 Kudos
Bob_Zimmerman
MVP Gold

ClusterXL member state is a bit more complicated than just Active/Standby. I discuss the more common ClusterXL member states in a post earlier this year.

It's not possible for all members of a cluster to be Standby. That state only happens if the cluster member is healthy, and there is at least one Active member.

It is possible for all members to be Down if they all think they have a critical problem. I mostly see this when people use a crossover cable for sync. Sending your sync traffic through a switch fixes that particular problem.

It's possible for multiple members to be Active if they lose all communications with the other members of the cluster. For example, this can happen if you connect them FW1---Switch1---Switch2---FW2 and you lose the link between Switch1 and Switch2. Since each member doesn't see any other members in the cluster, they may all try to go Active. Connecting your firewalls to multiple switches (e.g., bond0 to one switch set, and bond1 to a different switch set) with physically separate inter-switch links reduces the chance of this happening.
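
The FW1---Switch1---Switch2---FW2 example can be illustrated with a toy reachability model. This is only a sketch of the topology argument, not of how ClusterXL actually decides member state; the names `Switch3` and `Switch4` are hypothetical stand-ins for the second switch set.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]


def reachable(start: str, links: Iterable[Link]) -> Set[str]:
    """Return every node reachable from start over the given Layer 2 links."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def visible_peers(member: str, members: List[str], links: List[Link]) -> List[str]:
    """List the other cluster members this member can still reach."""
    seen = reachable(member, links)
    return sorted(m for m in members if m != member and m in seen)


MEMBERS = ["FW1", "FW2"]
INTER_SWITCH = ("Switch1", "Switch2")

# Single path: FW1---Switch1---Switch2---FW2
CHAIN = [("FW1", "Switch1"), INTER_SWITCH, ("Switch2", "FW2")]
CHAIN_BROKEN = [link for link in CHAIN if link != INTER_SWITCH]

# Second, physically separate path (hypothetical Switch3/Switch4).
REDUNDANT = CHAIN + [("FW1", "Switch3"), ("Switch3", "Switch4"), ("Switch4", "FW2")]
REDUNDANT_BROKEN = [link for link in REDUNDANT if link != INTER_SWITCH]

for name, links in [("chain", CHAIN), ("chain, link lost", CHAIN_BROKEN),
                    ("redundant, link lost", REDUNDANT_BROKEN)]:
    print(name, {m: visible_peers(m, MEMBERS, links) for m in MEMBERS})
```

With the single chain, losing the inter-switch link leaves each member with no visible peers, which is the precondition for both going Active. With the second path in place, the same failure still leaves the members able to see each other.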

0 Kudos
Nandhakumar_N
Participant

When both members enter the Active state, how will traffic inspection behave?

We are using a crossover cable for sync. If the sync interface goes down or someone accidentally removes the sync cable, will that take production down completely? Why doesn't the Standby node take the Active role instead of going Down?

0 Kudos
Martijn
MVP

Hi,

Check Point has several checks to see if an appliance is capable of becoming a cluster member.

- Are the processes up?
- Are the interfaces up?
- Is there a policy installed?
- Is ClusterXL started?
- Is Sync interface up?

If you use a crossover cable for sync and it is removed, nothing will happen, in my experience. Both cluster members fail the Sync check, but all other checks are OK. Both members are equally degraded, and the cluster state (Active/Standby) is unchanged.

You could argue a fail-over can occur if the Standby member detects the Sync failure before the Active member does. But this will not result in a split-brain.

Martijn

0 Kudos
Bob_Zimmerman
MVP Gold

If both members go Active because inter-switch links have failed, each one will keep working for the part of the network that can reach it but not the other member. For example, if you have web server 1, database server 1, and firewall 1 connected to switch 1, WS2, DB2, and FW2 connected to switch 2, and the link between switch 1 and switch 2 fails, both firewalls could conceivably pass their respective web-to-database traffic. Any networks which can see both firewalls will see the VIP flap back and forth between the member MACs, so traffic through such interfaces won't work. Situations like this have very narrow requirements, so they are extremely rare outside of labs. In all real failures I've seen, one member goes Down and the other goes either Active or Active Attention.

 

Don't use a crossover for sync. While it mostly works, it's not one of the Supported Topologies for Synchronization Network in the ClusterXL Admin Guide.

When using a crossover cable, if the cable fails or the interface on either side fails, both members now have a problem. Each one has to figure out if its problem is fatal, so they start probing on all their monitored interfaces (by default, this is any non-VLAN interface with IP addresses, and the highest and lowest VLAN IDs on any interface with VLANs). If a member finds a monitored interface where it doesn't get any responses, it may go down. If both members find such an interface, they may both go down.
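
The default monitoring rule described above (non-VLAN interfaces with IP addresses, plus the lowest and highest VLAN ID on each interface that carries VLANs) can be sketched as a small selection function. This is an illustration of the rule as stated in this post, not Check Point's implementation; the `Interface` type, its fields, and the interface names are all made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Interface:
    name: str                      # e.g. "eth0" or "eth2.30" (hypothetical names)
    has_ip: bool
    vlan_id: Optional[int] = None  # None for a non-VLAN interface
    parent: Optional[str] = None   # interface that carries this VLAN


def default_monitored(interfaces: List[Interface]) -> List[str]:
    """Pick the interfaces that would be monitored under the rule above."""
    # Rule 1: any non-VLAN interface that has an IP address.
    monitored = [i.name for i in interfaces if i.vlan_id is None and i.has_ip]

    # Rule 2: on each interface with VLANs, the lowest and highest VLAN ID.
    by_parent: Dict[str, List[Interface]] = {}
    for i in interfaces:
        if i.vlan_id is not None and i.parent is not None:
            by_parent.setdefault(i.parent, []).append(i)
    for parent in sorted(by_parent):
        vlans = sorted(by_parent[parent], key=lambda i: i.vlan_id)
        lowest, highest = vlans[0], vlans[-1]
        monitored.append(lowest.name)
        if highest.name != lowest.name:
            monitored.append(highest.name)
    return monitored


EXAMPLE = [
    Interface("eth0", has_ip=True),
    Interface("eth1", has_ip=False),
    Interface("eth2", has_ip=False),
    Interface("eth2.10", has_ip=True, vlan_id=10, parent="eth2"),
    Interface("eth2.20", has_ip=True, vlan_id=20, parent="eth2"),
    Interface("eth2.30", has_ip=True, vlan_id=30, parent="eth2"),
]

print(default_monitored(EXAMPLE))  # ['eth0', 'eth2.10', 'eth2.30']
```

In this example eth2.20 is not probed at all, so a failure confined to that VLAN would not be noticed by the default monitoring.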

It's fairly rare for an interface or cable to fail while both members are up, but rebooting one member while using a crossover for sync means the member you're not rebooting sees a problem. The remaining member will start probing, and it can hit the problem I described. This doesn't always happen, but rebooting one member can cause the whole cluster to stop passing traffic.

0 Kudos
Martijn
MVP

Hi,

There is also a mechanism called local probing. When a cluster member does not receive CCP packets on an interface, it starts this mechanism by sending ARP packets on that interface to see if the problem is local. See:

sk171844 - How to troubleshoot the Critical Device "Local Probing" in ClusterXL

If it detects a local problem, it will 'leave' the cluster. If the problem is not local it will remain in the cluster, but the state depends on a lot of other factors.
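
The decision described above can be written out as a very rough sketch. It assumes that "local problem" means the ARP probes on that interface got no replies at all, which is a simplification made for this example; the function and enum names are hypothetical, and sk171844 remains the reference for the real behavior.

```python
from enum import Enum


class ProbeVerdict(Enum):
    NO_PROBE_NEEDED = "CCP is arriving, so local probing does not start"
    LOCAL_PROBLEM = "problem looks local, member leaves the cluster"
    NOT_LOCAL = "problem is not local, member stays (state depends on other factors)"


def local_probe_verdict(ccp_received: bool, arp_replies: int) -> ProbeVerdict:
    """Classify one interface from CCP reception and ARP probe replies."""
    if ccp_received:
        return ProbeVerdict.NO_PROBE_NEEDED
    # Assumption for this sketch: zero ARP replies means the fault is local.
    if arp_replies == 0:
        return ProbeVerdict.LOCAL_PROBLEM
    return ProbeVerdict.NOT_LOCAL


for ccp, replies in [(True, 0), (False, 0), (False, 3)]:
    print(ccp, replies, local_probe_verdict(ccp, replies).name)
```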

The ClusterXL technology has a lot of features to check the state of a cluster, and I cannot remember seeing a split-brain situation in all my years of working with ClusterXL.

Martijn 


0 Kudos
PhoneBoy
Admin

I know of at least one customer that encountered split-brain in the FW1---Switch1---Switch2---FW2 configuration @Bob_Zimmerman mentioned.
It was also on Nokia appliances, which should give you an idea of how long ago it was. 

0 Kudos
