Check Point cluster split-brain issue

markovencelj · 2019-11-08

Hello guys.

I am looking for some kind of official answer regarding cluster HA (how to avoid split brain).

We have an HA cluster spread across 2 different locations, with an L2 dark fiber link between them carrying all interfaces, VLANs, sync, etc.

The cluster is in Active/Standby mode, but the customer has a datacenter in active/active mode stretched across both locations.

The datacenters are configured to use a cloud-based witness in Azure. In case of split brain (if we totally cut all links between the locations), that witness decides which site goes offline and which remains active. The reason is that we must not write data on both sites, because that could lead to data corruption.

My question is about the Check Point cluster. What happens if we cut all interfaces between the members?

Do they become active/active? In that case traffic would go through both cluster members, which is not good.

How do you handle this kind of scenario to avoid split-brain situations?
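To make the witness idea concrete, here is a minimal sketch of majority voting with a single witness. It is only an illustration of the general technique, not the actual logic of the Azure witness: the function name `arbitrate`, the site names, and the "first site to reach the witness wins its vote" rule are all assumptions.

```python
from typing import Dict, List


def arbitrate(sites_reaching_witness: List[str], all_sites: List[str]) -> Dict[str, str]:
    """Decide each site's state after the inter-site links are cut.

    Assumption: every site holds one vote and the witness holds one vote,
    which it grants to the first site that reaches it.
    `sites_reaching_witness` is ordered by arrival time.
    """
    total_votes = len(all_sites) + 1          # one per site plus the witness
    majority = total_votes // 2 + 1
    winner = sites_reaching_witness[0] if sites_reaching_witness else None

    result = {}
    for site in all_sites:
        votes = 1 + (1 if site == winner else 0)
        # A site keeps serving only if it holds a majority of all votes.
        result[site] = "active" if votes >= majority else "offline"
    return result


if __name__ == "__main__":
    # Both sites are isolated from each other; site_b reaches the witness first.
    print(arbitrate(["site_b", "site_a"], ["site_a", "site_b"]))
```

With two sites, a site that cannot reach the witness holds only 1 of 3 votes and goes offline, so at most one site keeps writing data.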

BR

PhoneBoy · 2019-11-09

This has been discussed previously.
See: https://community.checkpoint.com/t5/General-Topics/Check-Point-Clustering-Query/td-p/21269
In short, ClusterXL is fairly robust at handling split brain situations.

markovencelj · 2019-11-11

Hello.

I have already read that article and I am aware of the clustering protocol, but my question is:

What if we cut the cable that connects ALL interfaces of both cluster members?

Split brain will occur. The question is: what will the status of the cluster members be?

- active/active

- active/down

- down/down

BR

G_W_Albrecht · 2019-11-11

This is covered in the linked article by Timothy_Hall:

--> In the case of an equal failure (i.e. a switch both members are attached to has its power cord pulled), they both report an identical failure to each other and nothing happens; whichever firewall that was previously active remains active as there is nothing to gain by attempting a failover.

So it should be active/standby, as it had been previously. But I would strongly suggest testing the possible situations and verifying the behavior in a maintenance window nevertheless.
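The "equal failure" rule quoted above can be sketched as a small decision function. This is a toy model of the quoted behavior only, not ClusterXL's actual algorithm: the function name `next_active`, the member names, and the use of a failed-interface count are assumptions, and it only applies while the members can still report their state to each other.

```python
from typing import Dict


def next_active(current_active: str, failed_interfaces: Dict[str, int]) -> str:
    """Pick the active member from the failure counts the members report.

    Assumption: each member reports how many of its monitored interfaces
    are down, and fewer failures means a healthier member.
    """
    healthiest = min(failed_interfaces.values())
    if failed_interfaces[current_active] == healthiest:
        # Equal (or better) health than every peer: nothing to gain by
        # failing over, so the previously active member stays active.
        return current_active
    # Otherwise fail over to a healthier member (name order breaks ties).
    candidates = [m for m, n in failed_interfaces.items() if n == healthiest]
    return sorted(candidates)[0]


if __name__ == "__main__":
    # Shared switch loses power: both members report the same failure.
    print(next_active("fw1", {"fw1": 1, "fw2": 1}))
    # Only the active member loses interfaces.
    print(next_active("fw1", {"fw1": 2, "fw2": 0}))
```

In the first call both members report the same failure, so `fw1` stays active; in the second only `fw1` is degraded, so the role moves to `fw2`.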

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Tommy_Forrest · 2019-11-11

ClusterXL may be fairly robust. But we've certainly had some issues with it.

We have a 3-way cluster where nodes 1 and 2 sit next to each other and node 3 sits about a mile and a half down the street.

Shortly after migrating to 80.10 we had a couple of instances where the cluster became unresponsive. Upon inspection of the cluster status, we found each node thinking its peers were down and each node thinking it was active.

And of course, no traffic was flowing whatsoever. A cpstop/cpstart on each of the gateways fixed the issue.

Later on we determined that the sync interface was being overrun. The issue eventually went away after we installed a hotfix. Since then, it hasn't been an issue.

Ideally, we'd move Sync to a 10G interface. But interface cards and SFPs are $$$$$. Maybe that'll be remedied in a future project.
