Hi all,
We recently experienced a problem where the third member of a three-member cluster incorrectly saw the ACTIVE member as DEAD and the second member as Down, went Active itself, and caused a major outage on our network.
Whilst we found and rectified the issue, CheckPoint Support have said that they were unable to find the root cause of the cluster issue and cannot progress further unless they are able to troubleshoot while the issue is actually occurring.
I thought I'd share the issue and resolution to add to the knowledge base, and to see if anyone has experienced anything similar.
1. The CheckPoint cluster that had the issue has three cluster members, and CheckPoint monitoring reported all members as OK.
2. Cluster members 1 and 2 both reported the cphaprob state as Active/Standby/Standby, as expected.
3. Cluster member 3, however, reported that member 1 was Active, member 2 was Down and itself (member 3) was Standby (copy of the output below).
[Expert@fw-xxx-003:0]# cphaprob state
Cluster Mode: High Availability (Active Up) with IGMP Membership
Number     Unique Address   Assigned Load   State
1          192.168.1.1      100%            Active
2          192.168.1.10     0%              Down
3 (local)  192.168.1.2      0%              Standby
4. There was no indication from SmartDashboard, SmartView Monitor or SmartView Tracker that there was any problem with the cluster or any cluster member.
5. There was nothing to indicate how long member 3 had been in this state, believing itself healthy but seeing member 2 as down. However, as the issue had occurred to a greater or lesser extent on at least two previous policy installs, we estimate it could have been at least two weeks.
6. What we found was that when a policy was installed, cluster members 1 and 2 behaved as expected, but cluster member 3 reported member 1 as "DEAD", still considered member 2 down, and decided to become ACTIVE. As member 1 remained ACTIVE, this effectively left us with two cluster members in an ACTIVE state, both fighting to process traffic.
7. Although it wasn't evident in the logs of cluster members 1 and 2, a second policy install or a roll-back must have eventually allowed cluster member 3 to return to a Standby state and traffic to pass normally.
8. We left the cluster state as it was to give CheckPoint Support a chance to examine the issue. They agreed that the specific policy being installed was not the problem; rather, installing any policy (no matter how minor the change) would trigger the issue because of cluster member 3's incorrect "cphaprob state".
9. CheckPoint Support confirmed that this state had to be corrected and a policy install re-tested. After restarting the service, "cphaprob state" reported Active/Standby/Standby on all members, and a policy install was then performed without issue or incident.
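For anyone who hits the same thing, the recovery on member 3 boiled down to a short command sequence along the lines below. This is a rough sketch only: cpstop/cpstart is shown as the blunt "restart everything" option for illustration, and the exact service restart used may differ in your environment.

cphaprob state    # confirm the incorrect view (member 2 shown as Down)
cpstop            # stop CheckPoint services on the problem member only
cpstart           # start them again so the member rejoins the cluster
cphaprob state    # should now report Active/Standby/Standby on all members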
Here is the section of the logs from the last incident, showing the policy install and the point where member 3 reports member 1 as DEAD and tries to take over:
Nov 14 07:30:00 2017 fw-xxx-003 kernel: [fw4_0];FW-1: SIM (SecureXL Implementation Module) SecureXL device detected.
Nov 14 07:30:00 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 1 has changed from 1 to 0
Nov 14 07:30:01 2017 fw-xxx-003 kernel: [fw4_1];fwioctl: Policy has ended. Continuing extending dead timouts (fwha_cul_policy_done_time=141066855)
Nov 14 07:30:01 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [CUL - Member] Policy Freeze mechanism disabled, Enabling state machine at 2 (time=141066855, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE)
Nov 14 07:30:01 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 0 has changed from 1 to 0
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 1 has changed from 0 to 1
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: user logged from admin
Nov 14 07:30:03 2017 fw-xxx-003 xpand[21398]: admin localhost t +volatile:clish:admin:23803 t
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: User admin logged in with ReadWrite permission
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: cmd by admin: Start executing : show interfaces ... (cmd md5: 50efb6e261b20cb2200ce9fe0fa3a6d5)
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: cmd by admin: Processing : show interfaces all (cmd md5: 50efb6e261b20cb2200ce9fe0fa3a6d5)
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1];FW-1: fwha_state_freeze: turning freeze type 0 ON (time=141066879, caller=fwha_set_conf)
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1];FW-1: fwha_state_freeze: FREEZING state machine at STANDBY (time=141066879, caller=fwha_set_conf, freeze_type=0)
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1]; Sync
Nov 14 07:30:03 2017 fw-xxx-003 xpand[21398]: admin localhost t -volatile:clish:admin:23803
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: User admin logged out from CLI shell
Nov 14 07:30:04 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 0 has changed from 0 to 1
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.2
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.3
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.4
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.5
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.6
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.7
What we don't know:
- How did cluster member 3 get into the state it was in?
- Why did no PNOTES exist on any cluster member?
- Why did it see member 2 as down?
- Why did member 3 respond by going ACTIVE alongside an already-ACTIVE member, causing major disruption to all traffic flow?
- Why were members 1 and 2 unaware of the problem member 3 was having, and why did they not report anything in their logs about member 3 trying to take over?
- What can be done to ensure this does not happen again?
Policy Install Procedures Moving Forward
As a result of the issue, we've implemented a procedure where the "cphaprob state" of each cluster member is checked before a Policy install is considered.
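For reference, here is a minimal sketch of the kind of pre-install check that can be run in expert mode on each member. It is an illustration only: the grep patterns assume the "cphaprob state" output format shown earlier in this post and may need adjusting for other versions, and the warning text is just a placeholder.

#!/bin/bash
# Pre-policy-install sanity check, run in expert mode on each cluster member.
# Sketch only: assumes the "cphaprob state" output format shown above.

STATE=$(cphaprob state)
echo "$STATE"

# Any member reported as Down or DEAD means the members disagree about the
# cluster state - stop and investigate before pushing policy.
if echo "$STATE" | grep -Eiq 'down|dead'; then
    echo "WARNING: unexpected member state on $(hostname) - do not install policy" >&2
    exit 1
fi

# Also list the registered critical devices (pnotes) so any problem state is visible.
cphaprob list

exit 0

The same check could be scheduled via cron to give a basic alert between policy installs.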
Has anyone experienced a similar issue or symptoms?
Did you find any way to auto-alert you that there was an issue with a cluster member?