Hi all,
We recently experienced a problem where the third member of a three-member cluster incorrectly saw the ACTIVE member as DEAD and the second member as Down, went Active itself, and caused a major outage on our network.
Whilst we found and rectified the issue, CheckPoint Support have said that they were unable to find the root cause of the cluster issue and cannot progress further unless they are able to troubleshoot while the issue is actually occurring.
I thought I'd share the issue and resolution to add to the knowledge base, and to see if anyone has experienced anything similar.
1. The CheckPoint cluster that had the issue has three cluster members, and CheckPoint monitoring reported all members as OK.
2. Cluster members 1 and 2 both reported the cphaprob state as Active/Standby/Standby, as expected.
3. Cluster member 3, however, reported that member 1 was Active, member 2 was Down and itself (member 3) was Standby (copy of the output below).
[Expert@fw-xxx-003:0]# cphaprob state
Cluster Mode: High Availability (Active Up) with IGMP Membership
Number     Unique Address   Assigned Load   State
1          192.168.1.1      100%            Active
2          192.168.1.10     0%              Down
3 (local)  192.168.1.2      0%              Standby
4. There was no indication from SmartDashboard, SmartView Monitor or SmartView Tracker that there was any problem with the cluster or any cluster member.
5. There was nothing to indicate how long member 3 had been in this state, believing itself healthy but seeing member 2 as down. However, as the issue had occurred to a greater or lesser extent on at least two previous policy installs, we estimate it could have been at least two weeks.
6. What we found was that when a policy was installed, cluster members 1 and 2 behaved as expected, but cluster member 3 reported member 1 as "DEAD", still considered member 2 down, and decided to become ACTIVE. As member 1 remained ACTIVE, this effectively left us with two cluster members in an ACTIVE state, both fighting to process traffic.
7. Although it wasn't evident in the logs of cluster members 1 and 2, a second policy install or a roll-back must have eventually allowed cluster member 3 to return to a Standby state and traffic to pass normally.
8. We left the cluster state as it was to give CheckPoint Support a chance to examine the issue. They agreed that the specific policy being installed was not the problem; rather, installing any policy (no matter how minor the change) would trigger the issue because of cluster member 3's incorrect "cphaprob state".
9. CheckPoint Support confirmed that this state had to be corrected and a policy install re-tested. After restarting the service, "cphaprob state" reported Active/Standby/Standby on all members, and a policy install was then performed without issue or incident.
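For anyone who hits the same thing, the recovery on member 3 boiled down to a short command sequence along the lines below. This is a rough sketch only: cpstop/cpstart is shown as the blunt "restart everything" option for illustration, and the exact service restart used may differ in your environment.

cphaprob state    # confirm the incorrect view (member 2 shown as Down)
cpstop            # stop CheckPoint services on the problem member only
cpstart           # start them again so the member rejoins the cluster
cphaprob state    # should now report Active/Standby/Standby on all members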
Here is the section of the logs from the last incident, showing the policy install and the point where member 3 reports member 1 as DEAD and tries to take over:
Nov 14 07:30:00 2017 fw-xxx-003 kernel: [fw4_0];FW-1: SIM (SecureXL Implementation Module) SecureXL device detected.
Nov 14 07:30:00 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 1 has changed from 1 to 0
Nov 14 07:30:01 2017 fw-xxx-003 kernel: [fw4_1];fwioctl: Policy has ended. Continuing extending dead timouts (fwha_cul_policy_done_time=141066855)
Nov 14 07:30:01 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [CUL - Member] Policy Freeze mechanism disabled, Enabling state machine at 2 (time=141066855, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE)
Nov 14 07:30:01 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 0 has changed from 1 to 0
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 1 has changed from 0 to 1
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: user logged from admin
Nov 14 07:30:03 2017 fw-xxx-003 xpand[21398]: admin localhost t +volatile:clish:admin:23803 t
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: User admin logged in with ReadWrite permission
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: cmd by admin: Start executing : show interfaces ... (cmd md5: 50efb6e261b20cb2200ce9fe0fa3a6d5)
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: cmd by admin: Processing : show interfaces all (cmd md5: 50efb6e261b20cb2200ce9fe0fa3a6d5)
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1];FW-1: fwha_state_freeze: turning freeze type 0 ON (time=141066879, caller=fwha_set_conf)
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1];FW-1: fwha_state_freeze: FREEZING state machine at STANDBY (time=141066879, caller=fwha_set_conf, freeze_type=0)
Nov 14 07:30:03 2017 fw-xxx-003 kernel: [fw4_1]; Sync
Nov 14 07:30:03 2017 fw-xxx-003 xpand[21398]: admin localhost t -volatile:clish:admin:23803
Nov 14 07:30:03 2017 fw-xxx-003 clish[23803]: User admin logged out from CLI shell
Nov 14 07:30:04 2017 fw-xxx-003 kernel: [fw4_1];FW-1: [freeze_on_remote] freeze state on remote member 0 has changed from 0 to 1
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.2
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.3
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.4
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.5
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.6
Nov 14 07:30:16 2017 fw-xxx-003 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 14106688.2 and now 14106700.7
What we don't know:
- How did cluster member 3 get into the state it was in?
- Why did no PNOTES exist on any cluster member?
- Why did it see member 2 as down?
- Why did member 3 respond by going ACTIVE alongside an already-ACTIVE member, causing major disruption to all traffic flow?
- Why were members 1 and 2 unaware of the problem member 3 was having, and why did they not report anything in their logs about member 3 trying to take over?
- What can be done to ensure this does not happen again?
Policy Install Procedures Moving Forward
As a result of the issue, we've implemented a procedure where the "cphaprob state" of each cluster member is checked before a Policy install is considered.
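For reference, here is a minimal sketch of the kind of pre-install check that can be run in expert mode on each member. It is an illustration only: the grep patterns assume the "cphaprob state" output format shown earlier in this post and may need adjusting for other versions, and the warning text is just a placeholder.

#!/bin/bash
# Pre-policy-install sanity check, run in expert mode on each cluster member.
# Sketch only: assumes the "cphaprob state" output format shown above.

STATE=$(cphaprob state)
echo "$STATE"

# Any member reported as Down or DEAD means the members disagree about the
# cluster state - stop and investigate before pushing policy.
if echo "$STATE" | grep -Eiq 'down|dead'; then
    echo "WARNING: unexpected member state on $(hostname) - do not install policy" >&2
    exit 1
fi

# Also list the registered critical devices (pnotes) so any problem state is visible.
cphaprob list

exit 0

The same check could be scheduled via cron to give a basic alert between policy installs.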
Has anyone experienced a similar issue or symptoms?
Did you find any way to auto-alert you that there was an issue with a cluster member?