NiladriSarkar
Contributor

ClusterXL HA issue

Hi all,

This has now happened a few times in the last 6 months. The Standby firewall does not receive the CCP packets and marks the Sync interface as down. The cluster goes into a split-brain scenario.
It resolves itself in less than a minute. All BGP peers are re-established. Any idea why this is happening?

Note: Sys_admin installed Threat Prevention policy right after this. There were spike detective alerts for temain right before this happened (can be totally unrelated).


Active firewall

Dec 11 01:53:31 2025 F1-2 spike_detective: spike info: type: cpu, cpu core: 42, top consumer: fwk0_dev_57, start time: 11/12/25 01:53:18, spike duration (sec): 12, initial cpu usage: 91, average cpu usage: 74, perf taken: 0

Dec 11 01:54:37 2025 F1-2 spike_detective: spike info: type: thread, thread id: 115061, thread name: temain, start time: 11/12/25 01:54:30, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1

Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-114402-2: State change: ACTIVE -> STANDBY | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)

Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210305-2: Remote member 1 (state DOWN -> ACTIVE(!)) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210300-2: Remote member 1 (state ACTIVE(!) -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-114704-2: State change: STANDBY -> ACTIVE | Reason: No other ACTIVE members have been found in the cluster

Dec 11 01:55:27 2025 F1-2 fwk: CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-214802-2: Remote member 1 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-211700-2: Remote member 1 (state STANDBY -> DOWN) | Reason: ROUTED PNOTE
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-100201-2: Failover member 2 -> member 1 | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-120105-2: routed PNOTE ON
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-111705-2: State change: ACTIVE -> ACTIVE(!) | Reason: ROUTED PNOTE

Dec 11 01:55:28 2025 F1-2 fwk: CLUS-120105-2: routed PNOTE OFF
Dec 11 01:55:28 2025 F1-2 fwk: CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved


Standby Firewall

Dec 11 01:55:24 2025 F1-1 fwk: CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

Dec 11 01:55:25 2025 F1-1 fwk: CLUS-216400-1: Remote member 2 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Dec 11 01:55:25 2025 F1-1 fwk: CLUS-116505-1: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface Sync is down (Cluster Control Protocol packets are not received)

Dec 11 01:55:25 2025 F1-1 fwk: CLUS-100201-1: Failover member 2 -> member 1 | Reason: Available on member 2
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-214802-1: Remote member 2 (state LOST -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-110305-1: State change: ACTIVE! -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

Dec 11 01:55:27 2025 F1-1 fwk: CLUS-214904-1: Remote member 2 (state STANDBY -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)

Dec 11 01:55:27 2025 F1-1 fwk: CLUS-120105-1: routed PNOTE ON
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-111700-1: State change: STANDBY -> DOWN | Reason: ROUTED PNOTE

Dec 11 01:55:27 2025 F1-1 fwk: CLUS-100102-1: Failover member 1 -> member 2 | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-1 routed[168442]: [routed] ERROR: cpcl_recv: Failed to receive cluster message header, connection will need to be reestablished. errno = 104 (Connection reset by peer)
Dec 11 01:55:27 2025 F1-1 routed[168442]: [routed] ERROR: cpcl_recv: deleting peer task 0x8f1aee4 due to failure to read from the socket
Dec 11 01:56:02 2025 F1-1 fwk: CLUS-120105-1: routed PNOTE OFF
Dec 11 01:56:02 2025 F1-1 fwk: CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
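For anyone comparing the two members' logs, it can help to merge them into one time-ordered view. The following is a minimal Python sketch, not a Check Point tool: the regular expression is inferred only from the log lines quoted above, the function names are made up for illustration, and it assumes both members' clocks are in sync. The sample input is four lines copied from the logs in this post.

```python
import re
from datetime import datetime

# Pattern inferred from the fwk lines quoted in this post (an assumption,
# not an official log format specification).
CLUS_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d \d{4}) (?P<host>\S+) fwk: "
    r"(?P<code>CLUS-\d+-\d+): (?P<event>.*?)(?: \| Reason: (?P<reason>.*))?$"
)

def parse_clus(line):
    """Return a dict for one fwk CLUS line, or None if the line does not match."""
    m = CLUS_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%b %d %H:%M:%S %Y")
    return entry

def merged_timeline(*logs):
    """Merge several log texts into one list sorted by timestamp (stable sort)."""
    events = []
    for log in logs:
        for line in log.splitlines():
            entry = parse_clus(line)
            if entry is not None:
                events.append(entry)
    return sorted(events, key=lambda e: e["ts"])

# Lines copied verbatim from the logs above.
ACTIVE_LOG = """\
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:28 2025 F1-2 fwk: CLUS-120105-2: routed PNOTE OFF
"""
STANDBY_LOG = """\
Dec 11 01:55:24 2025 F1-1 fwk: CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:25 2025 F1-1 fwk: CLUS-216400-1: Remote member 2 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
"""

for e in merged_timeline(ACTIVE_LOG, STANDBY_LOG):
    print(e["ts"].time(), e["host"], e["code"], e["event"])
```

In the merged view, F1-1 reports the Sync interface down at 01:55:24, three seconds before the first cluster entry on F1-2 at 01:55:27.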

15 Replies
Vincent_Bacher

I see these suspicious messages:

Dec 11 01:53:31 2025 F1-2 spike_detective: spike info: type: cpu, cpu core: 42, top consumer: fwk0_dev_57, start time: 11/12/25 01:53:18, spike duration (sec): 12, initial cpu usage: 91, average cpu usage: 74, perf taken: 0

Dec 11 01:54:37 2025 F1-2 spike_detective: spike info: type: thread, thread id: 115061, thread name: temain, start time: 11/12/25 01:54:30, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1


The question is why this device consumes so much CPU. I guess it's VSX; it may be worth analysing what exactly caused the spike and considering an adjustment of the VS core assignment.
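To judge how close the spikes were to the failover, the spike_detective lines can be parsed and compared with the first CCP loss in the cluster logs. This is only a rough Python sketch: the field layout is inferred from the two lines quoted above, the helper names are hypothetical, the start time is assumed to be day/month/year (11/12/25 matching the Dec 11 syslog stamp), and 01:55:24 is taken from the first "Interface Sync is down" entry on F1-1.

```python
from datetime import datetime, timedelta

MARKER = " spike_detective: spike info: "

def parse_spike(line):
    """Split one spike_detective line into a dict of its 'key: value' fields."""
    _, sep, body = line.partition(MARKER)
    if not sep:
        return None
    fields = dict(item.split(": ", 1) for item in body.strip().split(", "))
    # Assumption: day-first date, as suggested by the syslog timestamp.
    start = datetime.strptime(fields["start time"], "%d/%m/%y %H:%M:%S")
    fields["start"] = start
    fields["end"] = start + timedelta(seconds=int(fields["spike duration (sec)"]))
    return fields

def seconds_before(spike, event_time):
    """Seconds between the end of a spike and a later event."""
    return (event_time - spike["end"]).total_seconds()

# Lines copied verbatim from the original post.
SPIKE_LINES = [
    "Dec 11 01:53:31 2025 F1-2 spike_detective: spike info: type: cpu, cpu core: 42, top consumer: fwk0_dev_57, start time: 11/12/25 01:53:18, spike duration (sec): 12, initial cpu usage: 91, average cpu usage: 74, perf taken: 0",
    "Dec 11 01:54:37 2025 F1-2 spike_detective: spike info: type: thread, thread id: 115061, thread name: temain, start time: 11/12/25 01:54:30, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1",
]

# First "Interface Sync is down" entry on F1-1 in the posted logs.
FIRST_CCP_LOSS = datetime(2025, 12, 11, 1, 55, 24)

for spike in map(parse_spike, SPIKE_LINES):
    print(spike["type"], "spike ended", seconds_before(spike, FIRST_CCP_LOSS),
          "seconds before the first CCP loss")
```

With the posted lines, the cpu spike ends 114 seconds and the temain spike 48 seconds before the Standby first reports the Sync interface down, so the gap by itself neither confirms nor rules out a link.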

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite
NiladriSarkar
Contributor

It is not VSX. The CPU spikes are short-lived, mostly for temain threads. It's a 28600 box and is not being over-utilized. Will investigate the CPU issue anyway. Thanks.

Chris_Atkinson
MVP Platinum CHKP

ClusterXL typically has split-brain prevention mechanisms, so either a member is overwhelmed or there is some Layer 2 issue.

What is the topology of the sync interface? Is it a bond? Are there intermediate switches, etc.?

CCSM R77/R80/ELITE
NiladriSarkar
Contributor

Not a bond interface, and no switch in between; they are directly connected. The cable was replaced after we saw this issue earlier.

Chris_Atkinson
MVP Platinum CHKP

Which version / JHF are we working with?

NiladriSarkar
Contributor

It's R81.20, take 113.

the_rock
MVP Platinum

If I were you, I would install the recommended one, take 119, but either way, what @Vincent_Bacher said makes total sense, at least to me.

Best,
Andy
NiladriSarkar
Contributor

Yup, thank you. Will check on the CPU usage and plan to install T119.

Chris_Atkinson
MVP Platinum CHKP

Good idea, since T119 has some CXL fixes.

NiladriSarkar
Contributor

Right! Thanks.

PRJ-62301, PMTR-115027 (ClusterXL): In ClusterXL High Availability (HA), in some scenarios, the Active cluster member stops sending Cluster Control Protocol (CCP) heartbeats, and the Standby member may misinterpret this as an Interface Active Check (IAC) failure.

the_rock
MVP Platinum

You can also follow the post below for historical data:

https://community.checkpoint.com/t5/Security-Gateways/How-to-view-cpview-history-file-on-other-machi...

or

cpview -t and then press t again

Best,
Andy
the_rock
MVP Platinum

I feel it will improve the situation, for sure.

Best,
Andy
the_rock
MVP Platinum

I see the point Vince is making. That could absolutely happen due to CPU spike.

Best,
Andy
Vincent_Bacher

Perhaps I didn't express myself clearly enough as a non-native English speaker, but thank you for the flowers.

the_rock
MVP Platinum

You absolutely did, I got all you had to say. Don't worry, English is not my first language either lol

Best,
Andy