Solved: Re: ClusterXL HA issue

NiladriSarkar · ‎2025-12-11

Hi all,

This has now happened a few times in the last 6 months. The Standby firewall does not receive the CCP packets and marks the Sync interface as down. The cluster goes into a split-brain scenario.
It resolves itself in less than a minute. All BGP peers are re-established. Any idea why this is happening?

Note: Sys_admin installed the Threat Prevention policy right after this. There were spike detective alerts for temain right before this happened (could be totally unrelated).

Active firewall

Dec 11 01:53:31 2025 F1-2 spike_detective: spike info: type: cpu, cpu core: 42, top consumer: fwk0_dev_57, start time: 11/12/25 01:53:18, spike duration (sec): 12, initial cpu usage: 91, average cpu usage: 74, perf taken: 0

Dec 11 01:54:37 2025 F1-2 spike_detective: spike info: type: thread, thread id: 115061, thread name: temain, start time: 11/12/25 01:54:30, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1

Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-114402-2: State change: ACTIVE -> STANDBY | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)

Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210305-2: Remote member 1 (state DOWN -> ACTIVE(!)) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-210300-2: Remote member 1 (state ACTIVE(!) -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-114704-2: State change: STANDBY -> ACTIVE | Reason: No other ACTIVE members have been found in the cluster

Dec 11 01:55:27 2025 F1-2 fwk: CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-214802-2: Remote member 1 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-211700-2: Remote member 1 (state STANDBY -> DOWN) | Reason: ROUTED PNOTE
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-100201-2: Failover member 2 -> member 1 | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-120105-2: routed PNOTE ON
Dec 11 01:55:27 2025 F1-2 fwk: CLUS-111705-2: State change: ACTIVE -> ACTIVE(!) | Reason: ROUTED PNOTE

Dec 11 01:55:28 2025 F1-2 fwk: CLUS-120105-2: routed PNOTE OFF
Dec 11 01:55:28 2025 F1-2 fwk: CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved

Standby Firewall

Dec 11 01:55:24 2025 F1-1 fwk: CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

Dec 11 01:55:25 2025 F1-1 fwk: CLUS-216400-1: Remote member 2 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Dec 11 01:55:25 2025 F1-1 fwk: CLUS-116505-1: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface Sync is down (Cluster Control Protocol packets are not received)

Dec 11 01:55:25 2025 F1-1 fwk: CLUS-100201-1: Failover member 2 -> member 1 | Reason: Available on member 2
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-214802-1: Remote member 2 (state LOST -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-110305-1: State change: ACTIVE! -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

Dec 11 01:55:27 2025 F1-1 fwk: CLUS-214904-1: Remote member 2 (state STANDBY -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)

Dec 11 01:55:27 2025 F1-1 fwk: CLUS-120105-1: routed PNOTE ON
Dec 11 01:55:27 2025 F1-1 fwk: CLUS-111700-1: State change: STANDBY -> DOWN | Reason: ROUTED PNOTE

Dec 11 01:55:27 2025 F1-1 fwk: CLUS-100102-1: Failover member 1 -> member 2 | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Dec 11 01:55:27 2025 F1-1 routed[168442]: [routed] ERROR: cpcl_recv: Failed to receive cluster message header, connection will need to be reestablished. errno = 104 (Connection reset by peer)
Dec 11 01:55:27 2025 F1-1 routed[168442]: [routed] ERROR: cpcl_recv: deleting peer task 0x8f1aee4 due to failure to read from the socket
Dec 11 01:56:02 2025 F1-1 fwk: CLUS-120105-1: routed PNOTE OFF
Dec 11 01:56:02 2025 F1-1 fwk: CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
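
For anyone lining up the two members' logs, here is a minimal Python sketch that pulls the local "State change" entries out of both excerpts and merges them into one timeline. The regular expression is an assumption based only on the line format quoted above, the helper name `parse_state_changes` is made up for illustration, and the ordering is only meaningful if both members' clocks are in sync.

```python
import re
from datetime import datetime

# Pattern for local "State change" entries as quoted in this thread, e.g.
# "Dec 11 01:55:24 2025 F1-1 fwk: CLUS-110300-1: State change: STANDBY -> DOWN | Reason: ..."
# Assumption: other ClusterXL versions may format these lines differently.
STATE_CHANGE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d \d{4}) (?P<host>\S+) fwk: "
    r"(?P<code>CLUS-\d+-\d+): State change: (?P<old>\S+) -> (?P<new>\S+)"
    r" \| Reason: (?P<reason>.*)$"
)

def parse_state_changes(lines):
    """Return (timestamp, host, old_state, new_state, reason) tuples sorted by time."""
    events = []
    for line in lines:
        m = STATE_CHANGE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%b %d %H:%M:%S %Y")
            events.append((ts, m["host"], m["old"], m["new"], m["reason"]))
    # sorted() is stable, so entries with the same second keep their input order.
    return sorted(events, key=lambda e: e[0])

# Four lines copied from the excerpts above (F1-2 first, then F1-1).
SAMPLE = [
    "Dec 11 01:55:27 2025 F1-2 fwk: CLUS-114402-2: State change: ACTIVE -> STANDBY | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)",
    "Dec 11 01:55:27 2025 F1-2 fwk: CLUS-114704-2: State change: STANDBY -> ACTIVE | Reason: No other ACTIVE members have been found in the cluster",
    "Dec 11 01:55:24 2025 F1-1 fwk: CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)",
    "Dec 11 01:55:25 2025 F1-1 fwk: CLUS-116505-1: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface Sync is down (Cluster Control Protocol packets are not received)",
]

events = parse_state_changes(SAMPLE)
for ts, host, old, new, reason in events:
    print(f"{ts:%H:%M:%S} {host}: {old} -> {new}")
```

Fed the full excerpts from both members, the merged view makes it easier to see which member changed state first and for how long both considered themselves ACTIVE.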

Vincent_Bacher · ‎2025-12-11

I see these suspicious messages:

Dec 11 01:53:31 2025 F1-2 spike_detective: spike info: type: cpu, cpu core: 42, top consumer: fwk0_dev_57, start time: 11/12/25 01:53:18, spike duration (sec): 12, initial cpu usage: 91, average cpu usage: 74, perf taken: 0

Dec 11 01:54:37 2025 F1-2 spike_detective: spike info: type: thread, thread id: 115061, thread name: temain, start time: 11/12/25 01:54:30, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1

The question is why this device consumes so much CPU. I guess it's VSX, so it may be worth analysing what exactly caused the spike and considering an adjustment of the VS core assignment.
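
To check how close the spikes were to the CCP loss, a rough Python sketch along these lines could work. It parses the two spike_detective lines quoted above and measures the gap between the end of each spike and the first "CCP packets are not received" entry on F1-1 (01:55:24). The regex and the helper name `parse_spike` are assumptions for illustration, the start time is assumed to be day/month/year (which matches the Dec 11 syslog stamp), and the comparison assumes both members' clocks are in sync.

```python
import re
from datetime import datetime, timedelta

# Pattern for the spike_detective lines quoted in this thread (an assumption,
# not an official format description).
SPIKE = re.compile(
    r"spike info: type: (?P<type>\w+), .*?"
    r"start time: (?P<start>\d\d/\d\d/\d\d \d\d:\d\d:\d\d), "
    r"spike duration \(sec\): (?P<dur>\d+), "
    r"initial cpu usage: (?P<init>\d+), average cpu usage: (?P<avg>\d+)"
)

def parse_spike(line):
    """Return a dict with the spike type, start, end and average CPU, or None."""
    m = SPIKE.search(line)
    if m is None:
        return None
    # Assumption: "11/12/25" is day/month/year, i.e. 11 December 2025.
    start = datetime.strptime(m["start"], "%d/%m/%y %H:%M:%S")
    end = start + timedelta(seconds=int(m["dur"]))
    return {"type": m["type"], "start": start, "end": end, "avg_cpu": int(m["avg"])}

LINES = [
    "Dec 11 01:53:31 2025 F1-2 spike_detective: spike info: type: cpu, cpu core: 42, top consumer: fwk0_dev_57, start time: 11/12/25 01:53:18, spike duration (sec): 12, initial cpu usage: 91, average cpu usage: 74, perf taken: 0",
    "Dec 11 01:54:37 2025 F1-2 spike_detective: spike info: type: thread, thread id: 115061, thread name: temain, start time: 11/12/25 01:54:30, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1",
]

# First CCP loss entry in the F1-1 log from the original post.
ccp_loss = datetime(2025, 12, 11, 1, 55, 24)

spikes = [parse_spike(line) for line in LINES]
gaps = [(ccp_loss - s["end"]).total_seconds() for s in spikes]
for s, gap in zip(spikes, gaps):
    print(f"{s['type']} spike ended {s['end']:%H:%M:%S}, {gap:.0f}s before CCP loss")
```

This only covers the spikes that spike_detective reported; shorter spikes or load on other cores would not show up here, so the cpview history is still worth checking.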

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite

NiladriSarkar · ‎2025-12-11

It is not VSX. The CPU spikes are short-lived, mostly for TEMAIN threads. It's a 28600 box and is not being overutilized. Will investigate the CPU issue anyway. Thanks.

Chris_Atkinson · ‎2025-12-11

ClusterXL typically has split-brain prevention mechanisms, so either a member is overwhelmed or there is some Layer-2 issue.

What is the topology of the sync interface? Is it a bond, are there intermediate switches, etc.?

CCSM R77/R80/ELITE

NiladriSarkar · ‎2025-12-11

Not a bond interface, and no switch in between; they are directly connected. The cable was replaced after we saw this issue earlier.

Chris_Atkinson · ‎2025-12-11

Which version / JHF are we working with?

CCSM R77/R80/ELITE

NiladriSarkar · ‎2025-12-11

It's R81.20, Take 113.

the_rock · ‎2025-12-11

If I were you, I would install the recommended one, Take 119. Either way, what @Vincent_Bacher said makes total sense, at least to me.

Best,
Andy

NiladriSarkar · ‎2025-12-11

Yup, thank you. I will check the CPU usage and plan to install T119.

Chris_Atkinson · ‎2025-12-11

Good idea, since T119 has some CXL fixes.

CCSM R77/R80/ELITE

NiladriSarkar · ‎2025-12-11

Right! Thanks.

PRJ-62301, PMTR-115027 ClusterXL In ClusterXL High Availability (HA), in some scenarios, the Active cluster member stops sending Cluster Control Protocol (CCP) heartbeats, and the Standby member may misinterpret this as an Interface Active Check (IAC) failure.

the_rock · ‎2025-12-11

You can also follow below for historical data:

https://community.checkpoint.com/t5/Security-Gateways/How-to-view-cpview-history-file-on-other-machi...

or

cpview -t and then press t again

Best,
Andy

the_rock · ‎2025-12-11

I feel it will improve the situation, for sure.

Best,
Andy

the_rock · ‎2025-12-11

I see the point Vince is making. That could absolutely happen due to CPU spike.

Best,
Andy

Vincent_Bacher · ‎2025-12-11

Perhaps I didn't express myself clearly enough as a non-native English speaker, but thank you for the flowers.

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite

the_rock · ‎2025-12-11

You absolutely did, I got all you had to say. Don't worry, English is not my first language either lol

Best,
Andy
