Failover
Hello everyone,
I have a question regarding the scenarios in which a firewall in an HA (High Availability) setup could fail over automatically.
Let me explain:
Suppose we have two nodes in HA. We start by installing a Jumbo Hotfix on the standby cluster member. Once the installation is complete, we perform a failover on the active member to switch traffic to the standby, allowing us to test if everything is working properly.
After confirming that everything is fine, we proceed with the same upgrade on the now-standby member. However, after the upgrade finishes, the cluster fails over automatically and that member becomes active again, even though no manual failover command was issued (e.g., clusterXL_admin down).
What could be causing this behavior?
How can we investigate further to identify potential issues?
Running cphaprob stat shows an error as well, but the message is not very clear.
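In case it helps, these are the other checks I plan to run on both members (just a sketch, assuming the standard Gaia ClusterXL commands; availability and output vary a bit by version):
cphaprob state      # cluster members, their states, and the last state change event
cphaprob -l list    # registered pnotes (critical devices) and their current states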
Any insights would be greatly appreciated!
Thank you very much.
At first sight, it could be a SYNC issue.
- Did this cause a split-brain situation?
- Was/were the SYNC interface(s) always up?
What does cphaprob syncstat say?
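Something like this on each member should show it (a rough sketch, assuming the standard ClusterXL commands; the output format differs a bit between versions):
cphaprob -a if       # state of the cluster interfaces, including the Sync interface(s)
cphaprob syncstat    # Delta Sync statistics: drops, retransmissions, queue usage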
\m/_(>_<)_\m/
This is the output:
Delta Sync Statistics
Sync status: OK
Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0
Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0
Sent messages:
Total generated sync messages................ 2388334
Sent retransmission requests................. 2
Sent retransmission updates.................. 16
Peak fragments per update.................... 2
Received messages:
Total received updates....................... 113310
Received retransmission requests............. 9
Sync Interface:
Name......................................... Mgmt
Link speed................................... 1000Mb/s
Rate......................................... 11065 [KBps]
Peak rate.................................... 11765 [KBps]
Link usage................................... 8%
Total........................................ 123728[MB]
Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50
Timers:
Delta Sync interval (ms)..................... 100
Reset on Tue Feb 18 11:14:19 2025 (triggered by fullsync).
Hm. Did you experience an outage?
And what is the setting here?
\m/_(>_<)_\m/
It is the same.
What does the other GW member itself say about this in /var/log/messages?
And what are the entries on both members around the critical time?
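For example, something along these lines on both members (a sketch; adjust the pattern to the time window of the upgrade):
grep CLUS- /var/log/messages*          # all ClusterXL state change events, with timestamps
grep 'Feb 18 12:' /var/log/messages    # narrow down to the critical hour of that day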
\m/_(>_<)_\m/
Maybe this one:
Feb 18 12:06:04 2025 kernel:[fw4_1];CLUS-212100-1: Remote member 2 (state STANDBY -> DOWN) | Reason: FULLSYNC PNOTE
Feb 18 12:06:24 2025 kernel:[fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Feb 18 12:06:24 2025 kernel:[fw4_1];fwha_mvc_init_member_data: Zeroing member 2 data (version and other ...)
Feb 18 12:06:24 2025 kernel:[fw4_1];fwha_mvc_update_version_to_send: MVC Updating versions to send: 4251 0
Feb 18 12:11:27 2025 expert: SSH connection by admin user to Expert Shell with client IP 10.130.181.10 at 12:11 02/18/2025
Feb 18 12:20:54 2025 kernel:perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Feb 18 12:25:43 2025 kernel:[fw4_1];cpas_newconn_ex : called upon something other than tcp SYN. Aborting
Feb 18 12:27:34 2025 kernel:[fw4_1];fwha_mvc_update_member_info: MVC Setting member's 2 version to 4251
Feb 18 12:27:34 2025 kernel:[fw4_1];fwha_mvc_update_version_to_send: MVC Updating versions to send: 4251 0
Feb 18 12:27:34 2025 kernel:[fw4_1];CLUS-212101-1: Remote member 2 (state LOST -> INIT) | Reason: FULLSYNC PNOTE
Feb 18 12:27:37 2025 kernel:[fw4_0];fwx_find_domain_ip_with_msg: ld_get_nat_rule 27491 failed
Feb 18 12:27:37 2025 kernel:[fw4_0];fwx_cache_lookup: fwx_find_domain_ip_with_msg failed
Feb 18 12:28:12 2025 kernel:[fw4_1];CLUS-212100-1: Remote member 2 (state INIT -> DOWN) | Reason: FULLSYNC PNOTE
Feb 18 12:28:18 2025 kernel:[fw4_0];FULLSYNC: Server Starting sync on IPv4 instance #0
Feb 18 12:28:18 2025 kernel:[fw4_0];FULLSYNC: Server Finished to sync instance 0
Feb 18 12:28:18 2025 kernel:[fw4_1];FULLSYNC: Server Starting sync on IPv4 instance #1
Feb 18 12:28:19 2025 kernel:[fw4_1];FULLSYNC: Server Finished to sync instance 1
Feb 18 12:28:19 2025 kernel:[fw4_1];FULLSYNC: Server FULLSYNC Finished, Total Time = 0.77 Seconds.
Feb 18 12:28:19 2025 kernel:[fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Feb 18 12:29:32 2025 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (86%) on the remote member 2 increased above the configured threshold (80%).
Feb 18 12:29:43 2025 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Hi @RemoteUser,
This entry is interesting:
Feb 18 12:06:24 2025 kernel:[fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
According to this article: https://support.checkpoint.com/results/sk/sk125152
CLUS-216400 | Timeout Control Protocol packet expired member declared as DEAD | Local member lost connectivity to a specific peer cluster member
But how long did this active/active scenario last?
Akos
\m/_(>_<)_\m/
But how long did this active/active scenario last?
Unfortunately, I didn't get a chance to see because I was doing something else in the meantime....
But what I don't understand is that cphaprob stat shows me a different event (under "Last member state change event") than what I see in the var logs...
Why?
In cphaprob stat: Event Code CLUS-114704
In /var/log/messages: CLUS-216400
I think it's because of this:
Feb 18 12:26:15 2025 kernel:[fw4_1];CLUS-110200-2: State change: INIT -> DOWN | Reason: Interface Mgmt is down (disconnected / link down)
Feb 18 12:26:15 2025 kernel:[fw4_1];CLUS-112100-2: State remains: DOWN | Reason: Previous problem resolved, FULLSYNC PNOTE
Feb 18 12:26:15 2025 kernel:igb: eth3: igb_set_rss_hash_opt: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Feb 18 12:26:15 2025 kernel:igb: eth7: igb_set_rss_hash_opt: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Feb 18 12:26:16 2025 kernel:igb: eth8: igb_set_rss_hash_opt: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Feb 18 12:26:19 2025 kernel:igb 0000:0e:00.0 Mgmt: igb: Mgmt NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Feb 18 12:26:22 2025 kernel:[fw4_1];fwha_pnote_report_state_internal: rgstr_num: -1, ireg: 0, description: Fullsync, state: 0, type: 0, proc: 0
Feb 18 12:26:23 2025 kernel:[fw4_1];FULLSYNC: Client Started Fullsync - Running from instance 0 till the last instance
Feb 18 12:26:24 2025 kernel:[fw4_1];FULLSYNC: Client Fullsync Finished, Total Time = 0.73 Seconds
Feb 18 12:26:24 2025 kernel:[fw4_1];CLUS-120108-2: Fullsync PNOTE OFF
Feb 18 12:26:24 2025 kernel:[fw4_1];CLUS-120122-2: Fullsync completed successfully
Feb 18 12:26:24 2025 kernel:[fw4_1];CLUS-114802-2: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 1)
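As far as I can tell from these entries, the CLUS-11xxxx codes are the local member's own state changes, while the CLUS-21xxxx codes describe the remote member as seen from this member. So cphaprob stat apparently reports the local member's last state change event (CLUS-114704), while the CLUS-216400 line in /var/log/messages refers to the remote member.
On recent versions the failover history can also be checked directly (a sketch; command availability depends on the version):
cphaprob state           # includes the 'Last member state change event' and its event code
cphaprob show_failover   # failover count and the reason for the most recent failover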
Is this a Gaia ClusterXL or an SMB cluster? For Gaia it is explained here:
Upon the recovery of a failed former Active Cluster Member with a higher priority, the role of the Active Cluster Member may or may not be switched back to that Cluster Member. This depends on the cluster object configuration - Maintain current active Cluster Member, or Switch to higher priority Cluster Member.
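A quick way to check which behaviour is configured (a sketch; the exact output wording differs a bit between versions):
cphaprob state   # look at the 'Cluster Mode' line
# 'High Availability (Active Up)'  -> Maintain current active Cluster Member
# 'High Availability (Primary Up)' -> Switch to higher priority Cluster Member
The setting itself is on the cluster object in SmartConsole, on the ClusterXL and VRRP page, if I remember correctly.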
It's a GAIA cluster
So the behaviour depends on the cluster object configuration - Maintain current active Cluster Member, or Switch to higher priority Cluster member.
