Failover
Hello everyone,
I have a question regarding the scenarios in which a firewall in an HA (High Availability) setup could fail over automatically.
Let me explain:
Suppose we have two nodes in HA. We start by installing a Jumbo Hotfix on the standby cluster member. Once the installation is complete, we perform a failover on the active member to switch traffic to the standby, allowing us to test if everything is working properly.
After confirming that everything is fine, we proceed with the same upgrade on the now-standby member. However, after the upgrade finishes, the cluster fails over automatically and that member becomes active again, even though no manual failover command was issued (e.g., clusterXL_admin down).
What could be causing this behavior?
How can we investigate further to identify potential issues?
Running cphaprob stat shows an error as well, but the message is not very clear.
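In case it helps, these are the other checks I plan to run on both members (just a sketch, assuming the standard Gaia ClusterXL commands; availability and output vary a bit by version):
cphaprob state      # cluster members, their states, and the last state change event
cphaprob -l list    # registered pnotes (critical devices) and their current states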
Any insights would be greatly appreciated!
Thank you very much.
At first sight, it could be a SYNC issue.
- Did this cause a split-brain situation?
- Was/were the SYNC interface(s) always up?
What does cphaprob syncstat say?
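Something like this on each member should show it (a rough sketch, assuming the standard ClusterXL commands; the output format differs a bit between versions):
cphaprob -a if       # state of the cluster interfaces, including the Sync interface(s)
cphaprob syncstat    # Delta Sync statistics: drops, retransmissions, queue usage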
\m/_(>_<)_\m/
This is the output:
Delta Sync Statistics
Sync status: OK
Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0
Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0
Sent messages:
Total generated sync messages................ 2388334
Sent retransmission requests................. 2
Sent retransmission updates.................. 16
Peak fragments per update.................... 2
Received messages:
Total received updates....................... 113310
Received retransmission requests............. 9
Sync Interface:
Name......................................... Mgmt
Link speed................................... 1000Mb/s
Rate......................................... 11065 [KBps]
Peak rate.................................... 11765 [KBps]
Link usage................................... 8%
Total........................................ 123728[MB]
Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50
Timers:
Delta Sync interval (ms)..................... 100
Reset on Tue Feb 18 11:14:19 2025 (triggered by fullsync).
Hm. Did you experience an outage?
And what is the setting here?
\m/_(>_<)_\m/
It is the same.
What does the other GW member itself say about this in /var/log/messages?
And what are the entries on both members around the critical time?
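For example, something along these lines on both members (a sketch; adjust the pattern to the time window of the upgrade):
grep CLUS- /var/log/messages*          # all ClusterXL state change events, with timestamps
grep 'Feb 18 12:' /var/log/messages    # narrow down to the critical hour of that day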
\m/_(>_<)_\m/
Maybe this one:
Feb 18 12:06:04 2025 kernel:[fw4_1];CLUS-212100-1: Remote member 2 (state STANDBY -> DOWN) | Reason: FULLSYNC PNOTE
Feb 18 12:06:24 2025 kernel:[fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Feb 18 12:06:24 2025 kernel:[fw4_1];fwha_mvc_init_member_data: Zeroing member 2 data (version and other ...)
Feb 18 12:06:24 2025 kernel:[fw4_1];fwha_mvc_update_version_to_send: MVC Updating versions to send: 4251 0
Feb 18 12:11:27 2025 expert: SSH connection by admin user to Expert Shell with client IP 10.130.181.10 at 12:11 02/18/2025
Feb 18 12:20:54 2025 kernel:perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Feb 18 12:25:43 2025 kernel:[fw4_1];cpas_newconn_ex : called upon something other than tcp SYN. Aborting
Feb 18 12:27:34 2025 kernel:[fw4_1];fwha_mvc_update_member_info: MVC Setting member's 2 version to 4251
Feb 18 12:27:34 2025 kernel:[fw4_1];fwha_mvc_update_version_to_send: MVC Updating versions to send: 4251 0
Feb 18 12:27:34 2025 kernel:[fw4_1];CLUS-212101-1: Remote member 2 (state LOST -> INIT) | Reason: FULLSYNC PNOTE
Feb 18 12:27:37 2025 kernel:[fw4_0];fwx_find_domain_ip_with_msg: ld_get_nat_rule 27491 failed
Feb 18 12:27:37 2025 kernel:[fw4_0];fwx_cache_lookup: fwx_find_domain_ip_with_msg failed
Feb 18 12:28:12 2025 kernel:[fw4_1];CLUS-212100-1: Remote member 2 (state INIT -> DOWN) | Reason: FULLSYNC PNOTE
Feb 18 12:28:18 2025 kernel:[fw4_0];FULLSYNC: Server Starting sync on IPv4 instance #0
Feb 18 12:28:18 2025 kernel:[fw4_0];FULLSYNC: Server Finished to sync instance 0
Feb 18 12:28:18 2025 kernel:[fw4_1];FULLSYNC: Server Starting sync on IPv4 instance #1
Feb 18 12:28:19 2025 kernel:[fw4_1];FULLSYNC: Server Finished to sync instance 1
Feb 18 12:28:19 2025 kernel:[fw4_1];FULLSYNC: Server FULLSYNC Finished, Total Time = 0.77 Seconds.
Feb 18 12:28:19 2025 kernel:[fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Feb 18 12:29:32 2025 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (86%) on the remote member 2 increased above the configured threshold (80%).
Feb 18 12:29:43 2025 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Hi @RemoteUser,
This entry is interesting:
Feb 18 12:06:24 2025 kernel:[fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
According to this article: https://support.checkpoint.com/results/sk/sk125152
CLUS-216400 | Timeout Control Protocol packet expired member declared as DEAD | Local member lost connectivity to a specific peer cluster member
But how long did this active/active scenario last?
Akos
\m/_(>_<)_\m/
But how long did this active/active scenario last?
Unfortunately, I didn't get a chance to see because I was doing something else in the meantime....
But what I don't understand is that cphaprob stat shows me a different event (under "Last member state change event") than what I see in the var logs...
Why?
In cphaprob stat: Event Code CLUS-114704
In /var/log/messages: CLUS-216400
I think it's because of this:
Feb 18 12:26:15 2025 kernel:[fw4_1];CLUS-110200-2: State change: INIT -> DOWN | Reason: Interface Mgmt is down (disconnected / link down)
Feb 18 12:26:15 2025 kernel:[fw4_1];CLUS-112100-2: State remains: DOWN | Reason: Previous problem resolved, FULLSYNC PNOTE
Feb 18 12:26:15 2025 kernel:igb: eth3: igb_set_rss_hash_opt: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Feb 18 12:26:15 2025 kernel:igb: eth7: igb_set_rss_hash_opt: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Feb 18 12:26:16 2025 kernel:igb: eth8: igb_set_rss_hash_opt: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Feb 18 12:26:19 2025 kernel:igb 0000:0e:00.0 Mgmt: igb: Mgmt NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Feb 18 12:26:22 2025 kernel:[fw4_1];fwha_pnote_report_state_internal: rgstr_num: -1, ireg: 0, description: Fullsync, state: 0, type: 0, proc: 0
Feb 18 12:26:23 2025 kernel:[fw4_1];FULLSYNC: Client Started Fullsync - Running from instance 0 till the last instance
Feb 18 12:26:24 2025 kernel:[fw4_1];FULLSYNC: Client Fullsync Finished, Total Time = 0.73 Seconds
Feb 18 12:26:24 2025 kernel:[fw4_1];CLUS-120108-2: Fullsync PNOTE OFF
Feb 18 12:26:24 2025 kernel:[fw4_1];CLUS-120122-2: Fullsync completed successfully
Feb 18 12:26:24 2025 kernel:[fw4_1];CLUS-114802-2: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 1)
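As far as I can tell from these entries, the CLUS-11xxxx codes are the local member's own state changes, while the CLUS-21xxxx codes describe the remote member as seen from this member. So cphaprob stat apparently reports the local member's last state change event (CLUS-114704), while the CLUS-216400 line in /var/log/messages refers to the remote member.
On recent versions the failover history can also be checked directly (a sketch; command availability depends on the version):
cphaprob state           # includes the 'Last member state change event' and its event code
cphaprob show_failover   # failover count and the reason for the most recent failover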
Is this a Gaia ClusterXL or an SMB cluster? For Gaia it is explained here:
Upon the recovery of a failed former Active Cluster Member with a higher priority, the role of the Active Cluster Member may or may not be switched back to that Cluster Member. This depends on the cluster object configuration - Maintain current active Cluster Member, or Switch to higher priority Cluster Member.
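A quick way to check which behaviour is configured (a sketch; the exact output wording differs a bit between versions):
cphaprob state   # look at the 'Cluster Mode' line
# 'High Availability (Active Up)'  -> Maintain current active Cluster Member
# 'High Availability (Primary Up)' -> Switch to higher priority Cluster Member
The setting itself is on the cluster object in SmartConsole, on the ClusterXL and VRRP page, if I remember correctly.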
It's a GAIA cluster
So the behaviour depends on the cluster object configuration - Maintain current active Cluster Member, or Switch to higher priority Cluster member.
