ClusterXL Issue
Hi guys,
I have the problem that my ClusterXL cluster repeatedly switches between active and standby. How can I analyze this issue? How can I find out when and why the state changes?
Regards
Ali
Hi Ali,
- check the cluster state (cphaprob stat)
- check for interface errors (cphaprob -a if)
- check the time of the state change (clish -c "show routed cluster-state detailed")
- check /var/log/messages
Regards
Heiko
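The four checks above can be collected in one pass. This is only a minimal sketch, assuming it runs in expert mode on a gateway. The helper names run_check and cluster_snapshot are made up for illustration and are not part of any Check Point tooling. Commands that are not found are reported and skipped.

```shell
#!/bin/bash
# Hypothetical helper: print a header, then run the command if it exists.
run_check() {
    local title="$1"
    shift
    echo "=== $title ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(command not found: $1)"
    fi
    echo
}

# Hypothetical wrapper around the checks listed in the post above.
# Optional argument: path to the messages file (defaults to /var/log/messages).
cluster_snapshot() {
    local messages="${1:-/var/log/messages}"
    run_check "Cluster state" cphaprob stat
    run_check "Interface status" cphaprob -a if
    run_check "Routed cluster state" clish -c "show routed cluster-state detailed"
    echo "=== Last 50 ClusterXL messages ==="
    grep 'CLUS-' "$messages" | tail -n 50
}
```

On a gateway, something like `cluster_snapshot > /var/tmp/cluster_snapshot.txt` would save the output so it can be compared before and after a failover.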
In your firewall logs, look for "Control" log entries (the associated icon is a wrench), as these will tell you exactly why the cluster failed over. The filter "type:Control" finds these log entries in the R77.30 SmartLog GUI or the R80+ SmartConsole.
--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com
Hello,
Taking advantage of this thread, I would like to present my case.
I have a ClusterXL HA cluster that has "broken".
Reviewing /var/log/messages, I found some entries that I cannot understand, mainly "Cluster policy installation state freeze ON" and "Cluster policy installation state freeze OFF".
Does this mean that the GW has "frozen"? Or am I interpreting it wrong?
The cluster currently has only one member, and "cphaprob -a if" shows the Sync interface as "disconnected".
I want to find the root cause of this problem with the cluster.
Thanks for your comments.
[Expert@fFW:0]# grep CLUS /var/log/messages
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 3899683516/1142038046)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=83875441, Caller=fwha_set_conf, Type=0 State=ACTIVE)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=83875441, Caller=check_required_if_num)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120002-1: Cluster policy installation completed successfully without negotiation (new Policy ID: 1142038046)
Nov 2 12:15:40 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (disconnected / link down)
Nov 2 12:15:46 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local probing has started on interface: eth8
Nov 2 12:15:46 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
Nov 2 12:15:51 2023 fw1 kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Nov 2 23:01:09 2023 fw1 kernel: [fw4_1];CLUS-110405-1: State remains: ACTIVE! | Reason: Sync interface is down
Nov 2 23:08:30 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State remains: ACTIVE! | Reason: Interface Sync is down (disconnected / link down)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 1142038046/3231739712)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=84824979, Caller=fwha_set_conf, Type=0 State=ACTIVE)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=84824979, Caller=check_required_if_num)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120002-1: Cluster policy installation completed successfully without negotiation (new Policy ID: 3231739712)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (disconnected / link down)
Nov 3 14:38:50 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local probing has started on interface: Mgmt
Nov 3 14:38:50 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
Nov 3 14:38:55 2023 fw1 kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
[Expert@FW:0]# cphaprob -a if
CCP mode: Manual (Unicast)
Required interfaces: 6
Required secured interfaces: 1
Interface Name:      Status:
eth8 (P)             UP
Sync (S)             DOWN (4441.6 secs)
Mgmt (P)             UP
bond2.30 (LS-P)      UP
bond2.240 (LS-P)     UP
bond10.450 (LS-P)    UP
bond10.460 (LS-P)    UP
[Expert@FW:0]# cphaprob state
Cluster Mode: High Availability (Active Up) with IGMP Membership
ID         Unique Address  Assigned Load   State          Name
1 (local)  20.6.5.5        100%            ACTIVE(!)      GW1
Active PNOTEs: LPRB, IAC
Last member state change event:
Event Code: CLUS-110205
State change: ACTIVE -> ACTIVE(!)
Reason for state change: Interface Sync is down (disconnected / link down)
Event time: Fri Nov 3 14:38:43 2023
Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: ADMIN_DOWN PNOTE
Event time: Sat Aug 12 22:40:11 2023
Cluster failover count:
Failover counter: 115
Time of counter reset: Fri Jul 28 09:33:23 2023 (reboot)
Cheers. 🙂
The freeze has to do with preventing spurious failovers during policy installation and is not related to your problem.
Your Sync interface is not working; check the cable and port settings on both firewalls. Can you ping across the Sync interface?
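To separate real state changes from the policy-installation freeze messages, the CLUS lines can be filtered by their message text. A minimal sketch, assuming the log format shown in the grep output earlier in this thread; the function name clus_state_events is illustrative only.

```shell
#!/bin/bash
# Print only ClusterXL state transitions (and their reasons) from a messages file.
# Assumes lines of the form "...CLUS-<code>-<member>: State change: ..." or
# "...CLUS-<code>-<member>: State remains: ...", as in the output quoted above.
# Optional argument: path to the messages file (defaults to /var/log/messages).
clus_state_events() {
    grep -E 'CLUS-[0-9]+-[0-9]+: State (change|remains):' "${1:-/var/log/messages}"
}
```

Run against the log quoted above, this would drop the "freeze ON" / "freeze OFF" and probing lines and keep the entries whose reason is "Interface Sync is down".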
Hello,
I get this result.
[Expert@FW:0]# ifconfig -a Sync
Sync Link encap:Ethernet HWaddr 00:1C:7F:8C:CF:66
inet addr:10.10.10.1 Bcast:10.10.10.3 Mask:255.255.255.252
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:1393180812 errors:0 dropped:0 overruns:2125 frame:0
TX packets:1054452841 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:159130019701 (148.2 GiB) TX bytes:783515347386 (729.7 GiB)
[Expert@fw1:0]#
[Expert@fw1:0]# ping 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
From 10.10.10.1 icmp_seq=1 Destination Host Unreachable
From 10.10.10.1 icmp_seq=2 Destination Host Unreachable
From 10.10.10.1 icmp_seq=3 Destination Host Unreachable
I'm confirming with my client whether the peer was powered off or its "Sync" interface was disconnected.
I have a question.
Shouldn't the "cphaprob state" command show me the "Synchronization" IPs of the 2 GWs that are part of a ClusterXL HA?
If so, why does the command, run on the GW that is currently working fine, show me a completely different IP from the one configured on the Sync interface?
Is this normal?
Thanks.
Run "ethtool Sync"; the interface may not have link. If it does have link and the sync connection goes through a switch, the two ports are probably not in the same VLAN. If the connection is just a single cable, reseat or replace it.
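The decision described above can be sketched as a small filter over the ethtool output. This assumes the standard "Link detected: yes/no" line that ethtool prints; the function names link_detected and sync_link_hint are hypothetical.

```shell
#!/bin/bash
# Read `ethtool <iface>` output on stdin and print the "Link detected" value.
link_detected() {
    awk -F': ' '/Link detected/ { print $2 }'
}

# Read `ethtool <iface>` output on stdin and print the next step suggested above.
sync_link_hint() {
    case "$(link_detected)" in
        yes) echo "link up: check that both Sync ports are in the same VLAN" ;;
        no)  echo "no link: reseat or replace the cable, check port settings" ;;
        *)   echo "could not read link state" ;;
    esac
}
```

On the gateway this would be used as `ethtool Sync | sync_link_hint`.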
I think they "blew" the cable 😅😑
[Expert@fFW:0]# ethtool Sync
Settings for Sync:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: on (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: no
I can see cluster flapping during policy installation.
A typical cause is ClusterXL going "under freeze". The Cluster Under Load (CUL) mechanism (see sk92723) logs these events in the /var/log/messages file and in dmesg. I always apply the following setting on the cluster to solve this "under freeze" issue.
1) Open the file in vi:
# vi $FWDIR/boot/modules/fwkern.conf
and add the line:
fwha_freez_state_machine_timeout=0
2) Reboot all gateways.
If that is not the issue, please send me a message and I can give you further debugging information.
Regards,
Heiko
Sorry, configure this on the gateway and then reboot the gateway.
Regards,
Heiko
Is this setting permanent after reboot?
Yes, it is permanent.
Regards,
Heiko
Hello Heiko,
It works perfectly. The gateway no longer flaps during policy installation.
Thanks for the help.
Ali
If you add parameters to the $FWDIR/boot/modules/fwkern.conf file, they survive a reboot but take effect only after a reboot. This is covered in sk92723, referenced earlier; please read it carefully first. You can additionally read the "Changing kernel global parameters" article.
You can also change a parameter with the following commands (applied on the fly, but the change does not survive a reboot):
fw ctl get int <parameter>
fw ctl set int <parameter> <value>
Example for the CUL mechanism parameter that Heiko provided:
fw ctl get int fwha_freez_state_machine_timeout - prints the current value of the parameter
fw ctl set int fwha_freez_state_machine_timeout 0 - sets the value of the parameter
I would recommend trying this first to see if it helps.
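Once the on-the-fly change has been confirmed to help, the same value can be made persistent in fwkern.conf. The following is only a sketch of that step: set_fwkern_param is a hypothetical helper, not a Check Point command. It creates the file if it does not exist and either updates an existing "parameter=value" line or appends a new one. It assumes the file's directory already exists and that the parameter name and value contain no characters special to sed.

```shell
#!/bin/bash
# Hypothetical helper: add or update "param=value" in a fwkern.conf-style file.
# Usage: set_fwkern_param <file> <param> <value>
set_fwkern_param() {
    local file="$1" param="$2" value="$3"
    local tmp="${file}.tmp.$$"
    # Create the file if it is missing (it does not always exist by default).
    touch "$file" || return 1
    if grep -q "^${param}=" "$file"; then
        # Rewrite the existing line; other parameters stay untouched.
        sed "s/^${param}=.*/${param}=${value}/" "$file" > "$tmp" && mv "$tmp" "$file"
    else
        echo "${param}=${value}" >> "$file"
    fi
}

# Example on a gateway (not executed here):
#   fw ctl set int fwha_freez_state_machine_timeout 0    # try on the fly first
#   fw ctl get int fwha_freez_state_machine_timeout      # confirm the value
#   set_fwkern_param "$FWDIR/boot/modules/fwkern.conf" fwha_freez_state_machine_timeout 0
```

As noted above, the entry in fwkern.conf is only read at boot, so the persistent value takes effect after the next reboot.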
Hi Heiko,
Your solution is correct; I applied it on one cluster and it works smoothly. I have another cluster with the same problem, but the file does not exist there.
If the file $FWDIR/boot/modules/fwkern.conf does not exist, what can we do?
Regards
Yatiraj
The fwkern.conf file does not always exist by default. So if the file is not there just go ahead and create it.
After following the steps above, the ClusterXL issue is still not resolved.
Could you please help me out?
Please open a support call.