Create a Post
cancel
Showing results for 
Search instead for 
Did you mean: 
Ali_Yaymaci
Participant
Jump to solution

ClusterXL Issue

Hi guys,

I have the problem that the Cluster XL change several times a go from active to standby. How can I analyze this issue? How get I the change out?

Regards

Ali

2 Solutions

Accepted Solutions
HeikoAnkenbrand
Champion Champion
Champion

A typical issue is ClusterXL under freeze. ClusterXL administrator would like to suppress the messages printed by the Cluster Under Load (CUL) mechanism (see sk92723) in the /var/log/messages file and in the dmesg. I always enable this on the cluster to solve this "under freeze" issue.
 

1) Open vi and add the following settings 

  # vi $FWDIR/boot/modules/fwkern.conf

  add the Line:

  fwha_freez_state_machine_timeout=0

2) Reboot all Gateways

If that is not the issue, please send a message. Then I can give you further debugging informations.

Regards,

Heiko

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

View solution in original post

Timothy_Hall
Legend Legend
Legend

The fwkern.conf file does not always exist by default.  So if the file is not there just go ahead and create it.

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

View solution in original post

18 Replies
HeikoAnkenbrand
Champion Champion
Champion

Hi Ali,

- check cluster state (cphaprob stat)
- check interface error (cphaprob -a if )
- check change time (clish -c "show routed cluster-state detailed")
- check /var/log/messages

Regards

Heiko

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
Timothy_Hall
Legend Legend
Legend

In your firewall logs look for "Control" log entries (the associated icon is a wrench), as these will tell you exactly why the cluster failed over.  Filter "type:Control" can be used to find these log entries in the R77.30 SmartLog GUI or the R80+ SmartConsole.

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Matlu
Advisor

Hello,

Taking advantage of this post, I would like to expose my case.

I have a ClusterXL HA, which has "broken".

Reviewing the "messages" I found some messages that I can not understand, basically are the "Cluster policy installation state freeze ON" and "Cluster policy installation state freeze OFF".

Does this mean that the GW has "frozen"? Or am I interpreting it wrong?

The Cluster right now, only has one member, but what I see with the "cphaprob -a if" is that the SYNC interface is "disconnected"?

I want to find the root-cause of this problem with the Cluster.

Thanks for your comments.

 

[Expert@fFW:0]# grep CLUS /var/log/messages
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 3899683516/1142038046)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=83875441, Caller=fwha_set_conf, Type=0 State=ACTIVE)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=83875441, Caller=check_required_if_num)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120002-1: Cluster policy installation completed successfully without negotiation (new Policy ID: 1142038046)
Nov 2 12:15:40 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (disconnected / link down)
Nov 2 12:15:46 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local probing has started on interface: eth8
Nov 2 12:15:46 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
Nov 2 12:15:51 2023 fw1 kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Nov 2 23:01:09 2023 fw1 kernel: [fw4_1];CLUS-110405-1: State remains: ACTIVE! | Reason: Sync interface is down
Nov 2 23:08:30 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State remains: ACTIVE! | Reason: Interface Sync is down (disconnected / link down)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 1142038046/3231739712)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=84824979, Caller=fwha_set_conf, Type=0 State=ACTIVE)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=84824979, Caller=check_required_if_num)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120002-1: Cluster policy installation completed successfully without negotiation (new Policy ID: 3231739712)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (disconnected / link down)
Nov 3 14:38:50 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local probing has started on interface: Mgmt
Nov 3 14:38:50 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
Nov 3 14:38:55 2023 fw1 kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

[Expert@FW:0]# cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 6
Required secured interfaces: 1


Interface Name: Status:

eth8 (P) UP
Sync (S) DOWN (4441.6 secs)
Mgmt (P) UP
bond2.30 (LS-P) UP
bond2.240 (LS-P) UP
bond10.450 (LS-P) UP
bond10.460 (LS-P) UP

[Expert@FW:0]# cphaprob state

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID Unique Address Assigned Load State Name

1 (local) 20.6.5.5 100% ACTIVE(!) GW1


Active PNOTEs: LPRB, IAC

Last member state change event:
Event Code: CLUS-110205
State change: ACTIVE -> ACTIVE(!)
Reason for state change: Interface Sync is down (disconnected / link down)
Event time: Fri Nov 3 14:38:43 2023

Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: ADMIN_DOWN PNOTE
Event time: Sat Aug 12 22:40:11 2023

Cluster failover count:
Failover counter: 115
Time of counter reset: Fri Jul 28 09:33:23 2023 (reboot)

Cheers. 🙂

Timothy_Hall
Legend Legend
Legend

The freeze has to do with preventing spurious failovers during policy installation and is not related to your problem.

Your Sync interface is not working, check the cable and port settings on both firewalls.  Can you ping across the Sync interface?

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Matlu
Advisor

Hello,

I get this result.

SYNCINT.png

[Expert@FW:0]# ifconfig -a Sync
Sync Link encap:Ethernet HWaddr 00:1C:7F:8C:CF:66
inet addr:10.10.10.1 Bcast:10.10.10.3 Mask:255.255.255.252
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:1393180812 errors:0 dropped:0 overruns:2125 frame:0
TX packets:1054452841 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:159130019701 (148.2 GiB) TX bytes:783515347386 (729.7 GiB)

[Expert@fw1:0]#
[Expert@fw1:0]# ping 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
From 10.10.10.1 icmp_seq=1 Destination Host Unreachable
From 10.10.10.1 icmp_seq=2 Destination Host Unreachable
From 10.10.10.1 icmp_seq=3 Destination Host Unreachable

I'm confirming with my client, if the equipment was turned off, or disconnected from the "Sync" interface.

I have a concern.

Shouldn't the "cphaprob state" command show me the "Synchronization" IPs of the 2 GWs that are part of a ClusterXL HA?

If so, why the command that I apply in the GW that is now working fine, shows me a totally different IP than the one configured in the Sync interface.

Is this something normal?

Thanks.

Timothy_Hall
Legend Legend
Legend

Run ethtool Sync, may not have link.  If it does and the sync connectivity is through a switch, they are not on the same VLAN with each other.  If connectivity is just a single cable, reseat or replace it.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Matlu
Advisor

I think they "blew" the cable 😅😑

 

[Expert@fFW:0]# ethtool Sync
Settings for Sync:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: on (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: no

Ali_Yaymaci
Participant

I can see cluster flapping during policy installation.

HeikoAnkenbrand
Champion Champion
Champion

A typical issue is ClusterXL under freeze. ClusterXL administrator would like to suppress the messages printed by the Cluster Under Load (CUL) mechanism (see sk92723) in the /var/log/messages file and in the dmesg. I always enable this on the cluster to solve this "under freeze" issue.
 

1) Open vi and add the following settings 

  # vi $FWDIR/boot/modules/fwkern.conf

  add the Line:

  fwha_freez_state_machine_timeout=0

2) Reboot all Gateways

If that is not the issue, please send a message. Then I can give you further debugging informations.

Regards,

Heiko

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
HeikoAnkenbrand
Champion Champion
Champion

Sorry, configure this on gateway and reboot the gateway.

Regards,

Heiko

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
Ali_Yaymaci
Participant

Is this setting permanent after reboot?

HeikoAnkenbrand
Champion Champion
Champion

Yes, it is permanent.

Regards,

Heiko

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
Ali_Yaymaci
Participant

Hello Heiko,

it works perfectly. The gateway no longer flippig during policy installation.

Thanks for the help.

Ali

AlekseiShelepov
Advisor

If you add parameters in $FWDIR/boot/modules/fwkern.conf file, then they survive a reboot and applied only after a reboot. This is mentioned in the provided sk92723, please read it carefully first. You can additionally read Changing kernel global parameters article.

You can also try to change a parameter by the following commands (applied on-the-fly, not survive a reboot):

fw ctl get in <paramater>

fw ctl set int <parameter> <value>

Example for CUL mechanism, that Heiko provided:

fw ctl get int fwha_freez_state_machine_timeout - prints current value of the parameter

fw ctl set int fwha_freez_state_machine_timeout 0 - sets value for the parameter

I would recommend to try it first, and see if it helps.

Yatiraj_Panchal
Contributor

Hi Haiko,

Your solution is correct, I have applied on one cluster and it works smooth. But I have one more cluster have the same problem, but the file is not available. 

If this file vi $FWDIR/boot/modules/fwkern.conf is not available then, what we can?

 

Regards

Yatiraj

Timothy_Hall
Legend Legend
Legend

The fwkern.conf file does not always exist by default.  So if the file is not there just go ahead and create it.

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Yatiraj_Panchal
Contributor
Hi,
After doing the above steps still, ClusterXL issue is not resolved.
Could you please help me out.
_Val_
Admin
Admin

Please open a support call

Leaderboard

Epsum factorial non deposit quid pro quo hic escorol.

Upcoming Events

    CheckMates Events