Solved: ClusterXL Issue

Ali_Yaymaci · ‎2018-03-31

Hi guys,

My ClusterXL cluster keeps switching from active to standby, several times in a row. How can I analyze this issue? How can I find out when and why the changes happened?

Regards

Ali

HeikoAnkenbrand · ‎2018-03-31

Hi Ali,

- check cluster state (cphaprob stat)
- check interface errors (cphaprob -a if)
- check change time (clish -c "show routed cluster-state detailed")
- check /var/log/messages
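
The checks above can be collected in one pass. The following is a minimal sketch, not an official tool: `run_if_present` and `collect_cluster_checks` are hypothetical helper names, and the functions only wrap the commands already listed, skipping any that are not installed on the machine.

```shell
# Sketch: run the checks listed above in one pass.
# run_if_present and collect_cluster_checks are illustrative names, not Check Point tools.

# Run a command only if it exists, printing a header first; otherwise report that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "=== $* ==="
        "$@"
    else
        echo "=== skipped: $1 not found ==="
    fi
}

collect_cluster_checks() {
    run_if_present cphaprob stat
    run_if_present cphaprob -a if
    run_if_present clish -c "show routed cluster-state detailed"
    # Assumption: the last 50 lines are enough for a first look; adjust as needed.
    run_if_present tail -n 50 /var/log/messages
}
```

Running `collect_cluster_checks > /var/tmp/cluster_checks.txt` on each member (the output path is only an example) gives one file per gateway that can be compared side by side.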

Regards

Heiko

➜ CCSM Elite, CCME, CCTE, CCVS ➜ www.checkpoint.tips

Timothy_Hall · ‎2018-03-31

In your firewall logs look for "Control" log entries (the associated icon is a wrench), as these will tell you exactly why the cluster failed over. Filter "type:Control" can be used to find these log entries in the R77.30 SmartLog GUI or the R80+ SmartConsole.

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization

Matlu · ‎2023-11-03

Hello,

Taking advantage of this post, I would like to present my case.

I have a ClusterXL HA cluster that has "broken".

Reviewing /var/log/messages, I found some entries that I cannot understand, mainly "Cluster policy installation state freeze ON" and "Cluster policy installation state freeze OFF".

Does this mean that the GW has "frozen"? Or am I interpreting it wrong?

Right now the cluster has only one member, and "cphaprob -a if" shows the Sync interface as down ("disconnected").

I want to find the root cause of this problem with the cluster.

Thanks for your comments.

[Expert@fFW:0]# grep CLUS /var/log/messages
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 3899683516/1142038046)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=83875441, Caller=fwha_set_conf, Type=0 State=ACTIVE)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=83875441, Caller=check_required_if_num)
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
Nov 2 12:15:39 2023 fw1 kernel: [fw4_1];CLUS-120002-1: Cluster policy installation completed successfully without negotiation (new Policy ID: 1142038046)
Nov 2 12:15:40 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (disconnected / link down)
Nov 2 12:15:46 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local probing has started on interface: eth8
Nov 2 12:15:46 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
Nov 2 12:15:51 2023 fw1 kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Nov 2 23:01:09 2023 fw1 kernel: [fw4_1];CLUS-110405-1: State remains: ACTIVE! | Reason: Sync interface is down
Nov 2 23:08:30 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State remains: ACTIVE! | Reason: Interface Sync is down (disconnected / link down)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 1142038046/3231739712)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=84824979, Caller=fwha_set_conf, Type=0 State=ACTIVE)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=84824979, Caller=check_required_if_num)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-120002-1: Cluster policy installation completed successfully without negotiation (new Policy ID: 3231739712)
Nov 3 14:38:43 2023 fw1 kernel: [fw4_1];CLUS-110205-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (disconnected / link down)
Nov 3 14:38:50 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local probing has started on interface: Mgmt
Nov 3 14:38:50 2023 fw1 kernel: [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
Nov 3 14:38:55 2023 fw1 kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state DOWN -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

[Expert@FW:0]# cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 6
Required secured interfaces: 1

Interface Name:      Status:

eth8 (P)             UP
Sync (S)             DOWN (4441.6 secs)
Mgmt (P)             UP
bond2.30 (LS-P)      UP
bond2.240 (LS-P)     UP
bond10.450 (LS-P)    UP
bond10.460 (LS-P)    UP

[Expert@FW:0]# cphaprob state

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State       Name

1 (local)  20.6.5.5        100%            ACTIVE(!)   GW1

Active PNOTEs: LPRB, IAC

Last member state change event:
Event Code: CLUS-110205
State change: ACTIVE -> ACTIVE(!)
Reason for state change: Interface Sync is down (disconnected / link down)
Event time: Fri Nov 3 14:38:43 2023

Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: ADMIN_DOWN PNOTE
Event time: Sat Aug 12 22:40:11 2023

Cluster failover count:
Failover counter: 115
Time of counter reset: Fri Jul 28 09:33:23 2023 (reboot)

Cheers. 🙂

Timothy_Hall · ‎2023-11-03

The freeze has to do with preventing spurious failovers during policy installation and is not related to your problem.

Your Sync interface is not working, check the cable and port settings on both firewalls. Can you ping across the Sync interface?
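
To separate the real state transitions from the policy-installation freeze messages, the log can be filtered for the state lines only. This is a minimal sketch, assuming the `CLUS-` message format shown in the log excerpt above; `cluster_state_events` is a hypothetical helper name, and the log path is passed in as an argument rather than assumed.

```shell
# Print only the "State change" / "State remains" lines from a messages-style log,
# reduced to "timestamp | transition | reason".
# Assumes lines of the form:
#   <timestamp> <host> kernel: [fwX_Y];CLUS-<code>-<id>: State change: ... | Reason: ...
cluster_state_events() {
    grep -E 'CLUS-[0-9]+-[0-9]+: State (change|remains):' "$1" |
        sed -E 's/^(.*) [^ ]+ kernel: .*CLUS-[0-9]+-[0-9]+: (State .*)$/\1 | \2/'
}
```

Called as `cluster_state_events /var/log/messages`, it drops the freeze ON/OFF, probing and policy-installation lines, so the Sync-related transitions are easier to spot.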


Matlu · ‎2023-11-03

Hello,

I get this result.

[Expert@FW:0]# ifconfig -a Sync
Sync      Link encap:Ethernet  HWaddr 00:1C:7F:8C:CF:66
          inet addr:10.10.10.1  Bcast:10.10.10.3  Mask:255.255.255.252
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1393180812 errors:0 dropped:0 overruns:2125 frame:0
          TX packets:1054452841 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:159130019701 (148.2 GiB)  TX bytes:783515347386 (729.7 GiB)

[Expert@fw1:0]#
[Expert@fw1:0]# ping 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
From 10.10.10.1 icmp_seq=1 Destination Host Unreachable
From 10.10.10.1 icmp_seq=2 Destination Host Unreachable
From 10.10.10.1 icmp_seq=3 Destination Host Unreachable

I'm confirming with my client whether the equipment was turned off or disconnected from the "Sync" interface.

I have a concern.

Shouldn't the "cphaprob state" command show me the "Synchronization" IPs of the 2 GWs that are part of a ClusterXL HA?

If so, why does the command, when I run it on the GW that is currently working fine, show a completely different IP from the one configured on the Sync interface?

Is this normal?

Thanks.

Timothy_Hall · ‎2023-11-03

Run ethtool Sync; the interface may not have link. If it does have link and the sync connectivity goes through a switch, the two ports are most likely not in the same VLAN. If the connection is just a single cable, reseat or replace it.
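
The decision above can be sketched as a small filter over the `ethtool` output. This is only an illustration: `sync_link_hint` is a hypothetical helper name, and it relies on nothing more than the "Link detected:" line that `ethtool` prints.

```shell
# Read `ethtool <iface>` output on stdin and print the next step suggested above.
# sync_link_hint is an illustrative helper, not a Check Point or ethtool feature.
sync_link_hint() {
    if grep -q 'Link detected: yes'; then
        echo "link up: if the peer is still unreachable, check that both ports are in the same VLAN"
    else
        echo "no link: check the port settings, then reseat or replace the cable"
    fi
}
```

On the gateway it would be used as `ethtool Sync | sync_link_hint`, with the interface name taken from this thread.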


Matlu · ‎2023-11-03

I think they "blew" the cable 😅😑

[Expert@fFW:0]# ethtool Sync
Settings for Sync:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on (auto)
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: no

Ali_Yaymaci · ‎2018-04-01

I can see cluster flapping during policy installation.

HeikoAnkenbrand · ‎2018-04-01

A typical cause is ClusterXL going "under freeze". sk92723 describes how a ClusterXL administrator can suppress the messages that the Cluster Under Load (CUL) mechanism prints to /var/log/messages and dmesg. I always set this on the cluster to solve the "under freeze" issue.

1) Open the file in vi:

# vi $FWDIR/boot/modules/fwkern.conf

and add the line:

fwha_freez_state_machine_timeout=0

2) Reboot all gateways

If that is not the issue, please send me a message. Then I can give you further debugging information.

Regards,

Heiko


HeikoAnkenbrand · ‎2018-04-01

Sorry, to be clear: configure this on the gateway and then reboot the gateway.

Regards,

Heiko


Ali_Yaymaci · ‎2018-04-01

Is this setting permanent after reboot?

HeikoAnkenbrand · ‎2018-04-01

Yes, it is permanent.

Regards,

Heiko


Ali_Yaymaci · ‎2018-04-01

Hello Heiko,

It works perfectly. The gateway no longer flaps during policy installation.

Thanks for the help.

Ali

AlekseiShelepov · ‎2018-04-01

If you add parameters to the $FWDIR/boot/modules/fwkern.conf file, they survive a reboot, but they are applied only after a reboot. This is mentioned in the sk92723 article linked above; please read it carefully first. You can additionally read the "Changing kernel global parameters" article.

You can also change a parameter with the following commands (applied on the fly, but the change does not survive a reboot):

fw ctl get int <parameter>

fw ctl set int <parameter> <value>

Example for the CUL parameter that Heiko provided:

fw ctl get int fwha_freez_state_machine_timeout - prints the current value of the parameter

fw ctl set int fwha_freez_state_machine_timeout 0 - sets the value of the parameter

I would recommend trying this first to see if it helps.
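
The try-it-first approach can be wrapped so that the value is shown before and after the change. A minimal sketch, assuming only the two `fw ctl` commands quoted above: `try_kernel_param` is a hypothetical helper name, and `FW_BIN` is a hypothetical override that lets the function be dry-run with a stub on a machine without `fw`.

```shell
# Sketch: show a kernel parameter, change it on the fly, and show it again.
# FW_BIN is a hypothetical override for testing; on a gateway it defaults to `fw`.
try_kernel_param() {
    param=$1
    value=$2
    fw_bin=${FW_BIN:-fw}
    if ! command -v "$fw_bin" >/dev/null 2>&1; then
        echo "$fw_bin not found; nothing changed" >&2
        return 1
    fi
    "$fw_bin" ctl get int "$param"            # value before the change
    "$fw_bin" ctl set int "$param" "$value"   # applied immediately, lost on reboot
    "$fw_bin" ctl get int "$param"            # value after the change
}
```

For the parameter in this thread the call would be `try_kernel_param fwha_freez_state_machine_timeout 0`. If the flapping stops, the same setting can then be made permanent in fwkern.conf.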

Yatiraj_Panchal · ‎2019-08-15

Hi Heiko,

Your solution is correct. I applied it on one cluster and it works smoothly. But I have another cluster with the same problem, and there the file does not exist.

If the file $FWDIR/boot/modules/fwkern.conf does not exist, what can we do?

Regards

Yatiraj

Timothy_Hall · ‎2019-08-15

The fwkern.conf file does not always exist by default. So if the file is not there just go ahead and create it.
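
Creating the file and adding the line can be done in one idempotent step. The following is a sketch under stated assumptions: `set_boot_param` is a hypothetical helper name, and it assumes the file holds one `name=value` entry per line, as in the example earlier in this thread.

```shell
# Sketch: make sure "name=value" appears exactly once in a fwkern.conf-style file,
# creating the file if it does not exist. set_boot_param is an illustrative name.
set_boot_param() {
    file=$1
    name=$2
    value=$3
    [ -f "$file" ] || : > "$file"              # create an empty file if it is missing
    tmp="$file.tmp.$$"
    # Keep every line except an existing entry for this parameter.
    grep -v "^${name}=" "$file" > "$tmp" || true
    echo "${name}=${value}" >> "$tmp"
    # Write back in place so the original file's owner and permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

On a gateway this would be called as `set_boot_param "$FWDIR/boot/modules/fwkern.conf" fwha_freez_state_machine_timeout 0`, followed by a reboot, since entries in this file are read only at boot.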


Yatiraj_Panchal · ‎2019-10-24

Hi,
After following the steps above, the ClusterXL issue is still not resolved.
Could you please help me out?

_Val_ · ‎2019-10-28

Please open a support call
