ClusterXL issues after carrying out steps in sk43807

Hi all, we recently installed R80.40 - one management server and two gateways in a ClusterXL HA cluster. Since the start we have had an issue retrieving IPS updates, Anti-Bot updates and NTP from the passive gateway. If you fail the members over, the issue follows the passive member, so it's not policy. Digging around on these forums led me to try the steps listed in sk43807 -

as this appeared to help others (albeit on R80.30). Whilst it has indeed resolved the updates and time sync on the passive gateway, an unexpected side effect is that ClusterXL is no longer working correctly.


I also cannot now connect to the web interface of the passive node; it does not time out or error - it just hangs when attempting to connect. I have rebooted the passive gateway and it had no effect.

Output below from the CLI, which is still reachable:

    cphaprob stat

    Cluster Mode: High Availability (Active Up) with IGMP Membership

    ID Unique Address Assigned Load State Name

    1 (local) 0% INIT SSSLFW02-pri
    2 100% ACTIVE SSSLFW02-sec


    Last member state change event:
    Event Code: CLUS-112101
    State change: INIT
    Reason for state change: FULLSYNC PNOTE
    Event time: Thu May 14 11:38:05 2020

    Cluster failover count:
    Failover counter: 0
    Time of counter reset: Tue May 5 14:06:38 2020 (reboot)

    [Expert@SSSLFW02-pri:0]# cphaprob -i list

    Built-in Devices:

    Device Name: Interface Active Check
    Current state: problem (non-blocking)

    Device Name: HA Initialization
    Current state: initializing


Any help appreciated.


4 Replies

The non-active member in your cluster appears to be stuck in an HA Init state. That member won't interact with the network at all until it completes initialization and goes standby, to ensure it doesn't mess things up before getting the "lay of the land" in the cluster. It seems to be stuck trying to get an initial full sync from the active member; check your sync network and its associated configuration.
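For anyone landing here with the same symptom, a few standard commands help check the sync network from expert mode on the member stuck in INIT (these are stock ClusterXL diagnostics; exact output varies by version):

```
# List cluster interfaces and their state; confirm the sync interface is UP
# and that both members agree on which interface carries sync.
cphaprob -a if

# Sync transport statistics (R80.20 and later); growing lost/retransmitted
# packet counters point at a sync network problem.
cphaprob syncstat

# Kernel statistics, including a Sync section with error counters.
fw ctl pstat
```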

Thanks Tim - after a frustrating day yesterday, the issue corrected itself after about two hours of being left alone.

To be clear, I changed nothing in the ClusterXL configuration, which was working fine up until that point. I added ports 80, 443, 53 and 123 to the table.def file on the Security Management Server and pushed the policy, as per the instructions in sk43807. Initially all was good, then the clustering stopped.
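For context, the sk43807 change boils down to extending the no_hide_services_ports set in $FWDIR/lib/table.def on the management server, so that traffic on those ports initiated from the members' own addresses is not hidden behind the cluster VIP. A sketch of the edited line (the pre-existing entries vary by version, so treat this as illustrative rather than a verbatim copy of the file):

```
/* $FWDIR/lib/table.def on the management server - illustrative sketch.  */
/* Entries are <port, protocol>; protocol 6 = TCP, 17 = UDP.             */
/* The last four entries keep HTTP, HTTPS, DNS and NTP traffic from the  */
/* members' physical addresses from being hidden behind the cluster VIP. */
no_hide_services_ports = { <4500, 17>, <500, 17>, <259, 17>, <1701, 17>,
                           <80, 6>, <443, 6>, <53, 17>, <123, 17> };
```

A policy install is needed after the edit for the change to reach the gateways, as the original post describes.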


This is a virtual deployment on VMware; perhaps I was impatient with it, as I did reboot the passive member and maybe that caused more problems. This is the second time I've seen ClusterXL have a wobble when something is changed at OS level.


I'm probably being punished for going straight to R80.40... but other than these couple of niggles it's been great, so fingers crossed we won't see anything further.


Thanks for taking the time to reply


PS: to add, I've failed over and failed back a couple of times this morning with no issues.


Hey Jon,

Just wondering whether this issue ever reoccurred for you and if so whether you managed to resolve it? 

I'm facing a very similar issue in that whenever I reboot a cluster member that member then comes up in INIT state and can stay like that for over an hour before it goes active:  

    Cluster Mode: High Availability (Primary Up) with IGMP Membership

    ID Unique Address Assigned Load State Name

    1 (local) 0% INIT fw-01
    2 100% ACTIVE fw-02

The other member shows that connectivity is lost during this time:

    Cluster Mode: High Availability (Primary Up) with IGMP Membership

    ID Unique Address Assigned Load State Name

    1 0% LOST fw-01
    2 (local) 100% ACTIVE fw-02

And if I reboot them both, they'll just sit there for ages unable to see each other:

    Cluster Mode: High Availability (Primary Up) with IGMP Membership

    ID Unique Address Assigned Load State Name

    1 (local) 0% INIT fw-01

    Cluster Mode: High Availability (Primary Up) with IGMP Membership

    ID Unique Address Assigned Load State Name

    2 (local) 0% INIT fw-02

I've checked connectivity on all interfaces between the members and all seems fine.

The strangest thing is that once the cluster has formed, if I gracefully take the members offline with "clusterXL_admin down/up", everything works absolutely perfectly. I've failed across like this many times with no issue, but if a host is rebooted or suffers a power failure, we are back to the original issue.
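For anyone unfamiliar with the graceful method mentioned above: clusterXL_admin registers or clears an administrative pnote, so the failover is a clean handover rather than a hard cut. Roughly:

```
# On the member you want to take offline (expert mode):
clusterXL_admin down    # registers the admin_down pnote; the peer takes over
cphaprob stat           # verify the peer is now ACTIVE and this member is DOWN

# When ready to bring it back:
clusterXL_admin up      # clears the pnote; the member returns to standby
```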

We are also running on ESX (with Promiscuous mode, MAC Address Changes and Forged Transmits enabled on each connected port group).

Any advice you or anybody else could share would be greatly appreciated.


Just in case anyone else happens to experience the issue I was having: I think I've finally worked it out, so I'm dropping this here in case it saves someone else a big headache!

Symptoms were:

  • After a cluster member reboot, the cluster wouldn't converge for approximately an hour, with the rebooted member stuck in INIT state.
  • Once the cluster finally did converge, operation was flawless and we could perform manual failovers with clusterXL_admin up/down without any issues.
  • If a member was ever rebooted (or suffered a power issue), we were back to waiting an hour until the cluster came back online.

I triple-checked the ESX port group settings and enabled MAC learning on the DVS switches, as well as verifying NTP was working correctly, but still no luck; it would still take approximately an hour for the cluster to come up.

Then I randomly noticed, when looking at the monitoring for the cluster on the Gateways and Cluster tab, that when a member had just been booted it reported "uptime" as a negative figure (around -3600 seconds). So I rechecked the date/time and timezones on the cluster members, and they were all correct.

I then noticed that the time on my ESX host was an hour out... fixed it and rebooted the two guest cluster members, and everything began working perfectly. I'm still a bit stumped as to why this should make any difference, as GAiA was reporting the time correctly and I thought the guest was abstracted from the host hypervisor, but it works!
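As a rough sanity check on the numbers: a guest "uptime" of about -3600 seconds is exactly what you'd expect if the host clock were running one hour ahead of the guest. A throwaway sketch (the -3600 value is just the figure observed above):

```shell
#!/bin/sh
# Hypothetical back-of-envelope check: a negative reported uptime of N seconds
# suggests the host clock is ahead of the guest by roughly N seconds.
reported_uptime=-3600                    # seconds, as seen on the monitoring tab
offset=$(( -reported_uptime / 3600 ))    # convert to whole hours
echo "host clock appears ~${offset} hour(s) ahead"
```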

TL;DR: save yourself a headache and make sure the time on your ESX hosts is correct if attempting to use ClusterXL on CloudGuard IaaS.