ClusterXL Down

Matlu · ‎2024-01-18

Hello,

I currently have a 3-member ClusterXL HA cluster.
A few days ago, one of the members that was in "Standby" status went to "DOWN" status.

-------------------------------------------------------------------------------------------------------------------------------------

[Expert@fw2:0]# cphaprob show_failover

Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Interface Mgmt is down (Cluster Control Protocol packets are not received)
Event time: Sat Jan 13 08:30:25 2024

Cluster failover count:
Failover counter: 139
Time of counter reset: Fri Jul 28 09:33:23 2023 (reboot)

Cluster failover history (last 20 failovers since reboot/reset on Fri Jul 28 09:33:50 2023):

No.  Time:                     Transition:           CPU:  Reason:
---  ------------------------  --------------------  ----  -------
1    Sat Jan 13 08:30:25 2024  Member 1 -> Member 2  06    Interface Mgmt is down (Cluster Control Protocol packets are not received)
2    Thu Jan 11 21:23:41 2024  Member 3 -> Member 1  14    Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)

------------------------------------------------------------------------------------------------------------------------------------

[Expert@fw2:0]# ethtool Mgmt
Settings for Mgmt:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: on (auto)
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000007 (7)
                           drv probe link
    Link detected: yes

-------------------------------------------------------------------------------------------------------------------------------------

What I have found is that the diagnostic commands report the "Mgmt" interface of the box as "Down", but the interface looks normal both physically and logically (it is up and has link).

The "ethtool Mgmt" output also shows that the box detects the connected cable.

Could this error be caused by the device at the other end of the cable connected to the Mgmt port (a switch or some other equipment)?

Greetings.
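
A note on the question above: "Link detected: yes" only confirms carrier on the port, while the failover reason complains that Cluster Control Protocol packets are not received, so the two can disagree when something beyond the cable (for example the switch port) is at fault. The following is a minimal sketch, not an official procedure. The helper name `link_detected` is hypothetical, and the commented tcpdump line assumes CCP is running on its default UDP port 8116.

```shell
#!/usr/bin/env bash
# Report the physical link state from `ethtool <iface>` output read on stdin.
# Prints the value of the "Link detected" field, or "unknown" if it is absent.
# This reflects carrier only (layer 1); it says nothing about CCP delivery.
link_detected() {
    awk -F': *' '
        /Link detected:/ { print $2; found = 1 }
        END { if (!found) print "unknown" }'
}

# Demo on a trimmed copy of the ethtool output posted above.
printf 'Settings for Mgmt:\nSpeed: 1000Mb/s\nLink detected: yes\n' | link_detected

# On the gateway itself (not executed here):
#   ethtool Mgmt | link_detected
#   tcpdump -nni Mgmt udp port 8116   # are CCP packets from the peers arriving?
```

If the link is reported as up but no CCP packets from the other members show up in the capture, the device on the far side of the Mgmt cable becomes a reasonable suspect.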

the_rock · ‎2024-01-18

Please send the output of the following commands from that member:

cphaprob roles

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat

Best,
Andy

Matlu · ‎2024-01-18

Hello,

I share the result of the diagnostic commands.

Thank you for your comments.

the_rock · ‎2024-01-18

Yeah, there is definitely something wrong with the Mgmt interface. Can you confirm that "Get Interfaces Without Topology" on the cluster object in SmartConsole works and does not return any errors?

Best,
Andy

Matlu · ‎2024-01-18

I tried it, and I got the following error message.

Does this make the Firewall responsible for the error?

the_rock · ‎2024-01-18

What does the SIC status show?

Best,
Andy

Matlu · ‎2024-01-18

This is what I see in the SIC communication status.

It differs from my other 2 gateways, which work fine and show "Communicating" under "Test SIC Status".

the_rock · ‎2024-01-18

That's your issue then. You can reset SIC without actually having to do cpstop; cpstart, which would load the initial policy anyway after a SIC reset.

https://korkutozcan.com/how-to-reset-sic-without-restarting-check-point-gw/

Best,
Andy

Matlu · ‎2024-01-18

Buddy,

Isn't this type of alert due to a connectivity problem?

Thanks.

the_rock · ‎2024-01-18

yes sir

Best,
Andy

Matlu · ‎2024-01-18

Hey,

I followed the steps in the URL, but I get the following error.

Do you think I should validate something else?

I have already reset SIC from the gateway CLI, and also on the affected ("corrupted") firewall object in SmartConsole.

the_rock · ‎2024-01-18

You need to find out why it fails: check routes, ping, traceroute, and do some packet captures. It appears basic connectivity is not there if even SIC can't be established, and SIC is an absolute must for policy installation to work.

Best,
Andy
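
As one way to start on the "check routes" step, here is a small sketch that summarises what `ip route get <destination>` returns, so the outgoing interface and source address can be compared between a healthy member and the failing one. The function name `route_summary` and the 192.0.2.x addresses are hypothetical placeholders, not values from this cluster.

```shell
#!/usr/bin/env bash
# Summarise `ip route get <dst>` output read on stdin as
# "dev=<interface> src=<source ip> via=<gateway or 'direct'>".
route_summary() {
    awk '
        {
            # Walk the tokens and pick up the value following each keyword.
            for (i = 1; i < NF; i++) {
                if ($i == "dev") dev = $(i + 1)
                if ($i == "src") src = $(i + 1)
                if ($i == "via") via = $(i + 1)
            }
        }
        END {
            if (via == "") via = "direct"
            printf "dev=%s src=%s via=%s\n", dev, src, via
        }'
}

# Demo with a made-up line in the format `ip route get` prints.
echo '192.0.2.10 dev Mgmt src 192.0.2.2' | route_summary

# On the gateway itself (not executed here), with the management server IP:
#   ip route get <management-ip> | route_summary
```

If all members report the same interface and a directly connected route, routing is unlikely to be the difference between them, and attention can move on to ping and captures.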

Matlu · ‎2024-01-18

My ClusterXL HA cluster has 3 members.

I think the problem is with the switch that the management interfaces of all the boxes are connected to.

Is it advisable to check the device that my failed box is connected to?

---------------------------------------------------------------------------------

ACTIVE FW

[Expert@fw1:0]# ping 172.16.113.44
PING 172.16.113.44 (172.16.113.44) 56(84) bytes of data.
64 bytes from 172.16.113.44: icmp_seq=1 ttl=64 time=0.491 ms
64 bytes from 172.16.113.44: icmp_seq=2 ttl=64 time=0.176 ms

[Expert@fw1:0]# ip r g 172.16.113.44
172.16.113.44 dev Mgmt src 172.16.113.2
cache
[Expert@fw1:0]#
[Expert@fw1:0]# traceroute 172.16.113.44
traceroute to 172.16.113.44 (172.16.113.44), 30 hops max, 40 byte packets
1 172.16.113.44 (172.16.113.44) 0.634 ms 0.648 ms 0.731 ms
[Expert@fw1:0]#

---------------------------------------------------------------------------------

1st FW STANDBY

[Expert@fw3:0]# ping 172.16.113.44
PING 172.16.113.44 (172.16.113.44) 56(84) bytes of data.
64 bytes from 172.16.113.44: icmp_seq=2 ttl=64 time=0.970 ms
64 bytes from 172.16.113.44: icmp_seq=3 ttl=64 time=0.523 ms

[Expert@fw3:0]# ip r g 172.16.113.44
172.16.113.44 dev Mgmt src 172.16.113.4
cache
[Expert@fw3:0]#
[Expert@fw3:0]# ip r g 172.16.113.44
172.16.113.44 dev Mgmt src 172.16.113.4
cache
[Expert@fw3:0]#

---------------------------------------------------------------------------------

2nd FW STANDBY (This is the one that is failing)

[Expert@fw2:0]# ping 172.16.113.44
PING 172.16.113.44 (172.16.113.44) 56(84) bytes of data.
From 172.16.113.3 icmp_seq=20 Destination Host Unreachable
From 172.16.113.3 icmp_seq=21 Destination Host Unreachable

[Expert@fw2:0]# ip r g 172.16.113.44
172.16.113.44 dev Mgmt src 172.16.113.3
cache

[Expert@fw2:0]# traceroute 172.16.113.44
traceroute to 172.16.113.44 (172.16.113.44), 30 hops max, 40 byte packets
1 * * *
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *

Thanks. 🙂
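
One detail in the fw2 output is worth reading closely: the "Destination Host Unreachable" lines come "From 172.16.113.3", which is fw2's own Mgmt address. On Linux that usually means the host itself gave up because ARP for 172.16.113.44 got no answer, which would be consistent with a layer 2 problem (switch port, VLAN, cabling path) rather than routing. The sketch below classifies ping output along those lines; the function name `classify_ping` and its result labels are hypothetical, and the interpretation is an assumption to verify, not a diagnosis.

```shell
#!/usr/bin/env bash
# Classify `ping` output read on stdin. $1 is the local source address.
#   reachable               at least one echo reply was received
#   unreachable-local-arp   the error was generated by this host itself
#   unreachable-via-router  another device reported the host unreachable
#   no-reply                neither replies nor ICMP errors were seen
classify_ping() {
    awk -v self="$1" '
        /bytes from/ { ok = 1 }
        /Destination Host Unreachable/ {
            # Line format: "From <reporter ip> icmp_seq=N Destination Host Unreachable"
            if ($2 == self) local_fail = 1
            else            remote_fail = 1
        }
        END {
            if (ok)               print "reachable"
            else if (local_fail)  print "unreachable-local-arp"
            else if (remote_fail) print "unreachable-via-router"
            else                  print "no-reply"
        }'
}

# Demo on the fw2 lines posted above.
classify_ping 172.16.113.3 <<'EOF'
PING 172.16.113.44 (172.16.113.44) 56(84) bytes of data.
From 172.16.113.3 icmp_seq=20 Destination Host Unreachable
From 172.16.113.3 icmp_seq=21 Destination Host Unreachable
EOF

# On fw2 itself (not executed here), the neighbour table can confirm this:
#   ip neigh show dev Mgmt    # look for FAILED or INCOMPLETE entries
```

Since fw1 and fw3 reach the same address over the same subnet, a locally generated unreachable on fw2 would point toward the switch port or VLAN that fw2's Mgmt interface is plugged into.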

the_rock · ‎2024-01-18

It sort of goes without saying: work by process of elimination, i.e. check whatever equipment is "in the picture".

Best,
Andy

emmap · ‎2024-01-18

For future reference, I would always recommend troubleshooting connectivity before going straight to resetting SIC. If SIC was established and you then hit a connectivity problem, resetting SIC leaves you with both a connectivity problem and no SIC.

the_rock · ‎2024-01-18

For sure, 100%. Personally, that's what I always do when people have such an issue.

Best,
Andy
