arjanh
Explorer

VSX cluster problems

Gentlepersons,

We have two 15400 appliances running R80.40 JHF Take 131, forming a VSX cluster. Due to log4j, it was decided to turn on IPS. As that was not feasible with 8 GB of RAM, we took the memory out of two other disused 15400s, upgraded one gateway to 24 GB, and ordered 16 GB for the other gateway. When the upgraded gateway was up and running, we switched all active virtual systems over to it. We did encounter some problems in that setup, but we attributed those to one member only having 8 GB.

When the second gateway was upgraded to 24GB, we found that the members do not form a cluster:


[Expert@LOC2FWL002:0]# cphaprob stat

Cluster Mode: VSX High Availability (Primary Up) with IGMP Membership

ID         Unique Address   Assigned Load   State          Name

1 (local)  10.X.Y.4         100%            ACTIVE         LOC2FWL002
2          10.X.Y.3         0%              DOWN           LOC1FWL002


Active PNOTEs: None

Last member state change event:
Event Code: CLUS-114904
State change: ACTIVE(!) -> ACTIVE
Reason for state change: Reason for ACTIVE! alert has been resolved
Event time: Sat Mar 26 13:30:22 2022

Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: VSX PNOTE
Event time: Wed Mar 23 12:45:26 2022

Cluster failover count:
Failover counter: 121
Time of counter reset: Sat Jun 19 10:27:48 2021 (reboot)


[Expert@LOC2FWL002:0]#

[Expert@LOC1FWL002:0]# cphaprob stat

Cluster Mode: Virtual System Load Sharing (Primary Up)

ID         Unique Address   Assigned Load   State          Name

1          10.X.Y.4         100%            ACTIVE         LOC2FWL002
2 (local)  10.X.Y.3         0%              DOWN           LOC1FWL002


Active PNOTEs: IAC

Last member state change event:
Event Code: CLUS-110400
State change: INIT -> DOWN
Reason for state change: Sync interface is down
Event time: Fri Mar 25 13:28:58 2022

Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: VSX PNOTE
Event time: Wed Mar 23 12:45:26 2022

Cluster failover count:
Failover counter: 121
Time of counter reset: Sat Jun 19 10:27:48 2021 (reboot)


Cluster name: LOCFW001

Virtual Devices Status on each Cluster Member
=============================================

ID | Weight| LOC2FWL002| LOC1FWL002
| | | [local]
-------+-------+-----------+-----------
2 | 10 | ACTIVE | DOWN
3 | 10 | ACTIVE(!) | DOWN
4 | 10 | ACTIVE(!) | DOWN
5 | 10 | ACTIVE | DOWN
6 | 10 | ACTIVE | DOWN
7 | 10 | ACTIVE(!) | INIT
8 | 10 | ACTIVE | INIT
---------------+-----------+-----------
Active | 7 | 0
Weight | 70 | 0
Weight (%) | 100 | 0

Legend: Init - Initializing, Active! - Active Attention
Down! - ClusterXL Inactive or Virtual System is Down

[Expert@LOC1FWL002:0]#

LOC2FWL002 shows the following in /var/log/messages:

Mar 28 11:44:52 2022 LOC2FWL002 fwk: CLUS-110805-1: State change: ACTIVE -> ACTIVE(!) | Reason: Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)
Mar 28 11:44:53 2022 LOC2FWL002 fwk: CLUS-110805-1: State change: ACTIVE -> ACTIVE(!) | Reason: Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)
Mar 28 11:44:57 2022 LOC2FWL002 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Mar 28 11:44:58 2022 LOC2FWL002 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved

Nothing is seen on LOC1FWL002. However, we can see the following in SmartConsole:

Time: 2022-03-28T09:32:57Z
Cluster Information: (ClusterXL) member 2 (10.X.Y.3) is down (Interface Active Check on member 2 (10.X.Y.3) detected a problem (bond0.105 interface is down, 3 interfaces required, only 2 up).).
Type: Control
Policy Name: OTA_Core_Network
Policy Management: locfwm001
Policy Date: 2022-03-25T12:43:52Z
Blade: Firewall
Origin: LOC1FWL002_A0FWL001
Product Family: Access

We see similar messages from a few other virtual systems.

To be sure: bond0 and all its sub-interfaces are up all the time, and 10.X.Y.3 is pingable from 10.X.Y.4 and vice versa.
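For anyone comparing notes: the per-virtual-system view of the cluster interfaces can be checked on both members with the commands below. This is a hedged sketch, not an exact transcript; `vsenv` and `cphaprob -a if` are standard Check Point CLI, and the VS ID used here is just an example taken from this thread.

```shell
# On each member, switch into the context of an affected VS (e.g. VS 3)
vsenv 3

# List the cluster interfaces this member has registered for that VS,
# including the sync interface and its reported UP/DOWN state
cphaprob -a if

# Back in VS 0, check the kernel's own view of the bond carrying the
# sync VLAN (slave states, link status)
cat /proc/net/bonding/bond0
```

If the interface lists differ between members (one side expecting three required interfaces while the other reports only two up), that would be consistent with the CLUS-110805 "fewer cluster interfaces configured" message above.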

Has anyone seen this? If so, how was it solved?

suneelsharma
Explorer

Generally, appliances in a cluster should be configured with the same RAM, interfaces, etc. An identical number of processor cores is critical for ClusterXL; RAM, however, is not a critical factor.

Only these are critical:

- same Gaia OS version
- same CCP protocol (multicast/broadcast)
- same operating system (32-bit/64-bit)
- same number of IPv4 core instances
- same number of IPv6 core instances
- a license (full license or HA)

To me it looks like an issue with the interfaces, so please recheck them on both members and make sure everything is fine with them.

Kaspars_Zibarts
Authority

There seems to be a mix-up in clustering mode between the members! Look at the first line of the cphaprob stat output:

LOC2FWL002: Cluster Mode: VSX High Availability (Primary Up) with IGMP Membership

LOC1FWL002: Cluster Mode: Virtual System Load Sharing (Primary Up)

It's hard to say how you ended up in that situation, so to be safe I would probably proceed with vsx_util reconfigure on the faulty cluster member.
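For reference, `vsx_util reconfigure` is run on the management server, not on the gateway, and is typically used to push the full VSX configuration back to a member (e.g. after a clean install). A rough outline follows; the interactive prompts vary by version, so treat this as a sketch rather than an exact transcript.

```shell
# On the management server, in Expert mode.
# Take a management database backup before starting.

# Launch the interactive wizard; it prompts for the management server
# IP, an administrator name and password, the VSX cluster object, and
# the member to rebuild.
vsx_util reconfigure

# Afterwards, on the reconfigured gateway, reinstall policy on all
# virtual systems and re-check cluster state:
cphaprob stat
```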

Chris_Atkinson
Employee

Very nice catch 👍

0 Kudos