Gentlepersons,
We have two 15400 appliances running R80.40 JHF Take 131, forming a VSX cluster. Due to log4j, it was decided to enable IPS. As that was not feasible with 8 GB of RAM, we took the memory out of two other disused 15400 appliances, upgraded one gateway to 24 GB, and ordered 16 GB for the other gateway. When the upgraded gateway was up and running, we switched all active virtual systems over to it. We did encounter some problems in that setup, but we attributed them to one member having only 8 GB.
After the second gateway had also been upgraded to 24 GB, we found that the members no longer form a cluster:
[Expert@LOC2FWL002:0]# cphaprob stat
Cluster Mode: VSX High Availability (Primary Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 (local) 10.X.Y.4 100% ACTIVE LOC2FWL002
2 10.X.Y.3 0% DOWN LOC1FWL002
Active PNOTEs: None
Last member state change event:
Event Code: CLUS-114904
State change: ACTIVE(!) -> ACTIVE
Reason for state change: Reason for ACTIVE! alert has been resolved
Event time: Sat Mar 26 13:30:22 2022
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: VSX PNOTE
Event time: Wed Mar 23 12:45:26 2022
Cluster failover count:
Failover counter: 121
Time of counter reset: Sat Jun 19 10:27:48 2021 (reboot)
[Expert@LOC2FWL002:0]#
[Expert@LOC1FWL002:0]# cphaprob stat
Cluster Mode: Virtual System Load Sharing (Primary Up)
ID Unique Address Assigned Load State Name
1 10.X.Y.4 100% ACTIVE LOC2FWL002
2 (local) 10.X.Y.3 0% DOWN LOC1FWL002
Active PNOTEs: IAC
Last member state change event:
Event Code: CLUS-110400
State change: INIT -> DOWN
Reason for state change: Sync interface is down
Event time: Fri Mar 25 13:28:58 2022
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: VSX PNOTE
Event time: Wed Mar 23 12:45:26 2022
Cluster failover count:
Failover counter: 121
Time of counter reset: Sat Jun 19 10:27:48 2021 (reboot)
Cluster name: LOCFW001
Virtual Devices Status on each Cluster Member
=============================================
ID | Weight| LOC2FWL002| LOC1FWL002
| | | [local]
-------+-------+-----------+-----------
2 | 10 | ACTIVE | DOWN
3 | 10 | ACTIVE(!) | DOWN
4 | 10 | ACTIVE(!) | DOWN
5 | 10 | ACTIVE | DOWN
6 | 10 | ACTIVE | DOWN
7 | 10 | ACTIVE(!) | INIT
8 | 10 | ACTIVE | INIT
---------------+-----------+-----------
Active | 7 | 0
Weight | 70 | 0
Weight (%) | 100 | 0
Legend: Init - Initializing, Active! - Active Attention
Down! - ClusterXL Inactive or Virtual System is Down
[Expert@LOC1FWL002:0]#
LOC2FWL002 shows the following in /var/log/messages:
Mar 28 11:44:52 2022 LOC2FWL002 fwk: CLUS-110805-1: State change: ACTIVE -> ACTIVE(!) | Reason: Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)
Mar 28 11:44:53 2022 LOC2FWL002 fwk: CLUS-110805-1: State change: ACTIVE -> ACTIVE(!) | Reason: Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)
Mar 28 11:44:57 2022 LOC2FWL002 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Mar 28 11:44:58 2022 LOC2FWL002 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
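In case it helps anyone reproduce this: the "fewer cluster interfaces configured" warning can be cross-checked by comparing what ClusterXL reports on each member. A minimal sketch, using the standard cphaprob commands (VS ID 2 is just an example; run it for each affected Virtual System on both members and compare the counts):

```shell
# Switch to the context of an affected Virtual System (ID 2 as an example)
vsenv 2

# Lists the monitored cluster interfaces together with the
# "Required interfaces" / "Required secured interfaces" counts;
# these counts must match on both cluster members
cphaprob -a if

# Shows the active PNOTEs on this member
# (IAC = Interface Active Check, the PNOTE we see on LOC1FWL002)
cphaprob list
```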
Nothing is seen on LOC1FWL002. However, we can see the following in SmartConsole:
Time: 2022-03-28T09:32:57Z
Cluster Information: (ClusterXL) member 2 (10.X.Y.3) is down (Interface Active Check on member 2 (10.X.Y.3) detected a problem (bond0.105 interface is down, 3 interfaces required, only 2 up).).
Type: Control
Policy Name: OTA_Core_Network
Policy Management: locfwm001
Policy Date: 2022-03-25T12:43:52Z
Blade: Firewall
Origin: LOC1FWL002_A0FWL001
Product Family: Access
We see similar messages from a few other virtual systems.
To be clear, bond0 and all its sub-interfaces are up the whole time, and 10.X.Y.3 is pingable from 10.X.Y.4 and vice versa.
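For reference, we verified this along these lines (standard Gaia/Linux commands; bond0.105 is the VLAN sub-interface named in the IAC message above):

```shell
# Bond driver view: slave interfaces, their link state, and failure counters
cat /proc/net/bonding/bond0

# OS-level link state of the VLAN sub-interface flagged by the IAC PNOTE
ip link show bond0.105

# Reachability of the peer member's address
ping -c 3 10.X.Y.3
```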
Has anyone seen this? If so, how was it solved?