Solved: Mismatch in the number of CoreXL FW instances on new 3800 cluster, but no mismatch can be found.

Dan_Currens · 2022-01-13

Two 3800 gateways are running in a cluster on R80.40 with Jumbo Hotfix Take 125.

On node "A" of the cluster, /var/log/messages is full of:

fwk: CLUS-114802-2: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 1)
fwk: CLUS-113900-2: State change: STANDBY -> DOWN | Reason: Mismatch in the number of CoreXL FW instances has been detected

The cluster will not failover to this node.

Tried forcing the number of CoreXL instances to 6 in cpconfig and rebooting.

Tried removing the nodes from the cluster and re-adding them.

Any suggestions?

The cluster does show as healthy in cphaprob stat:

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load  State    Name
1          192.168.142.6   100%           ACTIVE   windICSfw1B
2 (local)  192.168.142.5   0%             STANDBY  windICSfw1A

Active PNOTEs: None

Last member state change event:
Event Code: CLUS-114802
State change: DOWN -> STANDBY
Reason for state change: There is already an ACTIVE member in the cluster (member 1)
Event time: Thu Jan 13 09:36:39 2022

Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: Mismatch in the number of CoreXL FW instances has been detected
Event time: Tue Jan 11 14:19:18 2022

Cluster failover count:
Failover counter: 3007
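To keep an eye on a flapping cluster, a small parser for the cphaprob stat text can confirm that exactly one member is ACTIVE and pull out the failover counter; taking two samples a few minutes apart then shows whether the counter is still climbing. This is a minimal Python sketch that assumes the whitespace-separated layout shown above. The function and variable names are hypothetical, and real output may differ between versions.

```python
import re

# Sample copied from the cphaprob stat output above. Assumption: member rows
# are whitespace-separated exactly as they appear in the post.
CPHAPROB_STAT = """\
ID Unique Address Assigned Load State Name
1 192.168.142.6 100% ACTIVE windICSfw1B
2 (local) 192.168.142.5 0% STANDBY windICSfw1A
Cluster failover count:
Failover counter: 3007
"""

# One member row: ID, optional "(local)" marker, IP, load %, state, name.
ROW = re.compile(
    r"^(?P<id>\d+)\s+(?P<local>\(local\)\s+)?(?P<ip>\d+(?:\.\d+){3})\s+"
    r"(?P<load>\d+)%\s+(?P<state>[A-Z]+)\s+(?P<name>\S+)$"
)
COUNTER = re.compile(r"^Failover counter:\s*(\d+)$")


def parse_cphaprob_stat(text):
    """Return (members, failover_count) parsed from cphaprob stat text."""
    members = []
    failovers = None
    for line in text.splitlines():
        line = line.strip()
        row = ROW.match(line)
        if row:
            members.append({
                "id": int(row["id"]),
                "local": row["local"] is not None,
                "ip": row["ip"],
                "load": int(row["load"]),
                "state": row["state"],
                "name": row["name"],
            })
            continue
        counter = COUNTER.match(line)
        if counter:
            failovers = int(counter.group(1))
    return members, failovers


members, failovers = parse_cphaprob_stat(CPHAPROB_STAT)
active = [m["name"] for m in members if m["state"] == "ACTIVE"]
print("active members:", active)   # expect exactly one name here
print("failover counter:", failovers)
```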

[Expert@windICSfw1A:0]# grep -c ^processor /proc/cpuinfo
8

[Expert@windICSfw1B:0]# grep -c ^processor /proc/cpuinfo
8

[Expert@windICSfw1A:0]# fw ctl affinity -l -r -a
CPU 0: Mgmt
CPU 1: fw_5
mpdaemon fwd cprid lpd pepd rad rtmd wsdnsd in.asessiond vpnd usrchkd in.acapd cprid cpd
CPU 2: fw_3
mpdaemon fwd cprid lpd pepd rad rtmd wsdnsd in.asessiond vpnd usrchkd in.acapd cprid cpd
CPU 3: fw_1
mpdaemon fwd cprid lpd pepd rad rtmd wsdnsd in.asessiond vpnd usrchkd in.acapd cprid cpd
CPU 4:
CPU 5: fw_4
mpdaemon fwd cprid lpd pepd rad rtmd wsdnsd in.asessiond vpnd usrchkd in.acapd cprid cpd
CPU 6: fw_2
mpdaemon fwd cprid lpd pepd rad rtmd wsdnsd in.asessiond vpnd usrchkd in.acapd cprid cpd
CPU 7: fw_0
mpdaemon fwd cprid lpd pepd rad rtmd wsdnsd in.asessiond vpnd usrchkd in.acapd cprid cpd
All:
Interface eth5: has multi queue enabled
Interface eth1: has multi queue enabled
Interface eth2: has multi queue enabled

[Expert@windICSfw1B:0]# fw ctl affinity -l -r -a
CPU 0: Mgmt
CPU 1: fw_5
mpdaemon fwd cprid lpd vpnd rad rtmd wsdnsd pepd in.asessiond usrchkd in.acapd cprid cpd
CPU 2: fw_3
mpdaemon fwd cprid lpd vpnd rad rtmd wsdnsd pepd in.asessiond usrchkd in.acapd cprid cpd
CPU 3: fw_1
mpdaemon fwd cprid lpd vpnd rad rtmd wsdnsd pepd in.asessiond usrchkd in.acapd cprid cpd
CPU 4:
CPU 5: fw_4
mpdaemon fwd cprid lpd vpnd rad rtmd wsdnsd pepd in.asessiond usrchkd in.acapd cprid cpd
CPU 6: fw_2
mpdaemon fwd cprid lpd vpnd rad rtmd wsdnsd pepd in.asessiond usrchkd in.acapd cprid cpd
CPU 7: fw_0
mpdaemon fwd cprid lpd vpnd rad rtmd wsdnsd pepd in.asessiond usrchkd in.acapd cprid cpd
All:
Interface eth5: has multi queue enabled
Interface eth1: has multi queue enabled
Interface eth2: has multi queue enabled
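Comparing the two affinity listings by eye is error-prone, so here is a minimal Python sketch that extracts each fw_N instance and the CPU it is bound to from saved output, then compares the two members. It assumes the output was copied as text the way it is posted (the daemon lines are left out here, and the parser would skip them anyway). `fw_instances` is a hypothetical helper, not a Check Point tool.

```python
import re

# Abbreviated copies of the "fw ctl affinity -l -r -a" output posted above;
# daemon lines are omitted because only the fw_N entries matter here.
NODE_A = """\
CPU 0: Mgmt
CPU 1: fw_5
CPU 2: fw_3
CPU 3: fw_1
CPU 4:
CPU 5: fw_4
CPU 6: fw_2
CPU 7: fw_0
"""
NODE_B = """\
CPU 0: Mgmt
CPU 1: fw_5
CPU 2: fw_3
CPU 3: fw_1
CPU 4:
CPU 5: fw_4
CPU 6: fw_2
CPU 7: fw_0
"""

CPU_LINE = re.compile(r"^CPU (\d+):\s*(.*)$")
FW_TOKEN = re.compile(r"fw_\d+")


def fw_instances(affinity_text):
    """Map each fw_N instance name to the CPU number it is bound to."""
    mapping = {}
    for line in affinity_text.splitlines():
        match = CPU_LINE.match(line.strip())
        if not match:
            continue  # skip daemon lists, interface lines, etc.
        cpu = int(match.group(1))
        for token in match.group(2).split():
            if FW_TOKEN.fullmatch(token):
                mapping[token] = cpu
    return mapping


a, b = fw_instances(NODE_A), fw_instances(NODE_B)
print("node A instances:", len(a), "node B instances:", len(b))
print("same CPU layout:", a == b)
```

On the posted output both members report six instances on the same CPUs, which is consistent with the replies below saying the CoreXL split itself is not the problem.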

The license looks correct: CPSG-C-8-U.

HeikoAnkenbrand · 2022-01-13

Hi @Dan_Currens

You also see the message "CLUS-114802 There is already an ACTIVE member in the cluster (member X)", but that one should be OK.

I would do the following:

1) Reboot both gateways

2) Check whether Dynamic Split is enabled:
     # dynamic_split -p
    I have seen various errors with it enabled as well. If necessary, I would disable Dynamic Split.

3) Install the latest R80.40 Jumbo Hotfix (Take 139)

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

Timothy_Hall · 2022-01-14

This would appear to be some kind of bug, since your CoreXL split is identical on both members. However, do you have another, separate ClusterXL cluster running R80.10 or earlier on the same VLAN(s) as the problematic cluster? I'm wondering whether it could be a Magic MAC issue caused by the older cluster, as Automatic MAC Magic was not introduced until R80.20. Make sure that Automatic MAC Magic is enabled on all of your clusters that support it, using this clish command: show cluster mmagic
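The hypothesis above boils down to an inventory check: any VLAN shared by two or more clusters, where at least one of them runs a version older than R80.20, is a candidate for a MAC Magic conflict. This Python sketch encodes that rule over a hypothetical inventory (the cluster names, versions and VLAN IDs are made up for illustration). It does not query any gateway, so checking each cluster as described above is still needed.

```python
import re
from collections import defaultdict

# Hypothetical inventory: names, versions and VLAN IDs are illustrative only.
CLUSTERS = [
    {"name": "new-3800-cluster", "version": "R80.40", "vlans": {10, 20}},
    {"name": "old-cluster", "version": "R80.10", "vlans": {10}},
    {"name": "other-cluster", "version": "R81.10", "vlans": {30}},
]

# Per the reply above, Automatic MAC Magic was introduced in R80.20.
AUTO_MMAGIC_SINCE = (80, 20)


def parse_version(version):
    """Turn a string like 'R80.40' into a comparable tuple such as (80, 40)."""
    match = re.fullmatch(r"R(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognised version: {version}")
    return int(match.group(1)), int(match.group(2))


def risky_vlans(clusters):
    """Return {vlan: [cluster names]} for VLANs shared by two or more
    clusters where at least one predates Automatic MAC Magic."""
    by_vlan = defaultdict(list)
    for cluster in clusters:
        for vlan in cluster["vlans"]:
            by_vlan[vlan].append(cluster)
    result = {}
    for vlan, members in by_vlan.items():
        has_legacy = any(
            parse_version(c["version"]) < AUTO_MMAGIC_SINCE for c in members
        )
        if len(members) > 1 and has_legacy:
            result[vlan] = sorted(c["name"] for c in members)
    return result


print(risky_vlans(CLUSTERS))  # flags VLAN 10 in this made-up inventory
```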

Also:

Cluster failover count:
Failover counter: 3007

Your cluster is flapping like crazy; TAC should probably be engaged here to see what is happening.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

nooni · 2022-01-13

Hi, I had a similar issue, but in my case it was caused by using the wrong license.

However, like you, I tried changing the CoreXL setting in cpconfig, but it did not help. I ended up disabling CoreXL, rebooting, and then enabling it again; after that it worked.

AmitShmuel · 2022-01-16

Dynamic Balancing does not change the number of loaded FW instances; it only activates or deactivates them as needed, which does not create a mismatch.

It does require the maximum number of FW instances to be loaded, but this should be configured on all cluster members.
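To illustrate the distinction drawn above, here is a toy Python model: each member loads a fixed number of FW instances, rebalancing only changes which of them are active, and the mismatch test is assumed to compare loaded counts. The `Member` class and `mismatch` function are hypothetical and say nothing about how the gateway actually implements Dynamic Balancing.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Toy model of one cluster member (hypothetical, for illustration)."""
    name: str
    loaded: int                               # FW instances loaded at boot
    active: set = field(default_factory=set)  # indices currently active

    def __post_init__(self):
        # Start with every loaded instance active.
        self.active = set(range(self.loaded))

    def rebalance(self, wanted_active):
        """Activate/deactivate instances; never loads or unloads any."""
        wanted_active = max(1, min(wanted_active, self.loaded))
        self.active = set(range(wanted_active))


def mismatch(a, b):
    # Assumption: the cluster check compares loaded instances, not active ones.
    return a.loaded != b.loaded


a = Member("A", loaded=6)
b = Member("B", loaded=6)
a.rebalance(4)  # rebalancing on member A only
print(len(a.active), len(b.active), mismatch(a, b))
```

In this model the active counts differ (4 versus 6) while the loaded counts stay equal, so no mismatch is reported; only members configured with different loaded counts would trip the check.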

_Val_ · 2022-01-14

Please open a TAC case for this.

Dan_Currens · 2022-01-14

That was it: there was another old cluster in the same VLAN. Thank you for your reply.

Mismatch in the number of CoreXL FW instances on new 3800 cluster, but no mismatch can be found.