Solved: Cluster inconsistency after policy installation

MasterChief117 · ‎2022-07-14

Hello guys

We have noticed that, after upgrading from R80.30 to R81.10, two of our clusters behave erratically after policy installation. The primary remains active(F) while believing the secondary is in standby, whereas the secondary shows itself as active and the primary as Lost. This goes on for 4-5 minutes until both members converge on the correct state. cphaprob says that no CCP packets are sent on the Sync interface, but tcpdump says otherwise.

I noticed that the output of the following parameters differs between the two members:

[Expert@GW-01:0]# fw ctl get int fwha_mac_magic

fwha_mac_magic = 254

[Expert@GW-01:0]# fw ctl get int fwha_mac_forward_magic

fwha_mac_forward_magic = 253

[Expert@GW-02:0]# fw ctl get int fwha_mac_magic

fwha_mac_magic = 1

[Expert@GW-02:0]# fw ctl get int fwha_mac_forward_magic

fwha_mac_forward_magic = 254

Are these still relevant in R81.10? I have a case open with TAC, but it has been lagging, so any opinion would be helpful.
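
For anyone wanting to run the same comparison across members, here is a minimal sketch in Python. It assumes the `fw ctl get int` output from each member has been pasted or saved as text; the function names `parse_kernel_ints` and `diff_members` are made up for illustration and are not part of any Check Point tool.

```python
import re

def parse_kernel_ints(text):
    """Collect 'name = value' lines as printed by `fw ctl get int`."""
    values = {}
    for line in text.splitlines():
        # Prompt lines such as "[Expert@GW-01:0]# ..." do not match and are skipped.
        match = re.match(r"^\s*(\w+)\s*=\s*(-?\d+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values

def diff_members(first, second):
    """Return {name: (first_value, second_value)} for parameters that differ."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n))
            for n in names if first.get(n) != second.get(n)}

# Values copied from the post above.
gw01 = parse_kernel_ints("""
[Expert@GW-01:0]# fw ctl get int fwha_mac_magic
fwha_mac_magic = 254
[Expert@GW-01:0]# fw ctl get int fwha_mac_forward_magic
fwha_mac_forward_magic = 253
""")
gw02 = parse_kernel_ints("""
[Expert@GW-02:0]# fw ctl get int fwha_mac_magic
fwha_mac_magic = 1
[Expert@GW-02:0]# fw ctl get int fwha_mac_forward_magic
fwha_mac_forward_magic = 254
""")

for name, (a, b) in diff_members(gw01, gw02).items():
    print(f"{name}: GW-01={a} GW-02={b}")
```

With the values from this post, both parameters are reported as different. The sketch only compares numbers; it says nothing about whether the parameters still matter in R81.10.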

MasterChief117 · ‎2022-07-19

Rebooting the standby member has solved the issue for now. I will monitor it for a while.

G_W_Albrecht · ‎2022-07-14

Your commands are outdated! Better look into the current CP_R81.10_ClusterXL_AdminGuide and here: sk42096: Cluster member is stuck in 'Ready' state

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

MasterChief117 · ‎2022-07-14

I did not see any mention of cluster ID or MAC magic in the new admin guide. I am afraid that this is something the gateways have inherited from previous versions, since they were upgraded in place. On another R81.10 cluster, where I did a clean install, the values are the same on both members.

Chris_Atkinson · ‎2022-07-14

What CCP mode does each member believe it is operating in?

Note the following was introduced starting from R80.40

* Support for Cluster Control Protocol (CCP) in Unicast mode for any number of cluster members eliminating the need for CCP Broadcast, Multicast or Automatic modes.

* Eliminated the need for MAC Magic configuration when several clusters are connected to the same subnet.

* Cluster Control Protocol encryption is now enabled by default.

CCSM R77/R80/ELITE

MasterChief117 · ‎2022-07-14

You're onto something. While the active member has CCP correctly set to unicast, the standby shows this:

CCP mode: Manual (Multicast)

Chris_Atkinson · ‎2022-07-14

They should ideally both be the same i.e. both auto or both unicast.

MasterChief117 · ‎2022-07-14

Any idea as to how I can change it?

Chris_Atkinson · ‎2022-07-14

To set the CCP mode:

In Gaia Clish, run:
set cluster ccp {auto | unicast | multicast | broadcast}
In Expert mode, run:
cphaconf set_ccp {auto | unicast | multicast | broadcast}

This configuration applies immediately and survives reboot.
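
To apply the same mode on both members without typos, the two command forms quoted above can be generated from one validated value. This is only a sketch in Python: the helper name `ccp_commands` is hypothetical, and it just builds the command strings; it does not run anything on a gateway.

```python
# The four modes listed in the commands above.
VALID_CCP_MODES = ("auto", "unicast", "multicast", "broadcast")

def ccp_commands(mode):
    """Return the Gaia Clish and Expert-mode command lines for one CCP mode."""
    if mode not in VALID_CCP_MODES:
        raise ValueError(f"unsupported CCP mode: {mode!r}")
    return {
        "clish": f"set cluster ccp {mode}",
        "expert": f"cphaconf set_ccp {mode}",
    }

for shell, command in ccp_commands("unicast").items():
    print(f"{shell}: {command}")
```

Run the resulting command on each member, then confirm with cphaprob -a if that both report the same CCP mode.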

the_rock · ‎2022-07-14

I checked a client's cluster, and both members have exactly the same values: mac_magic is 254 and mac_forward_magic is 253.

Can you paste the output of cphaprob state and cphaprob -a if?

Best,
Andy

MasterChief117 · ‎2022-07-14

gw2

cphaprob -a if

CCP mode: Manual (Multicast)
Required interfaces: 5
Required secured interfaces: 1

Interface Name: Status:

eth5 UP
eth8 UP
Sync (S) UP
bond0.11 (LS) UP
bond0.48 (LS) UP

S - sync, LM - link monitor, HA/LS - bond type

Virtual cluster interfaces: 16

gw1

cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 5
Required secured interfaces: 1

Interface Name: Status:

eth5 UP
eth8 UP
Sync (S) UP
bond0.11 (LS) UP
bond0.48 (LS) UP

S - sync, LM - link monitor, HA/LS - bond type

Virtual cluster interfaces: 16
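
The mismatch is easy to miss when reading the two outputs side by side. A minimal Python sketch that pulls the CCP mode line out of each member's cphaprob -a if output and compares them, assuming the output has been saved as text (the function name `ccp_mode` is made up for this example):

```python
import re

def ccp_mode(cphaprob_if_output):
    """Return the text after 'CCP mode:' or None if the line is absent."""
    match = re.search(r"^CCP mode:\s*(.+?)\s*$", cphaprob_if_output, re.MULTILINE)
    return match.group(1) if match else None

# First lines of the two outputs pasted above.
gw1 = "CCP mode: Manual (Unicast)\nRequired interfaces: 5\nRequired secured interfaces: 1\n"
gw2 = "CCP mode: Manual (Multicast)\nRequired interfaces: 5\nRequired secured interfaces: 1\n"

modes = {"gw1": ccp_mode(gw1), "gw2": ccp_mode(gw2)}
if len(set(modes.values())) > 1:
    print("CCP mode mismatch:", modes)
else:
    print("CCP mode matches:", modes)
```

For the outputs in this post, the sketch reports a mismatch between Manual (Unicast) on gw1 and Manual (Multicast) on gw2.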

the_rock · ‎2022-07-14

[Expert@HostName]# cphaconf set_ccp unicast

I don't believe multicast is supported in R81.10, from what I remember of the last time a customer and I tried changing it, but give it a go. It's been a few months, so it's possible it was another protocol.

Best,
Andy

MasterChief117 · ‎2022-07-14

I changed it, but the issue still happens. This is what cphaprob state on gw2 looks like after policy installation. There is no unique address for gw1 either:

ID Unique Address Assigned Load State Name

1 none 0% GW-01
2 (local) a.b.c.d 100% GW-02

Active PNOTEs: LPRB, IAC

Last member state change event:
Event Code: CLUS-116505
State change: DOWN -> ACTIVE(!)
Reason for state change: All other machines are dead (timeout), Interface Sync is down (Cluster Control Protocol packets are not received)
Event time: Thu Jul 14 16:59:52 2022

Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Available on member 1
Event time: Thu Jul 14 16:59:52 2022

Cluster failover count:
Failover counter: 6
Time of counter reset: Fri Jun 24 08:32:48 2022 (reboot)
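
When collecting this output after every policy installation, it can help to extract the state change and its reason automatically. A rough Python sketch, assuming the cphaprob state output is available as text in the flattened form shown above; `parse_sections` is a hypothetical helper, and the section and field names are simply taken from the lines of the output itself.

```python
def parse_sections(text):
    """Group 'Key: value' lines under the preceding 'Section title:' line."""
    sections = {}
    current = sections.setdefault("", {})  # fields seen before any section title
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":"):
            # A line ending in a colon with no value starts a new section.
            current = sections.setdefault(line[:-1], {})
        elif ": " in line:
            key, _, value = line.partition(": ")
            current[key] = value
    return sections

# Excerpt of the output pasted above.
sample = """ID Unique Address Assigned Load State Name
1 none 0% GW-01
2 (local) a.b.c.d 100% GW-02

Active PNOTEs: LPRB, IAC

Last member state change event:
Event Code: CLUS-116505
State change: DOWN -> ACTIVE(!)
Reason for state change: All other machines are dead (timeout), Interface Sync is down (Cluster Control Protocol packets are not received)
Event time: Thu Jul 14 16:59:52 2022
"""

event = parse_sections(sample)["Last member state change event"]
print(event["State change"])
print(event["Reason for state change"])
```

Here the reason field shows that gw2 went active because it stopped receiving CCP packets on the Sync interface, which matches the symptom described at the start of the thread.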

the_rock · ‎2022-07-14

What mode did you change it to?

Best,
Andy

MasterChief117 · ‎2022-07-14

I changed both members to unicast

the_rock · ‎2022-07-14

So what is the current output of cphaprob state and cphaprob -a if?

Best,
Andy

MasterChief117 · ‎2022-07-14

CCP mode: Manual (Unicast)
Required interfaces: 5
Required secured interfaces: 1

same for both

the_rock · ‎2022-07-14

Okay, and what about the magic_mac settings? Are they the same? If so, then it's possible to make it work. You may need to do a quick failover by running clusterXL_admin down on the master, or cphastop and cphastart on BOTH.

Best,
Andy

charlokt · ‎2022-07-14

Same thing happened to me. I upgraded from R80.30 to R81.10 and the cluster was not established, as it kept poking both members. Would it be necessary to configure the CCP mode so that the cluster remains Active > Passive, and keep it in the mode it had in R80.30?

the_rock · ‎2022-07-14

I upgraded one customer's cluster from R80.40 to R81.10 and never had this problem. The mode has always been unicast.

Best,
Andy

Chris_Atkinson · ‎2022-07-14

It's also worth checking CCP encryption per sk169777.

MasterChief117 · ‎2022-07-15

This might be worth a try. I will probably check this on Monday and come back here with the results.

MasterChief117 · ‎2022-07-19

Unfortunately, this did not work.

MasterChief117 · ‎2022-07-19

This is exactly the behavior I am seeing too. Still no meaningful reply from TAC.

MasterChief117 · ‎2022-07-19

Rebooting the standby member has solved the issue for now. I will monitor it for a while.

Chris_Atkinson · ‎2022-07-25

Has it remained stable since?

(Note sk174510 was recently published for an MVC upgrade issue)

MasterChief117 · ‎2022-08-01

It has. I have marked the post as the solution. Thank you for your help.

Oliver_Fink · ‎2022-07-14

So, if kernel parameters are different – what is the content of $FWDIR/boot/modules/fwkern.conf of both nodes? Any kernel parameters set there differently?

MasterChief117 · ‎2022-07-15

The parameters mentioned above are not in fwkern.conf

the_rock · ‎2022-07-15

I believe they would be there if they had been set manually:

$FWDIR/boot/modules/fwkern.conf

Best,
Andy
