Geomix7
Collaborator

ClusterXL - Failover due to Fullsync PNOTE ON

Hello All,

We faced an unexpected failover due to a Fullsync PNOTE ON error (CLUS-120108).

According to SK125152:

CLUS-120108

Fullsync PNOTE <ON | OFF>

ON - problem
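For reference, the state of the Fullsync critical device (and of every other PNOTE) can be listed from expert mode; I believe the following works on recent ClusterXL versions:

# List all critical devices (PNOTEs) and their current state
cphaprob -l list

# Overall cluster state of this member
cphaprob state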

 

After the failover, I verified that sync communication is OK and that this member is now in standby mode in the cluster.

In addition, see the syncstat statistics below.

(syncstat output screenshot)
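For reference, this kind of output can be pulled from expert mode roughly like this (cphaprob syncstat exists on newer releases; older ones show similar counters in fw ctl pstat):

# Delta-sync statistics for this member (newer releases)
cphaprob syncstat

# Older releases: sync counters are included in the kernel statistics
fw ctl pstat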

Has anyone faced the same issue? Do you know what triggers this behavior?

Thanks

12 Replies
_Val_
Admin

Full sync only happens when the second member is coming from boot/initialization, before it becomes a fully operational cluster member (usually in standby mode). Check the uptime; it seems to me one of your boxes rebooted itself.

Also, a Fullsync PNOTE should not cause a failover. Please post the logs you got and the full message.
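A quick way to rule a silent reboot in or out is to check the uptime and reboot history on each member from expert mode, for example:

# How long this member has been up
uptime

# Recent reboot history
last reboot | head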

Geomix7
Collaborator

Hello Val,

 

Uptime is 160 days for all members of the cluster. The failover occurred on Sun Mar 14 18:44:26 2021.

 

Please find attached the cpwd_admin list output and the messages file.

Thanks

 

_Val_
Admin

You did something with the SNMP settings, which then called for a cpstop/cpstart, which in turn caused the Active member to go down.

It says it right there in your messages:

Mar 14 18:44:24 2021 HQ pm[16877]: Disabled snmpd
Mar 14 18:44:24 2021 HQ xpand[16895]: Configuration changed from localhost by user admin by the service /usr/sbin/snmpd
Mar 14 18:44:24 2021 HQ snmpd: Destroying the lists of sensors
Mar 14 18:44:24 2021 HQ pm[16877]: Reaped:  snmpd[8268]
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-120108-2: Fullsync PNOTE ON
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-120130-2: cpstop
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-113500-2: State change: ACTIVE -> DOWN | Reason: FULLSYNC PNOTE - cpstop
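If you want to see the whole sequence yourself, pulling everything around that timestamp out of the messages files usually tells the story; something along these lines (adjust the patterns to your logs):

cd /var/log
# All cluster state-machine and SNMP messages around the event
grep "Mar 14 18:44" messages* | grep -E "CLUS-|snmpd|xpand"
# Any configuration changes logged by xpand
grep -i "Configuration changed" messages*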
Geomix7
Collaborator

We did not change anything manually in the SNMP configuration. I have already opened a case with support and will update the post accordingly.

Thanks

_Val_
Admin

Sure, please keep us posted.

This line, however, suggests something has been done:

Mar 14 18:44:24 2021 HQ xpand[16895]: Configuration changed from localhost by user admin
Geomix7
Collaborator

Hello all,

Support could not find the root cause of the issue.

Since the issue occurred only once, the only suggestion they provided was to update to the latest Jumbo Hotfix take (we are on Take 78), as it resolves many performance and stability issues.
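For anyone who wants to verify the installed take before and after the upgrade, the installed hotfixes (including the Jumbo Hotfix Accumulator take) can be listed from expert mode; as far as I know this works on all recent versions:

# Lists all installed hotfixes, including the Jumbo take
cpinfo -y all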

 

Thanks

 

 

the_rock
Legend

Val is definitely correct... based on that message in the logs you sent, it seems that someone manually changed something in the config. Maybe try the commands below: cd /var/log and then run grep -i PNOTE messages.*

 

Andy

Geomix7
Collaborator

Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-120108-2: Fullsync PNOTE ON
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-113500-2: State change: ACTIVE -> DOWN | Reason: FULLSYNC PNOTE - cpstop
Mar 14 18:44:28 2021 HQ kernel: [fw4_0];fwhak_drv_report_process_state: no running process, reporting pnote fwd
Mar 14 18:44:30 2021 HQ kernel: [fw4_1];CLUS-120105-2: routed PNOTE ON
Mar 14 18:44:31 2021 HQ kernel: [fw4_0];fwhak_drv_report_process_state: no running process, reporting pnote cphad
Mar 14 18:45:34 2021 HQ kernel: [fw4_1];CLUS-113601-2: State remains: INIT | Reason: FULLSYNC PNOTE - cpstart
Mar 14 18:45:36 2021 HQ kernel: [fw4_1];CLUS-100201-2: Failover member 2 -> member 1 | Reason: FULLSYNC PNOTE - cpstop
Mar 14 18:46:22 2021 HQ kernel: [fw4_1];CLUS-120207-2: LPRB PNOTE : local probing has started on interface bond1.399
Mar 14 18:46:51 2021 HQ kernel: [fw4_1];CLUS-120207-2: LPRB PNOTE : local probing has started on interface bond2
Mar 14 18:46:52 2021 HQ kernel: [fw4_1];CLUS-120207-2: LPRB PNOTE : local probing had stopped on interface bond2

Matlu
Advisor

Hello,
I have an error related to a VSX cluster that fails to form; apparently there is a configuration error, or at least that is what I understand from the following message that I have filtered.
The environment runs on VMware, and both members of the VSX cluster use the Eth2 interface as the Sync interface.

(screenshots: VSXLAB1.jpg, VSXLAB2.jpg)
Could someone guide me on how to correct this problem?
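For anyone reproducing this, the cluster and Sync-interface state on a VSX member can be checked from expert mode roughly as follows (assuming the VS0 context):

# Switch to the VS0 (management) context
vsenv 0
# Cluster state as seen by this member
cphaprob state
# Monitored cluster interfaces, including the Sync interface
cphaprob -a if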

emmap
Employee

You may need to enable promiscuous mode or something similar in your VMware environment for that virtual network segment. Something is blocking the CCP packets.
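One way to confirm whether CCP is actually arriving is to capture on the Sync interface of both members; a rough check, assuming the Sync NIC is eth2 and CCP is on its usual UDP port 8116:

# Capture Cluster Control Protocol traffic on the Sync interface
tcpdump -nni eth2 udp port 8116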

Matlu
Advisor

Can the alert that one “sees” in the CLI of the VSX Cluster be a “false positive”?

At the SmartConsole level everything looks fine (green), and most importantly, the VSX cluster does not appear alarmed.

(screenshots: VSXLAB3.jpg, VSXLAB4.jpg, VSXLAB5.jpg)

I can install policy without errors on every one of the VSs I have created; however, at the CLI level it looks like the cluster is broken.

At times I see that the Eth2 interface, which is the Sync interface, is DOWN, and at other times I see what I am sharing in this update.

(screenshot: VSXLAB6.jpg)
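For what it's worth, whether the link itself is flapping can also be checked outside of ClusterXL; a couple of generic checks from expert mode, assuming the Sync NIC is eth2 inside the VM:

# Link status, speed and duplex of the Sync NIC
ethtool eth2
# Kernel messages about the interface going up or down
dmesg | grep -i eth2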

emmap
Employee

No, the CLI output is reliable. Your Sync interface is having issues that need investigating.

