Geomix7
Collaborator

ClusterXL - Failover due to Fullsync PNOTE ON

Hello All,

We faced an unexpected failover due to a Fullsync PNOTE ON error (CLUS-120108).

According to SK125152:

CLUS-120108: Fullsync PNOTE <ON | OFF>
ON - problem

 

After the failover, I verified that sync communication is OK and this member is in standby mode in the cluster.

In addition, see the syncstat statistics below.

[screenshot: syncstat output]
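For reference, verifying this from the CLI typically looks like the following (a sketch; cphaprob syncstat is the R80.20+ command, while older releases use fw ctl syncstat instead):

# Show this member's state and the state of its peers
cphaprob state

# Show Delta Sync statistics (R80.20+); on older releases: fw ctl syncstat
cphaprob syncstat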

Has anyone faced the same issue? Do you know what triggers this behavior?

Thanks

8 Replies
_Val_
Admin

Full sync only happens when the second member is coming up from boot/initialization, before it becomes a fully operational cluster member (usually in standby mode). Check the uptime; it seems to me one of your boxes rebooted itself.

Also, a Fullsync PNOTE should not cause a failover. Please post the logs you got and the full message.
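Both points can be checked quickly (a sketch using standard Gaia/Linux tools):

# How long has each member been up?
uptime

# Pull all ClusterXL state-change messages from the system log
grep -i CLUS /var/log/messages*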

Geomix7
Collaborator

Hello Val,

 

Uptime is 160 days for all members of the cluster. The failover occurred on Sun Mar 14 18:44:26 2021.

 

Please find attached the cpwd_admin list output and the messages log.

Thanks

 

_Val_
Admin

You did something with the SNMP settings, which then called for cpstop/cpstart, which in turn caused the Active member to go down.

It says it right there in your messages:

Mar 14 18:44:24 2021 HQ pm[16877]: Disabled snmpd
Mar 14 18:44:24 2021 HQ xpand[16895]: Configuration changed from localhost by user admin by the service /usr/sbin/snmpd
Mar 14 18:44:24 2021 HQ snmpd: Destroying the lists of sensors
Mar 14 18:44:24 2021 HQ pm[16877]: Reaped:  snmpd[8268]
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-120108-2: Fullsync PNOTE ON
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-120130-2: cpstop
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-113500-2: State change: ACTIVE -> DOWN | Reason: FULLSYNC PNOTE - cpstop
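In other words, the Fullsync PNOTE here is the mechanism cpstop uses to take a member out of the cluster cleanly, not a sync failure in itself. The registered critical devices (PNOTEs) can be inspected like this (standard ClusterXL commands):

# List only critical devices currently reporting a problem
cphaprob list

# List all registered critical devices, healthy ones included
cphaprob -l list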
Geomix7
Collaborator

We did not change anything manually in the SNMP configuration. I have already opened a case with support and will update the post accordingly.

Thanks

_Val_
Admin

Sure, please keep us posted.

This line, however, suggests something has been done:

Mar 14 18:44:24 2021 HQ xpand[16895]: Configuration changed from localhost by user admin
Geomix7
Collaborator

Hello all,

Support could not find the root cause of the issue.

Since the issue occurred only once, the only suggestion they provided is to update to the latest Jumbo Hotfix take (we are on take 78), because it resolves many performance and stability issues.
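For anyone comparing takes, the installed Jumbo Hotfix level can be checked from the CLI (a sketch; installed_jumbo_take is the usual helper script where present, and cpinfo works everywhere):

# Show the installed Jumbo Hotfix take, if the helper script exists
installed_jumbo_take

# Alternatively, list every installed hotfix per blade
cpinfo -y all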

 

Thanks
the_rock
Legend

Val is definitely correct... based on that message in the logs you sent, it seems that someone manually changed something in the config. Maybe try the command below: cd /var/log, then run grep -i PNOTE messages.*
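Spelled out (the -B/-A flags are optional and just add context lines around each match):

cd /var/log
grep -i PNOTE messages.*

# or, with two lines of context around each hit
grep -i -B2 -A2 PNOTE messages.*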

 

Andy

Geomix7
Collaborator

Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-120108-2: Fullsync PNOTE ON
Mar 14 18:44:26 2021 HQ kernel: [fw4_1];CLUS-113500-2: State change: ACTIVE -> DOWN | Reason: FULLSYNC PNOTE - cpstop
Mar 14 18:44:28 2021 HQ kernel: [fw4_0];fwhak_drv_report_process_state: no running process, reporting pnote fwd
Mar 14 18:44:30 2021 HQ kernel: [fw4_1];CLUS-120105-2: routed PNOTE ON
Mar 14 18:44:31 2021 HQ kernel: [fw4_0];fwhak_drv_report_process_state: no running process, reporting pnote cphad
Mar 14 18:45:34 2021 HQ kernel: [fw4_1];CLUS-113601-2: State remains: INIT | Reason: FULLSYNC PNOTE - cpstart
Mar 14 18:45:36 2021 HQ kernel: [fw4_1];CLUS-100201-2: Failover member 2 -> member 1 | Reason: FULLSYNC PNOTE - cpstop
Mar 14 18:46:22 2021 HQ kernel: [fw4_1];CLUS-120207-2: LPRB PNOTE : local probing has started on interface bond1.399
Mar 14 18:46:51 2021 HQ kernel: [fw4_1];CLUS-120207-2: LPRB PNOTE : local probing has started on interface bond2
Mar 14 18:46:52 2021 HQ kernel: [fw4_1];CLUS-120207-2: LPRB PNOTE : local probing had stopped on interface bond2

