Rafael_Lima1
Participant

Problem after migration to R80.20 - ClusterXL

After migrating from R80.10 to R80.20, our cluster is logging the following messages:

Feb 25 16:40:45 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Feb 25 16:40:46 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved

Feb 26 06:55:33 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Feb 26 06:55:33 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved

Feb 26 13:49:52 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Feb 26 13:49:52 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved

Backup traffic passes through this cluster, causing high resource consumption. Before the migration we had the same consumption, but these messages/errors did not occur.

Also, we are seeing a connectivity problem on our servers, and the times match those in the messages above. Can these messages indicate a traffic disruption? It does not affect all servers, but on the most sensitive ones the connection is interrupted, causing serious problems on servers that use NFS.
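To cross-check the server outage times against the cluster flaps, the CLUS lines in /var/log/messages can be parsed into timestamped events. This is a minimal sketch that assumes the exact log format quoted above; the field layout may differ between builds:

```python
import re

# Matches ClusterXL kernel log lines of the form quoted above, e.g.:
# "Feb 25 16:40:45 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: <event> | Reason: <reason>"
CLUS_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]+ \d{4}) (?P<host>\S+) kernel: "
    r"\[fw4_\d+\];(?P<code>CLUS-\d+-\d+): (?P<event>[^|]+)\| Reason: (?P<reason>.+)$"
)

def parse_clus_events(lines):
    """Return a list of (timestamp, code, event, reason) tuples,
    skipping lines that are not CLUS state-change messages."""
    events = []
    for line in lines:
        m = CLUS_RE.match(line.strip())
        if m:
            events.append((m.group("ts"), m.group("code"),
                           m.group("event").strip(), m.group("reason")))
    return events
```

Feeding the messages file through this and comparing the resulting timestamps against the application-side outage reports makes the correlation (or lack of it) easy to see.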

One more detail: "show cluster failover" reports the following, but we never ran cpstop on the gateways:

FWINTRA1> show cluster failover

Last cluster failover event:

Transition to new ACTIVE: Member 1 -> Member 2

Reason: FULLSYNC PNOTE - cpstop

Event time: Tue Feb 26 15:02:13 2019

Cluster failover count:

Failover counter: 4

Time of counter reset: Mon Feb 11 21:30:31 2019 (reboot)

Cluster failover history (last 20 failovers since reboot/reset on Mon Feb 11 21:30:31 2019):

No. Time: Transition: CPU: Reason:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1 Tue Feb 26 15:02:13 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

2 Tue Feb 26 13:49:52 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

3 Tue Feb 26 06:55:33 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

4 Mon Feb 25 16:40:45 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

_______________________________________________________________________________________________

FWINTRA2> show cluster failover

Last cluster failover event:

Transition to new ACTIVE: Member 1 -> Member 2

Reason: FULLSYNC PNOTE - cpstop

Event time: Tue Feb 26 15:02:13 2019

Cluster failover count:

Failover counter: 4

Time of counter reset: Mon Feb 11 21:30:31 2019 (reboot)

Cluster failover history (last 20 failovers since reboot/reset on Mon Feb 11 21:30:31 2019):

No. Time: Transition: CPU: Reason:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1 Tue Feb 26 15:02:13 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

2 Tue Feb 26 13:49:52 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

3 Tue Feb 26 06:55:33 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

4 Mon Feb 25 16:40:45 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

Environment:
Check Point's software version R80.20 - Build 255
kernel: R80.20 - Build 014
JHF Take: 17
OpenServer - Dell PowerEdge R730

14 Replies
Rafael_Lima1
Participant

The problem has now occurred again, impacting our traffic.

"cphaprob -a if" shows no errors, and in cpview we also do not see a failover, but traffic on some connections is being interrupted.


In /var/log/messages:
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-110300-2: State change: STANDBY -> DOWN | Reason: Interface eth5 is down (Cluster Control Protocol packets are not received)
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 2814224.1 and now 2814227.1
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-116505-2: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface eth5 is down (Cluster Control Protocol packets are not received)
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
Feb 27 13:31:05 2019 FWINTRA1 kernel: [fw4_1];CLUS-110305-2: State change: ACTIVE! -> DOWN | Reason: Interface eth5 is down (Cluster Control Protocol packets are not received)
Feb 27 13:31:05 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
Feb 27 13:31:05 2019 FWINTRA1 kernel: [fw4_1];CLUS-114802-2: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 1)

In "show cluster failover": 

 Wed Feb 27 13:31:04 2019  Member 1 -> Member 2  00    FULLSYNC PNOTE - cpstop

Dave_Cullen
Explorer

Hi Rafael, did you get any support on this?

I also have exactly the same issue on my R80.20 gateway.

I am going to try to install the latest jumbo to see if this makes any difference at all...

Rafael_Lima1
Participant

Hi Dave,

We have an SR open; the last information requested was a tcpdump of the sync interface during the problem. We've sent the files and they are analyzing them.
Did you make any progress after installing the latest JHF?

Ted_Czarnecki
Explorer

What ClusterXL mode are you using - H/A or Load Sharing?

Rafael_Lima1
Participant

Hi Ted,

Cluster mode is H/A. 

Vladimir
Champion

Was the CCP running in broadcast or multicast before the upgrade to R80.20?

R80.20 switches it to "Auto", which may elect a method different from the one you were using in a stable deployment on the earlier version.

Rafael_Lima1
Participant

Hello Vladimir,

Before, it was at the default (unicast); we recently switched to broadcast, but the problem remains. Yesterday we applied JHF take 33 and are monitoring to see whether the problem is fixed.

Rafael_Lima1
Participant

Hello, we applied JHF take 33 and enabled Priority Queues; however, a new problem arose. The firewall hangs at times (it has occurred on both gateways), and we need to restart the server locally. During the problem no log messages appear, but we were watching cpview and saw the CPU overview at 100% utilization. We did not have a high number of connections at the time.
The "show cluster failover" command shows the following entry:
1 Thu Mar 7 16:53:17 2019 Member 1 -> Member 2 00 Reboot
However, the server was only rebooted about 15 minutes later, because we had to go to the datacenter in person.

Lucas_Costa
Participant

Hello all,

 

Were you able to solve this issue? I'm seeing the same thing on R80.20 with Hotfix 118:

 

Nov 26 11:03:29 2019 UCUBFW01 kernel: [fw4_1];CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Nov 26 11:03:29 2019 UCUBFW01 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
Nov 26 11:04:22 2019 UCUBFW01 kernel: [fw4_1];CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Nov 26 11:04:24 2019 UCUBFW01 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
Nov 26 11:04:26 2019 UCUBFW01 kernel: [fw4_1];CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

Dmitry_Krupnik
Employee Alumnus

Hello Lucas,

From the log you provided I see "Reason: Interface Sync is down". Could you take a look at the Sync interface and the related connection on both members? You can also check errors and drops on the Sync interface in the output of "ifconfig" (from Expert mode).
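As a quick way to act on that advice, the error/drop counters can be pulled out of legacy net-tools `ifconfig` output. A minimal sketch, assuming the classic `errors:`/`dropped:` field layout (the sample output below is illustrative, not taken from the thread):

```python
import re

def iface_error_counters(ifconfig_text):
    """Extract RX/TX errors and dropped counts from legacy net-tools
    `ifconfig` output for one interface."""
    counters = {}
    for m in re.finditer(r"(RX|TX) packets:\d+ errors:(\d+) dropped:(\d+)",
                         ifconfig_text):
        counters[m.group(1)] = {"errors": int(m.group(2)),
                                "dropped": int(m.group(3))}
    return counters

# Illustrative sample of the legacy ifconfig layout (hypothetical numbers):
sample = """Sync  Link encap:Ethernet  HWaddr 00:11:22:33:44:55
          RX packets:102938 errors:17 dropped:4 overruns:0 frame:0
          TX packets:98321 errors:0 dropped:0 overruns:0 carrier:0"""
```

Counters that keep climbing between two snapshots point at the cable, NIC, or driver rather than ClusterXL itself.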

Lucas_Costa
Participant

Hi,

 

Thank you for the update. Yes, the Sync interface is directly connected between the two devices and normally has connectivity. We don't see errors on the interfaces.

We also have a case open with TAC, but no progress so far.

FedericoMeiners
Advisor

Rafael,

Hope you are doing well. I had similar issues in the past, though not specifically after upgrading to R80.20. A few things to check:

- In my experience, broadcast works best for ClusterXL.

- Keep your sync network as small as possible (/29 or /30).

- Is eth5 your sync interface? Look for errors on the sync interfaces, and where possible bond two interfaces (EtherChannel) for sync.

- How many cores are licensed for your open server? If possible, assign a dedicated core to the sync interface; if not, try not to share it with much (i.e., only with the mgmt interface).

I think the real issue here is the CPU spikes, which affect the ClusterXL sync process.

Hope it helps

____________
https://www.linkedin.com/in/federicomeiners/
Malopro
Participant

Rafael ...

 

Maybe it's the version?

I started with ClusterXL (replacing VRRP) on R77.30 and it worked like a charm in both modes: H/A initially, then Load Sharing using Unicast due to an ISP equipment limitation.

I also use the Virtual MAC method.

I recently upgraded to R80.10 and it is still working.

Like you, I use a dedicated Sync interface, simply connected with a crossover Cat6 UTP cable.

 

Warm regards

Darina2019
Explorer

Hello Community,

 

I am experiencing the same issue.

Three days after we enabled the IPS blade on our Check Point R80.20, the dedicated Sync interface on our 5800 appliance started flapping and receiving errors.

Today we moved the sync interface to a free unused port and started using a new cable.

After the change, the Sync interface started flapping again and the same errors appeared.

We then disabled IPS from the CLI and in the blade settings. So far there have been no more failovers, but we still see issues on the newly configured sync interface.

If someone has had the same issue and escalated it to support, could you please share any solution that was provided?

Thanks!
