
Problem after migration to R80.20 - ClusterXL

After migrating from R80.10 to R80.20, our cluster is logging the following messages.

Feb 25 16:40:45 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Feb 25 16:40:46 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved

Feb 26 06:55:33 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Feb 26 06:55:33 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved

Feb 26 13:49:52 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Feb 26 13:49:52 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved

Backup traffic passes through this cluster, causing high resource consumption. Before the migration we had the same level of consumption, but these messages/errors did not occur.

We are also investigating a connectivity problem on our servers, and the timestamps match those in the messages above. Could these events indicate traffic disruption? It does not affect all servers, but on the most sensitive ones the connection is interrupted, causing serious problems on servers that use NFS.

One more detail: the "show cluster failover" command reports the following, but we never ran cpstop on the gateways:

FWINTRA1> show cluster failover

Last cluster failover event:

Transition to new ACTIVE: Member 1 -> Member 2

Reason: FULLSYNC PNOTE - cpstop

Event time: Tue Feb 26 15:02:13 2019

Cluster failover count:

Failover counter: 4

Time of counter reset: Mon Feb 11 21:30:31 2019 (reboot)

Cluster failover history (last 20 failovers since reboot/reset on Mon Feb 11 21:30:31 2019):

No. Time: Transition: CPU: Reason:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1 Tue Feb 26 15:02:13 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

2 Tue Feb 26 13:49:52 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

3 Tue Feb 26 06:55:33 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

4 Mon Feb 25 16:40:45 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

_______________________________________________________________________________________________

FWINTRA2> show cluster failover

Last cluster failover event:

Transition to new ACTIVE: Member 1 -> Member 2

Reason: FULLSYNC PNOTE - cpstop

Event time: Tue Feb 26 15:02:13 2019

Cluster failover count:

Failover counter: 4

Time of counter reset: Mon Feb 11 21:30:31 2019 (reboot)

Cluster failover history (last 20 failovers since reboot/reset on Mon Feb 11 21:30:31 2019):

No. Time: Transition: CPU: Reason:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1 Tue Feb 26 15:02:13 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

2 Tue Feb 26 13:49:52 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

3 Tue Feb 26 06:55:33 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

4 Mon Feb 25 16:40:45 2019 Member 1 -> Member 2 00 FULLSYNC PNOTE - cpstop

Environment:
Check Point's software version R80.20 - Build 255
kernel: R80.20 - Build 014
JHF Take: 17
OpenServer - Dell PowerEdge R730
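For anyone comparing against a similar setup, the standard ClusterXL diagnostics can be run on each member to correlate these events. This is a minimal sketch using the documented R80.20 commands; run them from expert mode (the clish equivalent of the history output above is `show cluster failover`):

```shell
# Cluster state of both members as seen from this member:
cphaprob state

# Monitored cluster interfaces (also prints the CCP mode in use):
cphaprob -a if

# Full list of pnotes (problem notifications) and their states:
cphaprob -l list
```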

11 Replies

Re: Problem after migration to R80.20 - ClusterXL

The problem has now occurred again, impacting our traffic.

"cphaprob -a if" shows no errors, and via cpview we also do not see the failover, yet traffic on some connections is interrupted.


In /var/log/messages:
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-110300-2: State change: STANDBY -> DOWN | Reason: Interface eth5 is down (Cluster Control Protocol packets are not received)
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 2814224.1 and now 2814227.1
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-116505-2: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface eth5 is down (Cluster Control Protocol packets are not received)
Feb 27 13:31:04 2019 FWINTRA1 kernel: [fw4_1];CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
Feb 27 13:31:05 2019 FWINTRA1 kernel: [fw4_1];CLUS-110305-2: State change: ACTIVE! -> DOWN | Reason: Interface eth5 is down (Cluster Control Protocol packets are not received)
Feb 27 13:31:05 2019 FWINTRA1 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
Feb 27 13:31:05 2019 FWINTRA1 kernel: [fw4_1];CLUS-114802-2: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 1)

In "show cluster failover": 

 Wed Feb 27 13:31:04 2019  Member 1 -> Member 2  00    FULLSYNC PNOTE - cpstop
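Since the logs blame eth5 for not receiving Cluster Control Protocol packets, one way to confirm is to capture CCP on that interface while the event occurs. CCP has historically been visible in tcpdump as UDP port 8116; eth5 is the interface named in the logs above, so adjust to your own sync interface:

```shell
# Watch for CCP (UDP port 8116) on the suspect interface; a healthy
# cluster shows a steady stream of packets from both members:
tcpdump -ni eth5 port 8116

# Link-level counters on the same interface; growing error/drop
# counts point to a physical or driver problem:
ifconfig eth5 | grep -E "errors|dropped|overruns"
```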


Re: Problem after migration to R80.20 - ClusterXL

Hi Rafael, did you get any support on this?

I also have exactly the same issue on my R80.20 gateway.

I am going to try to install the latest jumbo to see if this makes any difference at all...


Re: Problem after migration to R80.20 - ClusterXL

Hi Dave,

We have an SR open; the latest information requested was a tcpdump of the sync interface during the problem. We have sent the files and they are analyzing them.
Did you have any progress after installing the latest JHF?


Re: Problem after migration to R80.20 - ClusterXL

What ClusterXL mode are you using - HA or Load Sharing?


Re: Problem after migration to R80.20 - ClusterXL

Hi Ted,

Cluster mode is H/A. 

Vladimir

Re: Problem after migration to R80.20 - ClusterXL

Was the CCP running in broadcast or multicast before the upgrade to R80.20?

R80.20 switches it to "Auto" and may elect a method different from the one you were using in a stable deployment on the earlier version.
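To see which mode CCP negotiated, and to pin it explicitly, the documented commands are roughly as follows (run the change on every cluster member; mode names here are the commonly documented ones, so check your release notes):

```shell
# The CCP mode in use is printed in the output of:
cphaprob -a if

# Pin CCP to a specific mode instead of Auto, on all members:
cphaconf set_ccp broadcast    # or: multicast
```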


Re: Problem after migration to R80.20 - ClusterXL

Hello Vladimir,

Before, it was at the default (unicast); we recently switched to broadcast, but the problem remains. Yesterday we applied JHF Take 33 and we are monitoring to see whether the problem is fixed.


Re: Problem after migration to R80.20 - ClusterXL

Hello, we applied JHF Take 33 and enabled Priority Queues, but a new problem has appeared. The firewall hangs at times (it has occurred on both gateways) and we need to restart the server locally. During the problem no log messages appear, but while watching cpview we saw all CPUs at 100% utilization in the CPU overview. At the time we did not have a high number of connections.
The "show cluster failover" command shows the following entry:
1 Thu Mar 7 16:53:17 2019 Member 1 -> Member 2 00 Reboot
However, the server was only actually rebooted about 15 minutes later, because we had to go to the datacenter in person.
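For the CPU-saturation symptom described above, a couple of CoreXL-level checks can show which firewall instances are busy and what the Priority Queues mode is. This is a sketch using commonly documented commands; verify against your release's documentation:

```shell
# Per-CoreXL-instance connection and peak statistics:
fw ctl multik stat

# Priority Queues status/mode (prints the current mode and options):
fw ctl multik prioq
```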


Re: Problem after migration to R80.20 - ClusterXL

Hello all,

 

Were you able to solve this issue? I am seeing the same thing on R80.20 with Hotfix Take 118:

 

Nov 26 11:03:29 2019 UCUBFW01 kernel: [fw4_1];CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Nov 26 11:03:29 2019 UCUBFW01 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
Nov 26 11:04:22 2019 UCUBFW01 kernel: [fw4_1];CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
Nov 26 11:04:24 2019 UCUBFW01 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
Nov 26 11:04:26 2019 UCUBFW01 kernel: [fw4_1];CLUS-110300-1: State change: STANDBY -> DOWN | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

Employee+

Re: Problem after migration to R80.20 - ClusterXL

Hello Lucas,

From the log you provided I see: "Reason: Interface Sync is down". Could you take a look at the Sync interface and its physical connection on both members? You can also check errors and drops on the Sync interface in the output of "ifconfig" (from expert mode).
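The check suggested above might look like this from expert mode ("Sync" is the interface name from the logs in this thread; substitute your own):

```shell
# Error/drop counters for the sync interface; non-zero and growing
# values suggest a cabling, NIC, or driver problem:
ifconfig Sync | grep -E "errors|dropped"

# RX/TX and error counters for all interfaces in one table:
netstat -i
```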


Re: Problem after migration to R80.20 - ClusterXL

Rafael,

Hope you are doing fine. I had similar issues in the past, though not specifically after upgrading to R80.20. A few things to check:

- In my experience broadcast works the best on ClusterXL.

- Keep your sync network as small as possible (/29 or /30).

- Is eth5 your sync interface? Look for errors on the sync interfaces, and always try to bond two interfaces (etherchannel) for sync.

- How many cores do you have licensed for your open server? If possible, try to assign a dedicated core to the sync interface; if not, try not to share it much (i.e., share it only with the mgmt interface).
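The bonded sync interface suggested above can be created in Gaia clish roughly like this. The bond group id, member interfaces, mode, and IP are illustrative values, not taken from this thread, and the member interfaces must not have IPs assigned before being added to the bond:

```shell
# Gaia clish - bond two physical NICs into an 802.3ad group for sync:
add bonding group 1
add bonding group 1 interface eth4
add bonding group 1 interface eth5
set bonding group 1 mode 8023AD
set interface bond1 ipv4-address 192.168.100.1 mask-length 30
save config
```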

I think the real issue here is the CPU spikes, which affect the ClusterXL sync process.

Hope it helps

____________
https://www.linkedin.com/in/federicomeiners/