Hello.
Environment:
CPAP-SG3100-NGTP x2
Product version: Check Point Gaia R80.20
OS build: 101
OS kernel version: 2.6.18-92cpx86_64
OS edition: 64-bit
I have a cluster configuration with two appliances.
The memory usage of Unit 2 became high, so I rebooted it.
I had to disconnect all of Unit 2's cables for the reboot, so I pulled them all out.
After rebooting, I reconnected one HA cable (eth4), but the link repeatedly went up and down for about 15 minutes.
I would like to know whether this is expected behavior.
Network Interfaces:
Mgmt Private (Non-Monitored)
bond0
eth1
eth2
eth3
eth4 HA-1
eth5 HA-2
Unit 2 /var/log/messages:
May 17 09:36:35 2023 XXXXXXXXX-02 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 17 09:36:36 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status up for interface eth4, enabling it in 200 ms.
May 17 09:36:36 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status definitely up for interface eth4.
May 17 09:36:36 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-114205-2: State change: ACTIVE! -> STANDBY | Reason: Member state has been changed due to higher priority of remote cluster member 1 in PRIMARY-UP cluster
May 17 09:36:36 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
May 17 09:36:36 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110200-2: State change: STANDBY -> DOWN | Reason: Interface eth1 is down (disconnected / link down)
May 17 09:36:37 2023 XXXXXXXXX-02 kernel: igb: eth4 NIC Link is Down
May 17 09:36:38 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status down for idle interface eth4, disabling it in 200 ms.
May 17 09:36:38 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110200-2: State remains: DOWN | Reason: Previous problem resolved, Interface bond0 is down (disconnected / link down)
May 17 09:36:38 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status definitely down for interface eth4, disabling it
May 17 09:36:38 2023 XXXXXXXXX-02 xpand[5070]: Configuration changed from localhost by user admin by the service dbset
May 17 09:36:39 2023 XXXXXXXXX-02 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 381.2 and now 384.2
May 17 09:36:40 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
May 17 09:36:40 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-116505-2: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface bond0 is down (disconnected / link down)
May 17 09:36:40 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
May 17 09:36:40 2023 XXXXXXXXX-02 xpand[5070]: Configuration changed from localhost by user admin by the service dbset
May 17 09:36:40 2023 XXXXXXXXX-02 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 17 09:36:40 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status up for interface eth4, enabling it in 200 ms.
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status definitely up for interface eth4.
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110205-2: State remains: ACTIVE! | Reason: Interface eth1 is down (disconnected / link down)
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110205-2: State change: ACTIVE! -> DOWN | Reason: Interface bond0 is down (disconnected / link down)
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110200-2: State remains: DOWN | Reason: Previous problem resolved, Interface eth1 is down (disconnected / link down)
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status down for idle interface eth4, disabling it in 200 ms.
May 17 09:36:41 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110200-2: State remains: DOWN | Reason: Previous problem resolved, Interface bond0 is down (disconnected / link down)
May 17 09:36:42 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status definitely down for interface eth4, disabling it
May 17 09:36:42 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110200-2: State remains: DOWN | Reason: Previous problem resolved, Interface eth1 is down (disconnected / link down)
May 17 09:36:42 2023 XXXXXXXXX-02 xpand[5070]: Configuration changed from localhost by user admin by the service dbset
May 17 09:36:43 2023 XXXXXXXXX-02 kernel: igb: eth4: igb_setup_mrqc: Setting Legacy RSS (Asymmetric
May 17 09:36:44 2023 XXXXXXXXX-02 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 386.2 and now 389.2
May 17 09:36:44 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-216400-2: Remote member 1 (state ACTIVE -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
May 17 09:36:44 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-116505-2: State change: DOWN -> ACTIVE(!) | Reason: All other machines are dead (timeout), Interface eth1 is down (disconnected / link down)
May 17 09:36:44 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
May 17 09:36:45 2023 XXXXXXXXX-02 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 17 09:36:45 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status up for interface eth4, enabling it in 200 ms.
May 17 09:36:45 2023 XXXXXXXXX-02 kernel: bonding: bond0: link status definitely up for interface eth4.
May 17 09:36:46 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-110205-2: State change: ACTIVE! -> DOWN | Reason: Interface eth1 is down (disconnected / link down)
May 17 09:36:46 2023 XXXXXXXXX-02 kernel: [fw4_1];CLUS-214904-2: Remote member 1 (state LOST -> ACTIVE) | Reason: Reason for ACTIVE! alert has been resolved
May 17 09:36:47 2023 XXXXXXXXX-02 kernel: igb: eth4 NIC Link is Down
Just so you are aware: R80.20 has been out of support for a long time.
Before anything else, please check that you do not have another Check Point cluster on the same network. Verify that IGMP snooping is disabled on the switch for all cluster ports, and also make sure your bond is correctly configured on both the Check Point and network sides.
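If it helps, the bond can be sanity-checked quickly from the Check Point side; a minimal sketch (bond0 as listed in your interface list; first command from expert mode, second from clish):
cat /proc/net/bonding/bond0    # bond mode, per-slave MII status and LACP partner details
show bonding groups            # the configured bonding groups as Gaia sees them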
Thanks for the reply.
I know the version I'm using is no longer supported.
But I don't have a test environment, so I can't upgrade immediately.
No other CP products were used in this system.
The network uses IGMP to allow multicast communication.
Disable IGMP snooping on the switch side and see if it makes a difference. Also, post your bond config from the Check Point side here.
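For example, on a Cisco IOS switch it would look something like the below (syntax varies by vendor, and VLAN 20 is only a placeholder for your sync VLAN):
configure terminal
no ip igmp snooping vlan 20
end
show ip igmp snooping vlan 20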
Unit 1:
Network Interface Information:
bond0  Type: Bond  IPv4 Address: 192.168.20.5  Subnet mask: 255.255.255.252
bond0  Interfaces: eth4, eth5  Operation Mode: 802.3ad  Transmit Hash Policy: Layer 3+4  LACP Rate: Slow
show configuration:
add bonding group 0
add bonding group 0 interface eth4
add bonding group 0 interface eth5
set bonding group 0 mode 8023AD
set bonding group 0 lacp-rate slow
set bonding group 0 mii-interval 100
set bonding group 0 down-delay 200
set bonding group 0 up-delay 200
set bonding group 0 xmit-hash-policy layer3+4
set interface bond0 state on
set interface bond0 mtu 1500
set interface bond0 ipv4-address 192.168.20.5 mask-length 30
Unit 2:
192.168.20.6
The device is in production and cannot be changed immediately.
There is no link up/down now.
I would like to know why the link went up and down for about 15 minutes.
Your best bet is to check the SmartConsole logs, as well as the /var/log/messages* files.
You can also try something like the below (example from my lab):
grep -i DOWN /var/log/messages*
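To focus on the cluster transitions and NIC flaps specifically, a narrower pattern should work too (assuming the same log format as your excerpt):
grep -E 'CLUS-|NIC Link is' /var/log/messages*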
Andy
Check if the virtual MAC option is enabled on the cluster object properties under ClusterXL, and also send the output of the commands below from both members.
Andy
cphaprob roles
cphaprob state
cphaprob syncstat
cphaprob list
cphaprob -a if
Thanks for the reply.
> Check if virtual MAC option is enabled on cluster properties object under clusterxl
I checked as you advised, but "Use Virtual MAC" is not checked.
The cphaprob information is below:
XXXXXXXXX-01> cphaprob roles
ID Role
1 (local) Master
2 Non-Master
XXXXXXXXX-01> cphaprob state
Cluster Mode: High Availability (Primary Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 (local) 192.168.20.5 100% ACTIVE XXXXXXXXX-01
2 192.168.20.6 0% STANDBY XXXXXXXXX-02
Active PNOTEs: None
Last member state change event:
Event Code: CLUS-114904
State change: ACTIVE(!) -> ACTIVE
Reason for state change: Reason for ACTIVE! alert has been resolved
Event time: Wed May 17 09:52:38 2023
Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: Member state has been changed due to higher priority of remote cluster member 1 in PRIMARY-UP cluster
Event time: Thu Jun 3 13:53:07 2021
Cluster failover count:
Failover counter: 24
Time of counter reset: Fri Dec 13 15:33:47 2019 (reboot)
XXXXXXXXX-01> cphaprob syncstat
Delta Sync Statistics
Sync status: OK
Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0
Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0
Sent messages:
Total generated sync messages................ 33006557
Sent retransmission requests................. 0
Sent retransmission updates.................. 0
Peak fragments per update.................... 1
Received messages:
Total received updates....................... 14217563
Received retransmission requests............. 0
Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50
Timers:
Delta Sync interval (ms)..................... 100
Reset on Thu Jun 3 13:53:07 2021 (triggered by fullsync).
XXXXXXXXX-01> cphaprob list
There are no pnotes in problem state
XXXXXXXXX-01> cphaprob -a if
CCP mode: Automatic
Required interfaces: 4
Required secured interfaces: 1
eth1 UP non sync(non secured), unicast
eth2 UP non sync(non secured), unicast
eth3 UP non sync(non secured), unicast
Mgmt Non-Monitored non sync(non secured)
bond0 UP sync(secured), unicast, bond Load Sharing
Virtual cluster interfaces: 3
eth1 172.29.13X.3X
eth2 172.29.12X.17X
eth3 172.29.12X.25X
XXXXXXXXX-02> cphaprob roles
ID Role
1 Master
2 (local) Non-Master
XXXXXXXXX-02> cphaprob state
Cluster Mode: High Availability (Primary Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 192.168.20.5 100% ACTIVE XXXXXXXXX-01
2 (local) 192.168.20.6 0% STANDBY XXXXXXXXX-02
Active PNOTEs: None
Last member state change event:
Event Code: CLUS-114802
State change: DOWN -> STANDBY
Reason for state change: There is already an ACTIVE member in the cluster (member 1)
Event time: Wed May 17 10:29:13 2023
Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: Member state has been changed due to higher priority of remote cluster member 1 in PRIMARY-UP cluster
Event time: Thu Jun 3 13:53:07 2021
Cluster failover count:
Failover counter: 24
Time of counter reset: Fri Dec 13 15:33:47 2019 (reboot)
XXXXXXXXX-02> cphaprob syncstat
Delta Sync Statistics
Sync status: OK
Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0
Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0
Sent messages:
Total generated sync messages................ 245550
Sent retransmission requests................. 0
Sent retransmission updates.................. 0
Peak fragments per update.................... 1
Received messages:
Total received updates....................... 466544
Received retransmission requests............. 0
Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50
Timers:
Delta Sync interval (ms)..................... 100
Reset on Wed May 17 09:33:55 2023 (triggered by fullsync).
XXXXXXXXX-02> cphaprob list
There are no pnotes in problem state
XXXXXXXXX-02> cphaprob -a if
CCP mode: Automatic
Required interfaces: 4
Required secured interfaces: 1
eth1 UP non sync(non secured), unicast
eth2 UP non sync(non secured), unicast
eth3 UP non sync(non secured), unicast
Mgmt Non-Monitored non sync(non secured)
bond0 UP sync(secured), unicast, bond Load Sharing
Virtual cluster interfaces: 3
eth1 172.29.13X.3X
eth2 172.29.12X.17X
eth3 172.29.12X.25X
Based on that output, all looks right to me. Just a small suggestion: when it comes to the sync interface, I always tell people to use something from the 169.254.x.x subnet, as that is totally non-routable and there is literally zero chance any of those IPs would be used elsewhere in your network.
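For example, re-addressing the sync bond would look something like the below in clish; just a sketch that mirrors your existing /30, and changing the sync network on a live cluster needs a maintenance window:
set interface bond0 ipv4-address 169.254.1.1 mask-length 30
set interface bond0 ipv4-address 169.254.1.2 mask-length 30
(the first line on member 1, the second on member 2)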
Anyway, having said that, what you sent looks right. Did you confirm the link state on the firewall side for the bond interface? What about the switch?
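Something like the below would confirm it from the firewall side (expert mode; eth4/eth5 as in your bond, and show_bond assuming the R80.20 cphaprob syntax):
cphaprob show_bond bond0    # ClusterXL's view of the bond and its slave states
ethtool eth4                # negotiated speed/duplex and 'Link detected' per slave
ethtool eth5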
Andy
The syslog entries would seem to indicate that the link integrity (green light) on the interface was repeatedly lost for a short period (<1sec), which then caused ClusterXL to mark the interface as down. An interface outage of this short duration is generally caused by a loose cable or speed/duplex negotiation flap. If you haven't rebooted the gateway since the incident, the output of ethtool -S eth4 may shed some light. The logs on the switch around the time of the flap might be helpful too.
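For example, to pull just the error counters out of the long statistics output (counter names vary by driver, so treat the pattern as a starting point):
ethtool -S eth4 | grep -iE 'err|crc|drop|fail'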
I don't think an STP issue on the switch would actually drop link integrity when it stops forwarding traffic due to a possible bridging loop on that switchport, and I don't think switch broadcast suppression/storm control would drop the link either, unless it triggered some kind of errdisable; but I could be wrong.
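If the switch is a Cisco, errdisable is easy to rule out with the below (vendor-specific; other vendors have equivalents):
show interfaces status err-disabled
show errdisable recovery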