Solved: Re: Sync interface DOWN after reboot of Standby Member - any other TShoot options?

Parabol · ‎2023-08-08

Hi all, we have an issue with our VSX HA cluster (two gateways, Active/Standby): after the Standby is rebooted for whatever reason, the Sync interface remains DOWN. In the past when this occurred, a physical power-down of the Standby restored the link, but a normal reboot does not (nor does bouncing the link).

We're in the process of eliminating physical problems, in particular by replacing the cable and SFP for this link. But I was wondering if there are any other troubleshooting steps I could do in the meantime?

[ACTIVE] SYNC (eth3-04) <----> (eth3-04) SYNC [STANDBY]

Currently we have no HA resiliency, all VS are DOWN on the standby which isn't ideal.

Interface counters show no incrementing RX or TX on either side.

cphaprob syncstat does show incrementing SENT sync messages, but no received messages.

My theory is maybe the SFP/Transceiver is faulty, and perhaps in a normal reboot the SFP doesn't lose power, but in a full physical power down it does? Which maybe causes the link to come back up, I'm not sure..

I appreciate any thoughts!

MarcuzShinz · ‎2023-08-08

Could you run the commands below and share the results:

cphaprob stat

cphaprob -a if

tcpdump -nni <name interface sync> port 8116

Parabol · ‎2023-08-08

Thanks for the reply Tron, please see below (I omitted some details like hostname/IP).

Even running tcpdump without port specified shows no packets at all on the interface.. so it seems the link is completely dead which makes me think it must be a physical issue.

Standby_Gateway:0> cphaprob stat

Cluster Mode: Virtual System Load Sharing (Primary Up)

ID Unique Address Assigned Load State Name

1 x.x.x.x 100% ACTIVE(!) Primary_Gateway
2 (local) x.x.x.x 0% DOWN Standby_Gateway

Active PNOTEs: IAC

Last member state change event:
Event Code: CLUS-110205
State change: ACTIVE(!) -> DOWN
Reason for state change: Interface eth3-04 is down (disconnected / link down)
Event time: Mon Aug 7 13:39:34 2023

Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Available on member 1
Event time: Mon Aug 7 13:39:01 2023

Cluster failover count:
Failover counter: 7
Time of counter reset: Tue Sep 6 17:00:37 2022 (reboot)

Cluster name: Cluster

Virtual Devices Status on each Cluster Member
=============================================

ID     | Weight| Primary   | Standby
       |       |           |
       |       |           |
       |       |           | [local]
-------+-------+-----------+-----------
2      | 10    | ACTIVE(!) | DOWN
3      | 10    | ACTIVE(!) | DOWN
---------------+-----------+-----------
Active         | 2         | 0
Weight         | 20        | 0
Weight (%)     | 100       | 0

Legend: Init - Initializing, Active! - Active Attention
Down! - ClusterXL Inactive or Virtual System is Down

Standby_Gateway:0> cphaprob -a if

vsid 0:
------
CCP mode: Manual (Unicast)
Required interfaces: 1
Required secured interfaces: 0

Interface Name: Status:

eth1-01 UP
eth3-04 (S) DOWN (72062 secs)

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 1

eth1-01 x.x.x.x

[Expert@Standby_Gateway:0]# tcpdump -nni eth3-04 port 8116
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth3-04, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
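
To keep an eye on this from a script rather than by reading the output manually, something like the following could work. It is only a minimal sketch: `down_interfaces` is a hypothetical helper name, and it assumes the `cphaprob -a if` layout pasted above (interface name first, then optional flags, then the status), which may differ between versions.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every interface that
# `cphaprob -a if` reports as DOWN. Reads the command output on stdin.
# Assumes the "name [flags] STATUS (N secs)" layout shown in this thread.
down_interfaces() {
    awk '
        # Only look at rows that start with an interface name (eth3-04, bond1, ...).
        $1 ~ /^(eth|bond)/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "DOWN") { print $1; break }
            }
        }
    '
}

# On the gateway (expert mode) this would be used as:
#   cphaprob -a if | down_interfaces
```

Fed the output above, this prints only `eth3-04`, the sync interface.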

MarcuzShinz · ‎2023-08-08

Thanks for your response.

From the information you provided, we can see:

Interface eth3-04 is down (disconnected / link down)

This is what causes the HA status to show DOWN. Let's check what this interface is and where it is physically connected. Does it go through any switches?

Were there any recent changes?

Ruan_Kotze · ‎2023-08-08

How are the links cabled? Are the gateways directly connected to each other (not recommended) or via a switch?

My preferred way is to have two sync interfaces in a non-LACP bond (e.g. round robin works) going to two separate switches.

Parabol · ‎2023-08-08

True, there are definitely switches between them; they are not directly connected, so this could also be a factor.

MarcuzShinz · ‎2023-08-08

Dear Bro,

Please check the status of the physical interface, and compare the access VLAN configured for that interface.
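
For the physical-interface check, one option from expert mode is `ethtool`. The snippet below is a sketch under assumptions: `link_state` is a hypothetical helper that only interprets the `Link detected:` line of standard `ethtool` output, and whether `ethtool -m` returns SFP module data depends on the NIC driver.

```shell
#!/bin/bash
# Hypothetical helper: reads `ethtool <iface>` output on stdin and prints
# "up", "down", or "unknown" based on the "Link detected:" line.
link_state() {
    awk -F': *' '
        /Link detected/ {
            found = 1
            if ($2 == "yes") print "up"; else print "down"
        }
        END { if (!found) print "unknown" }
    '
}

# On the gateway (expert mode), for the sync interface in this thread:
#   ethtool eth3-04 | link_state
#   ethtool -m eth3-04    # SFP module diagnostics, only if the driver supports it
#   ethtool -S eth3-04    # NIC statistics, to look for error counters
```

Running the same checks on both members and on the switch ports in between helps narrow down which segment of the path has lost link.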

CheckPointerXL · ‎2023-08-08

Why is a direct cable between the FWs not recommended? In my opinion, a switch is an added point of failure.

MarcuzShinz · ‎2023-08-08

I think this comes down to what the user needs. You can plug the cable directly between the two devices as long as both are in the same rack.

If the devices are located in two different racks, going through a switch keeps the cabling tidier and makes it easier to change cables when there is a problem at the physical layer.

Ruan_Kotze · ‎2023-08-08

First off, there is Check Point's guidance on supported topologies for the sync network. Note that every one of them specifies a switch.

I could build out a couple of failure scenarios, but @Bob_Zimmerman has already done a better job of it than I could in this CheckMates post.

If you are concerned about a switch being a single point of failure, then it is likely a SPOF for other things in your environment as well. Solve this issue with two sync interfaces in a non-LACP bond (e.g. round robin works) going to two separate switches.

Parabol · ‎2023-08-22

Hi all, to confirm: it was a faulty SFP, so it was indeed a physical issue. Check Point approved an RMA for the SFP, and the replacement SFP brought the link back online.

Thanks all for your assistance.

MarcuzShinz · ‎2023-08-22

It is good news =))

Sync interface DOWN after reboot of Standby Member - any other TShoot options?