Parabol
Contributor

Sync interface DOWN after reboot of Standby Member - any other TShoot options?

Hi all, we have an issue with our VSX HA cluster (two gateways, Active/Standby): whenever the Standby is rebooted, for whatever reason, the Sync interface remains DOWN. When this occurred in the past, physically powering down the Standby restored the link, but a normal reboot does not, and neither does bouncing the link.

We're in the process of eliminating physical problems, in particular by replacing the cable and SFP for this link. But I was wondering whether there are any other troubleshooting steps I could take in the meantime?

[ACTIVE] SYNC (eth3-04) <----> (eth3-04) SYNC [STANDBY]

Currently we have no HA resiliency, all VS are DOWN on the standby which isn't ideal.

Interface counters show no incrementing RX or TX on either side.

cphaprob syncstat does show incrementing SENT sync messages, but no received messages.
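One way to confirm that the counters really are frozen is to sample them twice and compare. The following is a minimal sketch, assuming the gateway exposes the standard Linux sysfs counters under `/sys/class/net` and that Python is available in expert mode; the function names `read_link_state` and `counters_moving` are hypothetical helpers, not Check Point tooling.

```python
import time
from pathlib import Path


def read_link_state(ifname, sysfs_root="/sys/class/net"):
    """Return (operstate, rx_packets, tx_packets) for one interface.

    Reads the standard Linux sysfs files; sysfs_root is a parameter only
    so the function can be pointed at a test directory.
    """
    base = Path(sysfs_root) / ifname
    operstate = (base / "operstate").read_text().strip()
    rx = int((base / "statistics" / "rx_packets").read_text())
    tx = int((base / "statistics" / "tx_packets").read_text())
    return operstate, rx, tx


def counters_moving(ifname, interval=5.0, sysfs_root="/sys/class/net"):
    """Sample the packet counters twice, `interval` seconds apart.

    A delta of 0 in both directions, together with operstate "down",
    is consistent with a dead physical link rather than a filtering
    or routing problem.
    """
    _, rx1, tx1 = read_link_state(ifname, sysfs_root)
    time.sleep(interval)
    state, rx2, tx2 = read_link_state(ifname, sysfs_root)
    return {"operstate": state, "rx_delta": rx2 - rx1, "tx_delta": tx2 - tx1}
```

Calling `counters_moving("eth3-04")` on each member would return the link state and the two deltas for the sync interface named in this thread.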

My theory is that the SFP/transceiver is faulty: perhaps the SFP doesn't lose power during a normal reboot but does during a full physical power-down, which might be what brings the link back up. I'm not sure.

I appreciate any thoughts!

 

 

11 Replies
TronNQ
Participant

Could you run the commands below and share the results?

cphaprob stat

cphaprob -a if

tcpdump -nni <sync interface name> port 8116

Parabol
Contributor

Thanks for the reply Tron, please see below (I omitted some details like hostname/IP).

Even running tcpdump without a port specified shows no packets at all on the interface, so it seems the link is completely dead, which makes me think it must be a physical issue.

 

Standby_Gateway:0> cphaprob stat

Cluster Mode: Virtual System Load Sharing (Primary Up)

ID Unique Address Assigned Load State Name

1 x.x.x.x 100% ACTIVE(!) Primary_Gateway
2 (local) x.x.x.x 0% DOWN Standby_Gateway


Active PNOTEs: IAC

Last member state change event:
Event Code: CLUS-110205
State change: ACTIVE(!) -> DOWN
Reason for state change: Interface eth3-04 is down (disconnected / link down)
Event time: Mon Aug 7 13:39:34 2023

Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Available on member 1
Event time: Mon Aug 7 13:39:01 2023

Cluster failover count:
Failover counter: 7
Time of counter reset: Tue Sep 6 17:00:37 2022 (reboot)


Cluster name: Cluster

Virtual Devices Status on each Cluster Member
=============================================

  ID   | Weight | Primary   | Standby
       |        |           | [local]
-------+--------+-----------+-----------
  2    | 10     | ACTIVE(!) | DOWN
  3    | 10     | ACTIVE(!) | DOWN
----------------+-----------+-----------
Active          | 2         | 0
Weight          | 20        | 0
Weight (%)      | 100       | 0

Legend: Init - Initializing, Active! - Active Attention
Down! - ClusterXL Inactive or Virtual System is Down

 


Standby_Gateway:0> cphaprob -a if

vsid 0:
------
CCP mode: Manual (Unicast)
Required interfaces: 1
Required secured interfaces: 0


Interface Name: Status:

eth1-01 UP
eth3-04 (S) DOWN (72062 secs)

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 1

eth1-01 x.x.x.x

 


[Expert@Standby_Gateway:0]# tcpdump -nni eth3-04 port 8116
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth3-04, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
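For anyone who wants to track this state over time, the `cphaprob -a if` output pasted above can be parsed with a few lines of Python. This is only a sketch: the regular expression assumes the line layout shown in this thread (`name [flags] UP|DOWN [(N secs)]`), which may differ between versions, and `parse_cphaprob_if` and `format_duration` are hypothetical helper names.

```python
import re

# Matches lines such as "eth1-01 UP" or "eth3-04 (S) DOWN (72062 secs)".
# The layout is assumed from the output pasted in this thread.
IF_LINE = re.compile(
    r"^\s*(\S+)((?:\s+\([A-Z/]+\))*)\s+(UP|DOWN)(?:\s+\((\d+)\s+secs\))?\s*$"
)


def parse_cphaprob_if(text):
    """Return one dict per interface status line found in the output."""
    rows = []
    for line in text.splitlines():
        m = IF_LINE.match(line)
        if not m:
            continue  # headers, legend, and address lines are skipped
        name, flags, status, secs = m.groups()
        rows.append({
            "name": name,
            "sync": "(S)" in flags,  # (S) marks the sync interface
            "status": status,
            "down_secs": int(secs) if secs else None,
        })
    return rows


def format_duration(secs):
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, rem = divmod(secs, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


# Status lines copied from the output above.
SAMPLE = """\
Interface Name: Status:

eth1-01 UP
eth3-04 (S) DOWN (72062 secs)

S - sync, HA/LS - bond type, LM - link monitor, P - probing
"""
```

With the sample above, `parse_cphaprob_if(SAMPLE)` reports `eth3-04` as the sync interface in state DOWN, and `format_duration(72062)` shows that it had been down for about 20 hours when the output was captured.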

 

 

 

 

TronNQ
Participant

Thanks for your response.

From the information you provided, we can see:

Interface eth3-04 is down (disconnected / link down)

This is what puts the member's HA state into DOWN. Let's check what this interface is and where it is physically connected. Does it go through any switches?

Are there any previous changes?

Ruan_Kotze
Advisor

How are the links cabled? Are the gateways directly connected to each other (not recommended) or via a switch?

My preferred way is to have two sync interfaces in a non-LACP bond (e.g. round robin works) going to two separate switches.

Parabol
Contributor

True, there are definitely switches between them; they are not directly connected, so this could also be a factor.

TronNQ
Participant

Dear Bro,

Please check the status of the physical interface, or compare the VLAN access settings for that interface.

CheckPointerXL
Advisor

Why is a direct cable between the firewalls not recommended? In my opinion, a switch is an added point of failure.

TronNQ
Participant

I think this depends on the user's needs. You can plug the cable directly between the two devices as long as both are in the same rack.

If the devices are located in two different racks, going through a switch keeps the cabling tidier and makes it easier to change cables when there is a problem at the physical layer.

Ruan_Kotze
Advisor

First off, there is Check Point's guidance on supported topologies for the sync network. Note that every one of them specifies a switch.

I could build out a couple of failure scenarios, but @Bob_Zimmerman has already done a better job of it than I can in a CheckMates post.

If you are concerned about a switch being a single point of failure, then it is likely a SPOF for other things in your environment as well. Solve this issue with two sync interfaces in a non-LACP bond (e.g. round robin works) going to two separate switches.

Parabol
Contributor

Hi all, to confirm: it was a faulty SFP, so indeed a physical issue. Check Point approved an RMA for the SFP, and the replacement SFP brought the link back online.

Thanks all for your assistance.

TronNQ
Participant

It is good news =))
