Solved: Re: standby member of clusterxl sometimes goes down for a few seconds

saitoh · ‎2025-05-12

Hi all,

This is urgent and any comments are more than welcome.

Env:

2 GWs (9100), R81.20T99, ClusterXL in High Availability mode, no preemption

6 physical interfaces are grouped into 3 bond interfaces

The 3 802.3ad bond interfaces are: WAN (eth1-01, eth1-02), LAN (eth1-03, eth1-04), Sync (eth1, eth2)

Failover is configured to trigger if any one bond interface becomes unavailable

Advanced bond settings are left at their defaults

The additional NICs run at a link speed of 10 Gbps

I was testing whether failover works as configured.

To test it, I unplugged cables one at a time and checked the cluster state with # cphaprob state.

Then I unplugged all cables belonging to one bond interface to bring that bond down,

and observed whether failover occurred.

Most of the tests went well.

However, I observed the standby member occasionally going down for a few seconds

when only one of the member interfaces of a bond interface was unplugged.

It went back to the standby state quickly.

Also, it takes the cluster nearly 20 seconds to change its state from active -> down -> standby

when I force a failover by unplugging all cables from the physical members of a bond interface.

This goes against my understanding.

My hypothesis is that the LACP link failover time and the ClusterXL failover time interact badly.

Your thoughts?

Saitoh

sliver bullet: casting repero or tossing it into the harbor
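
For anyone reproducing the cable-pull test above, a minimal sketch for timestamping state changes of the local member might look like the following. It assumes the `cphaprob state` column layout shown later in this thread (the state is the second-to-last field on the `(local)` line); the function names `local_member_state` and `watch_local_state` are hypothetical, not Check Point tools.

```shell
#!/bin/bash
# Extract the state of the local member from `cphaprob state` output on stdin.
# Assumption: the "(local)" line ends with "<STATE> <NAME>", as in this thread.
local_member_state() {
    awk '/\(local\)/ {print $(NF-1)}'
}

# Print a timestamped line every time the local state changes.
# Polls once per second and runs until interrupted with Ctrl-C.
watch_local_state() {
    prev=""
    while true; do
        cur=$(cphaprob state | local_member_state)
        if [ "$cur" != "$prev" ]; then
            printf '%s %s\n' "$(date '+%H:%M:%S')" "$cur"
            prev=$cur
        fi
        sleep 1
    done
}
```

Running `watch_local_state` in expert mode on each member while pulling cables gives a rough timeline, with one-second resolution, of how long a member stays in each state.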

the_rock · ‎2025-05-12

Hey Saitoh,

First off, if it's urgent, I suggest calling TAC and opening a case so you can speak with someone. Second, would you mind sending the outputs of the commands below from both members?

cphaprob roles

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat

Andy
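
To gather the same set of outputs from both members in one go, a small loop along these lines could be used. This is only a sketch: the function name `collect_cpha` and the `RUNNER` variable are hypothetical helpers introduced for illustration, and the command list is just the one requested above.

```shell
#!/bin/bash
# Commands requested in the thread, one per line.
CPHA_CMDS='cphaprob roles
cphaprob state
cphaprob -a if
cphaprob -i list
cphaprob -l list
cphaprob syncstat'

# Run every command and print its output under a header line.
# RUNNER is a hypothetical hook: leave it empty on a gateway, or set it
# to "echo" for a dry run on a machine that has no cphaprob.
collect_cpha() {
    printf '%s\n' "$CPHA_CMDS" | while IFS= read -r cmd; do
        printf '===== %s =====\n' "$cmd"
        # Word splitting of $cmd is intentional; stdin is detached so the
        # command cannot swallow the remaining list.
        $RUNNER $cmd 2>&1 < /dev/null
    done
}

# Example (on each member, in expert mode):
#   collect_cpha > /var/tmp/cpha_$(hostname).txt
```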

saitoh · ‎2025-05-12

Dear @the_rock ,

Thanks for your comments.

The outputs are listed below, with some details masked for privacy.

I cannot reach the cluster since it is in production, so I took them in a lab environment running the same scenario.

GW2> cphaprob roles

ID Role

1 Master

2 (local) Non-Master

GW2> cphaprob state

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID Unique Address Assigned Load State Name

1 xxx.xxx.xxx.65 100% ACTIVE GW1

2 (local) xxx.xxx.xxx.66 0% STANDBY GW2

Active PNOTEs: None

Last member state change event:

Event Code: CLUS-114802

State change: DOWN -> STANDBY

Reason for state change: There is already an ACTIVE member in the cluster (member 1)

Event time: Fri May 9 18:03:38 2025

Last cluster failover event:

Transition to new ACTIVE: Member 2 -> Member 1

Reason: Interface bond1 is down (disconnected / link down)

Event time: Fri May 9 18:01:37 2025

Cluster failover count:

Failover counter: 9

Time of counter reset: Fri May 9 14:51:04 2025 (reboot)

GW2>

GW2> cphaprob -a if

CCP mode: Manual (Unicast)

Required interfaces: 3

Required secured interfaces: 1

Interface Name: Status:

bond1 (LS) UP

bond2 (LS) UP

bond3 (S-LS) UP

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 2

bond1 xxx.xxx.xxx.30

bond2 xxx.xxx.xxx.254

GW2> cphaprob -i list

There are no pnotes in problem state

GW2>

GW2> cphaprob -l list

Built-in Devices:

Device Name: Interface Active Check

Current state: OK

Device Name: Recovery Delay

Current state: OK

Device Name: CoreXL Configuration

Current state: OK

Registered Devices:

Device Name: Fullsync

Registration number: 0

Timeout: none

Current state: OK

Time since last report: 293476 sec

Device Name: Policy

Registration number: 1

Timeout: none

Current state: OK

Time since last report: 293476 sec

Device Name: routed

Registration number: 2

Timeout: none

Current state: OK

Time since last report: 287426 sec

Device Name: cxld

Registration number: 3

Timeout: 30 sec

Current state: OK

Time since last report: 293502 sec

Process Status: UP

Device Name: fwd

Registration number: 4

Timeout: 30 sec

Current state: OK

Time since last report: 293502 sec

Process Status: UP

Device Name: cphad

Registration number: 5

Timeout: 30 sec

Current state: OK

Time since last report: 293497 sec

Process Status: UP

Device Name: Init

Registration number: 6

Timeout: none

Current state: OK

Time since last report: 293492 sec

Device Name: Local Probing

Registration number: 7

Timeout: none

Current state: OK

Time since last report: 286947 sec

Device Name: DSD

Registration number: 8

Timeout: none

Current state: OK

Time since last report: 293475 sec

GW2> cphaprob syncstat

Delta Sync Statistics

Sync status: OK

Drops:

Lost updates................................. 0

Lost bulk update events...................... 0

Oversized updates not sent................... 0

Sync at risk:

Sent reject notifications.................... 0

Received reject notifications................ 0

Sent messages:

Total generated sync messages................ 55656

Sent retransmission requests................. 10

Sent retransmission updates.................. 42

Peak fragments per update.................... 1

Received messages:

Total received updates....................... 33658

Received retransmission requests............. 10

Sync Interface:

Name......................................... bond3

Link speed................................... 2000Mb/s

Rate......................................... 15790 [Bps]

Peak rate.................................... 15790 [Bps]

Link usage................................... 0%

Total........................................ 4422 [MB]

Queue sizes (num of updates):

Sending queue size........................... 512

Receiving queue size......................... 256

Fragments queue size......................... 50

Timers:

Delta Sync interval (ms)..................... 100

Reset on Fri May 9 16:14:48 2025 (triggered by fullsync).

sliver bullet: casting repero or tossing it into the harbor

the_rock · ‎2025-05-12

Looks okay to me. Can you see if what I attached matches?

Andy

saitoh · ‎2025-05-12

Thanks a lot for the further comment!

I can confirm those settings are the same as the screenshot shows.

sliver bullet: casting repero or tossing it into the harbor

the_rock · ‎2025-05-12

This might need further investigation... did you open a TAC case for it?

Andy

AkosBakos · ‎2025-05-12

How many members have you got in the LACP BOND?

What does cat /proc/net/bonding/bond<X> say?

Is everything OK? What is the "churned" state?

----------------
\m/_(>_<)_\m/
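
As a rough illustration of what to look for in that file, the sketch below pulls out the LACP-related lines and flags an all-zero partner MAC, which a Linux bond in 802.3ad mode typically shows when the switch is not sending LACPDUs. The field names follow the standard Linux bonding status file; the function names `lacp_fields` and `no_lacp_partner` are hypothetical, and whether the churn lines are present depends on the kernel in use.

```shell
#!/bin/bash
# Print the fields that show whether an LACP partner answered.
# $1 is the bonding status file, e.g. /proc/net/bonding/bond1.
lacp_fields() {
    grep -E 'Slave Interface|Aggregator ID|Partner Mac Address|Churn State' "$1"
}

# Return 0 (true) when the file reports an all-zero partner MAC.
# Assumption: an all-zero partner MAC means no LACP partner was learned.
no_lacp_partner() {
    grep -q 'Partner Mac Address: 00:00:00:00:00:00' "$1"
}

# Example:
#   lacp_fields /proc/net/bonding/bond1
#   no_lacp_partner /proc/net/bonding/bond1 && echo "bond1: no LACP partner seen"
```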

saitoh · ‎2025-05-12

Dear @AkosBakos ,

Thanks for your comments. Each bond has two physical interface members.

I will check them in the morning. I cannot reach the cluster right now because it is in production, located in a data center.

Would you mind telling me what you are concerned about?

I am not used to dealing with LACP, so it would be much appreciated if you could enlighten me.

sliver bullet: casting repero or tossing it into the harbor

Chris_Atkinson · ‎2025-05-12

How quickly are you plugging / unplugging cables - how is portfast configured here?

CCSM R77/R80/ELITE

saitoh · ‎2025-05-12

Dear @Chris_Atkinson ,

I am sorry that I forgot to mention it. It is set to fast.

The cables were unplugged and plugged back in at a normal pace.

Not quickly, but not especially slowly either.

What kind of handling could lead to this behaviour, for example?

sliver bullet: casting repero or tossing it into the harbor

saitoh · ‎2025-05-12

P.S.

It was always the standby member that went down, not the active one.

sliver bullet: casting repero or tossing it into the harbor

saitoh · ‎2025-05-13

Dear all who helped me figure out what the problem is,

I finished the investigation and succeeded in identifying what causes this behaviour.

The problem was simply the switches: they were not configured for LACP! 😞

No port-channel on the switches, so no surprise. Nothing technical to intrigue you all. Sadge.

Anyway, my wholehearted appreciation goes to @the_rock , @Chris_Atkinson , @AkosBakos ,

who showed me how to check the status associated with 802.3ad.

Thanks to you all, I was able to sort out the problem and asked the admin to configure the switches.

Saitoh

sliver bullet: casting repero or tossing it into the harbor
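
One commonly reported symptom of this misconfiguration on the gateway side is that each slave of the bond ends up in its own aggregator, so only one link is really in use. A minimal sketch for spotting that, assuming the standard Linux bonding status file layout (the function name `distinct_aggregators` is hypothetical):

```shell
#!/bin/bash
# Count how many different Aggregator IDs the slaves of a bond report.
# $1 is the bonding status file, e.g. /proc/net/bonding/bond1.
# Only "Aggregator ID" lines that follow a "Slave Interface" line are counted,
# so the "Active Aggregator Info" section at the top is ignored.
distinct_aggregators() {
    awk '/^Slave Interface:/ {in_slave = 1}
         in_slave && /Aggregator ID:/ {seen[$3] = 1}
         END {n = 0; for (k in seen) n++; print n}' "$1"
}

# Example: a result greater than 1 suggests the slaves did not join
# a single aggregator.
#   distinct_aggregators /proc/net/bonding/bond1
```

If the slaves really are split this way, pulling the cable of the one active slave could plausibly take the whole bond down for a moment, which would be consistent with the brief DOWN state described in this thread.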

the_rock · ‎2025-05-13

Great job @saitoh

Andy
