saitoh
Collaborator

Standby member of ClusterXL sometimes goes down for a few seconds

Hi all,

 

This is urgent and any comments are more than welcome.

 

Env:

2 GWs (9100), R81.20T99, ClusterXL in High Availability mode, no preemption

6 physical interfaces are grouped into 3 bond interfaces

3 802.3ad bond interfaces: WAN (eth1-01, eth1-02), LAN (eth1-03, eth1-04), Sync (eth1, eth2)

Failover is configured to trigger if any one bond interface becomes unavailable

 

Advanced bond settings are left at their defaults

The additional NICs run at a link speed of 10 Gbps

 

I was testing whether the failover functionality works as configured.

To test it, I unplugged cables one at a time and checked the cluster state with # cphaprob state.

After that, I unplugged all the cables belonging to one bond interface to make it go down,

and observed whether failover occurred or not.
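
One way to catch a dip that lasts only a few seconds is to poll the state rather than check it by hand. The following is a minimal sketch, not Check Point tooling. It assumes expert mode (bash) on the gateway and the column layout of the cphaprob state output pasted further down in this thread, where the line marked (local) carries the state in its fifth column. The function names local_cluster_state and watch_local_state are made up for illustration.

```shell
# Print the local member's state, reading `cphaprob state` output on stdin.
# Assumption: the local member's line contains "(local)" and the state is
# the 5th whitespace-separated field, as in the output shown in this thread.
local_cluster_state() {
    awk '/\(local\)/ { print $5; exit }'
}

# Poll once per second and print a timestamped line on every state change.
# Stop with Ctrl-C.
watch_local_state() {
    local prev="" cur
    while true; do
        cur=$(cphaprob state | local_cluster_state)
        if [ "$cur" != "$prev" ]; then
            echo "$(date '+%F %T') ${prev:-?} -> $cur"
            prev=$cur
        fi
        sleep 1
    done
}
```

Running watch_local_state on the standby member while pulling cables prints one line per state change. A one-second poll can still miss a shorter dip, so the "Last member state change event" section of cphaprob state remains worth checking afterwards.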

 

Most of the tests went well.

However, I observed the standby member occasionally go down for a few seconds

when only one of the member interfaces of a bond interface was unplugged.

It returned to the standby state quickly.

Also, it takes the cluster nearly 20 seconds to change its state from active to down to standby

when forcing a failover by unplugging all cables from the physical members of a bond interface.

This is contrary to my understanding.

 

My hypothesis is that the LACP link failover time and the ClusterXL failover time are poorly aligned.

Your thoughts?

 

Saitoh

sliver bullet: casting repero or tossing it into the harbor
3 Solutions

Accepted Solutions
the_rock
Legend

Hey Saitoh,

First off, if it's urgent, I suggest calling TAC and opening a case to see if you can speak with someone. Second, would you mind sending the outputs of the commands below from both members?

cphaprob roles

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat
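
To gather all of these in one go on each member, something like the following could be used. It is only a sketch: it assumes expert mode (bash), and the function name collect_cluster_diag and the output path in the usage line are illustrative, not part of any Check Point tool.

```shell
# Save the output of the cphaprob commands listed above into one file.
# Assumption: cphaprob is on the PATH (expert mode on the gateway).
collect_cluster_diag() {
    local out="$1" args
    : > "$out" || return 1          # create or truncate the output file
    for args in "roles" "state" "-a if" "-i list" "-l list" "syncstat"; do
        {
            echo "===== cphaprob $args ====="
            # $args is left unquoted on purpose so "-a if" becomes two words
            cphaprob $args 2>&1
            echo
        } >> "$out"
    done
}
```

Example call, run once on each member: collect_cluster_diag "/var/tmp/cluster_diag_$(hostname).txt"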

Andy

AkosBakos
Mentor

How many members have you got in the LACP bond?

What does cat /proc/net/bonding/bond<X> say?

Is everything OK? What is the "churned" state?
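
As a rough aid for reading that file, the sketch below scans the bonding status text for two hints that LACP negotiation is not working: a partner MAC address of all zeros, and any churn state reported as "churned". It assumes the field names printed by the Linux bonding driver in 802.3ad mode ("Partner Mac Address:", "Actor Churn State:", "Partner Churn State:"), which can differ between kernel versions. The function name check_lacp_status is made up for illustration, and a clean result does not prove the switch side is configured correctly.

```shell
# Read /proc/net/bonding/<bond> text on stdin and print simple LACP hints.
# Exit status is 1 if a warning was printed, 0 otherwise.
check_lacp_status() {
    awk '
        # No LACP partner has answered if the partner MAC is all zeros.
        /Partner Mac Address:/ && $NF == "00:00:00:00:00:00" { nopartner = 1 }
        # Matches both "Actor Churn State:" and "Partner Churn State:" lines.
        /Churn State:/ && $NF == "churned" { churned++ }
        END {
            if (nopartner) print "WARN: partner MAC is all zeros (no LACP partner answered)"
            if (churned)   print "WARN: " churned " churn state line(s) report churned"
            if (!nopartner && !churned) print "OK: no obvious LACP problem found"
            exit (nopartner || churned)
        }'
}
```

Example call for one bond: check_lacp_status < /proc/net/bonding/bond1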

----------------
\m/_(>_<)_\m/

saitoh
Collaborator

Dear all who helped me figure out what the problem is,

 

I finished the investigation and found out what causes this behaviour.

The problem was simply the switches: they were not configured for LACP! 😞

No port-channel on the switches, so no surprise. Nothing technical to intrigue you all. Sadge.

 

Anyway, my wholehearted thanks go to @the_rock , @Chris_Atkinson , @AkosBakos ,

who showed me how to check the status associated with 802.3ad.

Thanks to you all, I was able to pin down the problem and have asked the admin to set up the switches.

 

Saitoh

12 Replies
saitoh
Collaborator

Dear @the_rock ,

 

Thanks for your comments.

The outputs are listed below, with some details masked for privacy.

I cannot reach the cluster because it is in production, so I captured them in a lab environment running the same scenario.

GW2> cphaprob roles

ID         Role

1          Master
2 (local)  Non-Master

GW2> cphaprob state


Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name                                

1          xxx.xxx.xxx.65    100%            ACTIVE         GW1
2 (local)  xxx.xxx.xxx.66    0%              STANDBY        GW2


Active PNOTEs: None

Last member state change event:
   Event Code:                 CLUS-114802
   State change:               DOWN -> STANDBY
   Reason for state change:    There is already an ACTIVE member in the cluster (member 1)
   Event time:                 Fri May  9 18:03:38 2025

Last cluster failover event:
   Transition to new ACTIVE:   Member 2 -> Member 1
   Reason:                     Interface bond1 is down (disconnected / link down)
   Event time:                 Fri May  9 18:01:37 2025

Cluster failover count:
   Failover counter:           9
   Time of counter reset:      Fri May  9 14:51:04 2025 (reboot)


GW2>
GW2> cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 3
Required secured interfaces: 1


Interface Name:      Status:

bond1 (LS)           UP
bond2 (LS)           UP
bond3 (S-LS)         UP

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 2

bond1           xxx.xxx.xxx.30
bond2           xxx.xxx.xxx.254

GW2> cphaprob -i list


There are no pnotes in problem state

GW2>
GW2> cphaprob -l list

Built-in Devices:

Device Name: Interface Active Check
Current state: OK

Device Name: Recovery Delay
Current state: OK

Device Name: CoreXL Configuration
Current state: OK

Registered Devices:

Device Name: Fullsync
Registration number: 0
Timeout: none
Current state: OK
Time since last report: 293476 sec

Device Name: Policy
Registration number: 1
Timeout: none
Current state: OK
Time since last report: 293476 sec

Device Name: routed
Registration number: 2
Timeout: none
Current state: OK
Time since last report: 287426 sec

Device Name: cxld
Registration number: 3
Timeout: 30 sec
Current state: OK
Time since last report: 293502 sec
Process Status: UP

Device Name: fwd
Registration number: 4
Timeout: 30 sec
Current state: OK
Time since last report: 293502 sec
Process Status: UP

Device Name: cphad
Registration number: 5
Timeout: 30 sec
Current state: OK
Time since last report: 293497 sec
Process Status: UP

Device Name: Init
Registration number: 6
Timeout: none
Current state: OK
Time since last report: 293492 sec

Device Name: Local Probing
Registration number: 7
Timeout: none
Current state: OK
Time since last report: 286947 sec

Device Name: DSD
Registration number: 8
Timeout: none
Current state: OK
Time since last report: 293475 sec

GW2> cphaprob syncstat


Delta Sync Statistics

Sync status: OK

Drops:
Lost updates.................................  0
Lost bulk update events......................  0
Oversized updates not sent...................  0

Sync at risk:
Sent reject notifications....................  0
Received reject notifications................  0

Sent messages:
Total generated sync messages................  55656
Sent retransmission requests.................  10
Sent retransmission updates..................  42
Peak fragments per update....................  1

Received messages:
Total received updates.......................  33658
Received retransmission requests.............  10

Sync Interface:
Name.........................................  bond3
Link speed...................................  2000Mb/s
Rate.........................................  15790 [Bps]
Peak rate....................................  15790 [Bps]
Link usage...................................   0%
Total........................................  4422  [MB]

Queue sizes (num of updates):
Sending queue size...........................  512
Receiving queue size.........................  256
Fragments queue size.........................  50

Timers:
Delta Sync interval (ms).....................  100

Reset on Fri May  9 16:14:48 2025 (triggered by fullsync).
the_rock
Legend

Looks okay to me. Can you see if what I attached matches?

Andy

 

saitoh
Collaborator

Thanks very much for the further comment!

I can confirm that the settings are the same as in the screenshot.

the_rock
Legend

This might need further investigation... did you open a TAC case for it?

Andy

saitoh
Collaborator

Dear @AkosBakos ,

 

Thanks for your comments. Each bond has two physical interface members.

I will check them in the morning. I cannot reach the cluster right now because it is in production, located in a data center.

Would you mind telling me what you are concerned about?

I am not used to dealing with LACP, so I would very much appreciate it if you could enlighten me.

 

 

Chris_Atkinson
Employee

How quickly are you plugging / unplugging cables - how is portfast configured here?

CCSM R77/R80/ELITE
saitoh
Collaborator

Dear @Chris_Atkinson ,

 

I am sorry that I forgot to mention it. It is set to fast.

The cables were unplugged and plugged back in at a normal pace.

Not quickly, but not hesitantly either.

 

What kind of handling could lead to this behaviour, for example?

saitoh
Collaborator

P.S.

It was always the standby member that went down, not the active one.

the_rock
Legend

Great job @saitoh

Andy
