Contributor

Cluster failover after every change on PBR, routing, OSPF,...


Hello,

I have two 5600 appliances running in HA on R80.30, but the behaviour was already present on R80.10.

Every time I change the PBR settings (adding a table etc.), add an interface, or add an OSPF route redistribution, the node I am working on gets degraded to DOWN. If I do this on the primary node, a failover occurs.

 

 

May  7 10:08:22 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

May  7 10:08:22 2020 DETKDUSIPS09 kernel: [fw4_1];[xxxxxxxxx:46770 -> xxxxxxxxxxxx] [ERROR]: cmik_loader_fw_context_match_cb: match_cb for CMI APP 3 failed on context 56, executing context 366 and adding the app to apps in exception

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: [routed] NOTICE:  task_cmd_init(143): command subsystem initialized.

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: [routed] NOTICE:  Start routed[27001] version routed-12.30.2019-11:21:08 instance 0

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: routed_syslog_on: tracing to "/var/log/routed_messages" started

May  7 10:08:23 2020 DETKDUSIPS09 kernel: Passive ARP hook already uninstalled!

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];Global param: set int fwha_cbs_which_member_is_running_gated to '0'

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE ON

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-111700-1: State change: ACTIVE -> DOWN | Reason: ROUTED PNOTE

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-214704-1: Remote member 2 (state STANDBY -> ACTIVE) | Reason: No other ACTIVE members have been found in the cluster

May  7 10:08:23 2020 DETKDUSIPS09 routed[12380]: [routed] ERROR:   recv(header) returns 0

May  7 10:08:24 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: ROUTED PNOTE

May  7 10:08:25 2020 DETKDUSIPS09 xpand[10387]: admin localhost t -volatile:configurationChange

May  7 10:08:25 2020 DETKDUSIPS09 xpand[10387]: admin localhost t -volatile:configurationSave

May  7 10:08:29 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE OFF

May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)

May  7 10:08:38 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

 

Does anybody have an idea what is causing this?

 

Thanks

Frank

 

Contributor

Hello Heiko,

thanks for the reply, I checked the links.

The problem is that the PNOTE only lasts a few seconds.

PNOTE ON at 10:08:23 degrades the member from ACTIVE to DOWN. Then, without any interaction, the state changes from DOWN to STANDBY (because the other node is now active) only 8 seconds later.

This happens when changing anything routing-related (creating an interface, modifying or adding PBR, ...). It is very annoying when you need to make changes and almost every action causes a failover.
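
To put a number on how long the PNOTE stays on, the log can be parsed offline. The following is only a minimal Python sketch: the log lines are copied from the first post, while the regular expression and the function name `pnote_windows` are my own illustration and not part of any Check Point tool.

```python
import re
from datetime import datetime

# Log lines copied from the original post (timestamps and hostname as posted).
LOG = """\
May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE ON
May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-111700-1: State change: ACTIVE -> DOWN | Reason: ROUTED PNOTE
May  7 10:08:24 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: ROUTED PNOTE
May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE OFF
May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)
"""

# Timestamp prefix such as "May  7 10:08:23 2020", then "<device> PNOTE ON|OFF"
# at the end of the line. Lines that merely mention "ROUTED PNOTE" as a reason
# do not end in ON/OFF and are therefore skipped.
PNOTE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s.*?(\w+) PNOTE (ON|OFF)\s*$"
)


def pnote_windows(text):
    """Return (device, on_time, off_time, seconds) for each ON/OFF pair."""
    open_since = {}
    windows = []
    for line in text.splitlines():
        m = PNOTE_RE.match(line)
        if not m:
            continue
        stamp, device, state = m.groups()
        # Collapse the padded day ("May  7") so strptime sees single spaces.
        when = datetime.strptime(" ".join(stamp.split()), "%b %d %H:%M:%S %Y")
        if state == "ON":
            open_since[device] = when
        elif device in open_since:
            start = open_since.pop(device)
            windows.append((device, start, when, (when - start).total_seconds()))
    return windows


for device, start, end, seconds in pnote_windows(LOG):
    print(f"{device}: PNOTE on from {start:%H:%M:%S} to {end:%H:%M:%S} ({seconds:.0f} s)")
```

Run against a longer excerpt of /var/log/messages, the same function would list every ON/OFF window, which makes it easier to correlate them with the configuration changes.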

 

Regards,

Frank

 

Champion

Have you seen this SK:

sk109051: Troubleshooting Dynamic Routing - Cluster XL - PNOTE issues

If the routed pnote has a Timeout of "None" in the output of cphaprob -l list, even the slightest blip in that process will cause a failure of that pnote and an instant failover.  Perhaps you have a very large routed configuration and when it is changed the daemon goes "out to lunch" parsing the config for just long enough to trip the pnote?  The 5600 is not the speediest box in the world and that may be part of the issue.

It might be interesting to try increasing the timeout for the routed pnote to give it a little more leeway.  Would not recommend going beyond 2 or 3 seconds though.
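
To check the Timeout value for every registered pnote in one go, the output of cphaprob -l list can be saved to a file and scanned. The snippet below is a rough Python sketch under a stated assumption: it assumes the listing consists of "Device Name: ..." and "Timeout: ..." lines per device. The sample text is hand-written for illustration (the device `example_device` is hypothetical) and is not captured from a real gateway, so compare it with your own output before relying on it.

```python
# Hand-written sample in the ASSUMED layout of "cphaprob -l list";
# not real gateway output. "example_device" is a hypothetical name.
SAMPLE = """\
Device Name: routed
Registration number: 2
Timeout: none
Current state: OK

Device Name: example_device
Registration number: 3
Timeout: 2
Current state: OK
"""


def pnote_timeouts(listing):
    """Map device name -> timeout string, or None when reported as 'none'."""
    timeouts = {}
    device = None
    for raw in listing.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip().lower(), value.strip()
        if key == "device name":
            device = value
        elif key == "timeout" and device is not None:
            timeouts[device] = None if value.lower() == "none" else value
    return timeouts


for name, timeout in pnote_timeouts(SAMPLE).items():
    note = "no timeout, any blip trips it" if timeout is None else f"timeout {timeout}"
    print(f"{name}: {note}")
```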

Gaia 3.10 Immersion Self-paced Video Series
now available at http://www.maxpowerfirewalls.com
Contributor

Hi Tim,

 

thanks, that sounds interesting and I want to try it. To increase the timeout, do I have to unregister the device and register it again with the new timeout?

 

cphaprob -d routed [-p] unregister

cphaprob -d routed -t <timeout in sec> -s ok [-p] register

 

If I run "cphaprob -d routed unregister", it only prints the usage text...

How do I increase the timeout?
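
For reference, the two usage lines quoted above can be assembled by a small helper so that the arguments are at least built consistently. This is only a Python sketch of the syntax as quoted in this post: the function names and the 3-second guard (taken from the recommendation earlier in the thread) are my own, the helper only builds the command and does not run it, and nothing in this thread confirms that re-registering the routed device this way is supported.

```python
# Upper bound taken from the "2 or 3 seconds" suggestion earlier in this thread.
MAX_SUGGESTED_TIMEOUT_SEC = 3


def unregister_cmd(device, persistent=False):
    """Build: cphaprob -d <device> [-p] unregister (syntax as quoted above)."""
    cmd = ["cphaprob", "-d", device]
    if persistent:
        cmd.append("-p")
    cmd.append("unregister")
    return cmd


def register_cmd(device, timeout_sec, state="ok", persistent=False):
    """Build: cphaprob -d <device> -t <timeout in sec> -s <state> [-p] register."""
    if not 0 < timeout_sec <= MAX_SUGGESTED_TIMEOUT_SEC:
        raise ValueError(
            f"timeout must be between 1 and {MAX_SUGGESTED_TIMEOUT_SEC} seconds"
        )
    cmd = ["cphaprob", "-d", device, "-t", str(timeout_sec), "-s", state]
    if persistent:
        cmd.append("-p")
    cmd.append("register")
    return cmd


# Print the command lines for review instead of executing them.
print(" ".join(unregister_cmd("routed")))
print(" ".join(register_cmd("routed", 2)))
```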

 

Thanks

Frank

Champion

Actually check this SK first, as it is not immediately obvious how to modify the timeout for the routed pnote:

sk108069: "PNOTE Reporting" setting in Gaia OS causes frequent ClusterXL failovers

 

Contributor

Hi,

the checkbox for PNOTE Reporting mentioned in that SK was removed with R77.30, and the setting is disabled by default.

Setting a timeout for the routed device does not seem to be easy...

Thanks

Frank
