Gro_Tea
Contributor

Cluster failover after every change to PBR, routing, OSPF, ...

Hello,

I have two 5600 appliances running in HA on R80.30, but the behaviour was already present on R80.10.

Every time I make a change to the PBR settings (adding a table, etc.), add an interface, or add an OSPF route redistribution, the node I am working on gets degraded to DOWN. If I do this on the primary node, a failover occurs.

May  7 10:08:22 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

May  7 10:08:22 2020 DETKDUSIPS09 kernel: [fw4_1];[xxxxxxxxx:46770 -> xxxxxxxxxxxx] [ERROR]: cmik_loader_fw_context_match_cb: match_cb for CMI APP 3 failed on context 56, executing context 366 and adding the app to apps in exception

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: [routed] NOTICE:  task_cmd_init(143): command subsystem initialized.

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: [routed] NOTICE:  Start routed[27001] version routed-12.30.2019-11:21:08 instance 0

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: routed_syslog_on: tracing to "/var/log/routed_messages" started

May  7 10:08:23 2020 DETKDUSIPS09 kernel: Passive ARP hook already uninstalled!

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];Global param: set int fwha_cbs_which_member_is_running_gated to '0'

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE ON

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-111700-1: State change: ACTIVE -> DOWN | Reason: ROUTED PNOTE

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-214704-1: Remote member 2 (state STANDBY -> ACTIVE) | Reason: No other ACTIVE members have been found in the cluster

May  7 10:08:23 2020 DETKDUSIPS09 routed[12380]: [routed] ERROR:   recv(header) returns 0

May  7 10:08:24 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: ROUTED PNOTE

May  7 10:08:25 2020 DETKDUSIPS09 xpand[10387]: admin localhost t -volatile:configurationChange

May  7 10:08:25 2020 DETKDUSIPS09 xpand[10387]: admin localhost t -volatile:configurationSave

May  7 10:08:29 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE OFF

May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)

May  7 10:08:38 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

 

Does anybody have an idea what is causing this?

 

Thanks

Frank

 

1 Solution

Accepted Solutions
HeikoAnkenbrand
Champion

19 Replies
HeikoAnkenbrand
Champion
Gro_Tea
Contributor

Hello Heiko,

thanks for the reply, I checked the links...

The problem is that the pnote only lasts a few seconds.

The pnote ON at 10:08:23 degrades the member from ACTIVE to DOWN. Then, without any interaction, the state goes from DOWN to STANDBY (because the other node is now ACTIVE) only 8 seconds later.

This happens while making changes in routing contexts (creating an interface, modifying/adding PBR, ...). It is very annoying when you need to make changes and almost every action causes a failover.

 

Regards,

Frank

 

Timothy_Hall
Legend

Have you seen this SK:

sk109051: Troubleshooting Dynamic Routing - Cluster XL - PNOTE issues

If the routed pnote has a Timeout of "None" in the output of cphaprob -l list, even the slightest blip in that process will cause a failure of that pnote and an instant failover.  Perhaps you have a very large routed configuration and when it is changed the daemon goes "out to lunch" parsing the config for just long enough to trip the pnote?  The 5600 is not the speediest box in the world and that may be part of the issue.

It might be interesting to try increasing the timeout for the routed pnote to give it a little more leeway.  Would not recommend going beyond 2 or 3 seconds though.
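
For illustration, the current registration of the routed pnote can be checked with the registered-devices listing (a sketch; the output below is only illustrative and the values will differ on your box):

# cphaprob -l list
[...]
Device Name: routed
Registration number: 2
Timeout: none
Current state: OK

A Timeout of "none" means there is no grace period at all, so even a momentary unavailability of routed trips the pnote.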

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices
Self-Guided Video Series Coming Soon
Gro_Tea
Contributor

Hi Tim,

 

thanks, that sounds interesting and I want to try it. To increase the timeout, do I have to unregister the device and register it again with the new timeout?

 

cphaprob -d routed [-p] unregister

cphaprob -d routed -t <timeout in sec> -s ok [-p] register

 

If I run "cphaprob -d routed unregister", it just answers with the usage list...

How do I increase the timeout?

 

Thanks

Frank

Timothy_Hall
Legend

Actually check this SK first, as it is not immediately obvious how to modify the timeout for the routed pnote:

sk108069: "PNOTE Reporting" setting in Gaia OS causes frequent ClusterXL failovers

 

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices
Self-Guided Video Series Coming Soon
Gro_Tea
Contributor

Hi,

the mentioned checkbox for PNOTE Reporting was removed with R77.30 and the setting is disabled by default.

It seems that setting a timeout for the routed device is not easy...

Thanks

Frank

ajax
Explorer

Hi, Frank!

Did you find any solution for this failover issue after creating an interface, adding a route, and so on?

I now have the same issue, but I can't find out how to fix it.

Chris_Atkinson
Employee

Are you applying your changes first on the standby or active?

CCSM R77/R80/ELITE
ajax
Explorer

Hello! On the active member first. Is this not right?

Chris_Atkinson
Employee

Please see sk57100 for an example process

https://support.checkpoint.com/results/sk/sk57100

CCSM R77/R80/ELITE
ajax
Explorer

Hello! Thank you for the answer. I understand the procedure for adding/deleting interfaces, but we get a failover when adding/editing IP routes. If we add a static route or redistribute a static route into OSPF on the active node, we get a failover after a few seconds. You can see the messages from the active node (I attached part of the messages file).

Chris_Atkinson
Employee

For context, are you seeing any impact from the failover, or do you just observe that it happens?

Are you using graceful restart?

Are the OSPF router-ids aligned for both members?

CCSM R77/R80/ELITE
ajax
Explorer

Yes, all 4 nodes have the same router-id in the OSPF process.

About the impact from the failover: I can't say exactly, but in any case we need to understand why we get a failover after adding/editing routes and how to fix it.

ajax
Explorer

I checked in my lab with a virtual Check Point, and when I redistribute a static route or a directly connected interface into OSPF, I don't get these messages and there is no failover (but the real Check Point cluster does produce these messages, and after them we get a failover):

May 15 09:06:25 2025 cp-int-1 routed[31560]: [routed] NOTICE: task_cmd_init(145): command subsystem initialized.
May 15 09:06:25 2025 cp-int-1 routed[31560]: [routed] NOTICE: Start routed[31560] version routed-11.06.2024-17:55:19 instance 0
May 15 09:06:25 2025 cp-int-1 routed[31560]: [routed] NOTICE: mc_enabling_check_startup(131): Starting up with multicast routing enabled (see routed_messages for subsequent messages)
May 15 09:06:25 2025 cp-int-1 routed[31560]: routed_syslog_on: tracing to "/var/log/routed_messages" started

 

Does anybody know what these messages mean?

ajax
Explorer

Hello!

I found the solution for the failover issue after every routing change.

It turned out that we had a static route to a subnet used for NAT configured on the gateways, with the cluster address of one of the interfaces specified as the next-hop. This static route was then redistributed into OSPF for the neighbouring routers. The gateways complained in routed_messages that the next-hop address of these static routes belonged to a local interface, which is invalid. As soon as I deleted these static routes, the failover error disappeared, and now the node activity no longer changes after routing changes.
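
For illustration only, a minimal Gaia clish sketch of the kind of configuration that triggered it and of the fix (the prefix 10.99.99.0/24, the cluster VIP 192.168.1.1 and the redistribution line are assumed placeholders; verify the exact redistribution syntax for your Gaia version):

The problematic static route, with the cluster VIP of a local interface as next-hop:

set static-route 10.99.99.0/24 nexthop gateway address 192.168.1.1 on

Redistributing that static route into OSPF for the neighbouring routers (syntax is an assumption):

set route-redistribution to ospf2 instance default from static-route 10.99.99.0/24 on

Removing the static route again is what made the failovers on routing changes stop:

set static-route 10.99.99.0/24 off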

I reproduced this error on a test setup where I simulate the same problem (you can see it in the attachment).

Oliver_Fink
Advisor

In the logs you can see that routed gets started. I guess that is because it was stopped before – maybe to reload its configuration. But routed is a device registered with the cluster.

# cphaprob -l li

[…]

Registered Devices:

[…]

Device Name: routed
Registration number: 2
Timeout: none
Current state: OK
Time since last report: XXXX sec

If a registered device is not available, the cluster fails over.

ajax
Explorer

Thank you for the answer!

I checked, and all cluster members have routed as a registered device, with status OK.

Oliver_Fink
Advisor

Yes, the routed process gets started – as I wrote. Before that, the routed process seems to have been stopped; that is why the cluster fails over. After the (re)start of routed everything is fine again – except that the cluster now runs on the other node. The failover happens in the window between the stop and the restart of the routed process, I guess.
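
A quick way to confirm that on the box (a sketch using standard tools; /var/log/messages is the default location, as in the excerpts above):

# cphaprob state
(shows which member is ACTIVE before and after the routing change)

# grep -E "routed PNOTE|Failover|State change" /var/log/messages
(shows exactly when the routed pnote flapped and when the failover happened)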

G_W_Albrecht
Legend

Frank had the issue 5 years ago, so your question is kind of... I would suggest creating a post yourself!

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist