Gro_Tea
Contributor

Cluster failover after every change to PBR, routing, OSPF, ...

Hello,

I have two 5600 appliances running in HA on R80.30, but the behaviour was already present on R80.10.

Every time I make a change to the PBR settings (adding a table, etc.), add an interface, or add an OSPF route redistribution, the node I am working on gets degraded to DOWN. If I do this on the primary node, a failover occurs.

May  7 10:08:22 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

May  7 10:08:22 2020 DETKDUSIPS09 kernel: [fw4_1];[xxxxxxxxx:46770 -> xxxxxxxxxxxx] [ERROR]: cmik_loader_fw_context_match_cb: match_cb for CMI APP 3 failed on context 56, executing context 366 and adding the app to apps in exception

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: [routed] NOTICE:  task_cmd_init(143): command subsystem initialized.

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: [routed] NOTICE:  Start routed[27001] version routed-12.30.2019-11:21:08 instance 0

May  7 10:08:23 2020 DETKDUSIPS09 routed[27001]: routed_syslog_on: tracing to "/var/log/routed_messages" started

May  7 10:08:23 2020 DETKDUSIPS09 kernel: Passive ARP hook already uninstalled!

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];Global param: set int fwha_cbs_which_member_is_running_gated to '0'

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE ON

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-111700-1: State change: ACTIVE -> DOWN | Reason: ROUTED PNOTE

May  7 10:08:23 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-214704-1: Remote member 2 (state STANDBY -> ACTIVE) | Reason: No other ACTIVE members have been found in the cluster

May  7 10:08:23 2020 DETKDUSIPS09 routed[12380]: [routed] ERROR:   recv(header) returns 0

May  7 10:08:24 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: ROUTED PNOTE

May  7 10:08:25 2020 DETKDUSIPS09 xpand[10387]: admin localhost t -volatile:configurationChange

May  7 10:08:25 2020 DETKDUSIPS09 xpand[10387]: admin localhost t -volatile:configurationSave

May  7 10:08:29 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-120105-1: routed PNOTE OFF

May  7 10:08:31 2020 DETKDUSIPS09 kernel: [fw4_1];CLUS-114802-1: State change: DOWN -> STANDBY | Reason: There is already an ACTIVE member in the cluster (member 2)

May  7 10:08:38 2020 DETKDUSIPS09 kernel: [SIM4];sim_restore_ip_options: failed to properly restore IP options

 

Does anybody have an idea what is causing this?

 

Thanks

Frank

 

1 Solution

Accepted Solutions
HeikoAnkenbrand
Champion

19 Replies
HeikoAnkenbrand
Champion
Gro_Tea
Contributor

Hello Heiko,

thanks for the reply, I checked the links...

The problem is that the pnote only lasts a few seconds.

The pnote ON at 10:08:23 degrades the member from ACTIVE to DOWN. Then, without any interaction, the state goes from DOWN to STANDBY (because the other node is now ACTIVE) only 8 seconds later.

This happens while making changes in routing contexts (creating an interface, modifying/adding PBR, ...). It is very annoying when you need to make changes and almost every action causes a failover.

 

Regards,

Frank

 

Timothy_Hall
Legend

Have you seen this SK:

sk109051: Troubleshooting Dynamic Routing - Cluster XL - PNOTE issues

If the routed pnote has a Timeout of "None" in the output of cphaprob -l list, even the slightest blip in that process will cause a failure of that pnote and an instant failover.  Perhaps you have a very large routed configuration and when it is changed the daemon goes "out to lunch" parsing the config for just long enough to trip the pnote?  The 5600 is not the speediest box in the world and that may be part of the issue.

It might be interesting to try increasing the timeout for the routed pnote to give it a little more leeway.  Would not recommend going beyond 2 or 3 seconds though.
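
For illustration, the current registration of the routed pnote can be checked with the registered-devices listing (a sketch; the output below is only illustrative and the values will differ on your box):

# cphaprob -l list
[...]
Device Name: routed
Registration number: 2
Timeout: none
Current state: OK

A Timeout of "none" means there is no grace period at all, so even a momentary unavailability of routed trips the pnote.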

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices
Self-Guided Video Series Coming Soon
Gro_Tea
Contributor

Hi Tim,

 

thanks, that sounds interesting and I want to try it. To increase the timeout, do I have to unregister the device and register it again with the new timeout?

 

cphaprob -d routed [-p] unregister

cphaprob -d routed -t <timeout in sec> -s ok [-p] register

 

If I run "cphaprob -d routed unregister", it just answers with the usage list...

How do I increase the timeout?

 

Thanks

Frank

Timothy_Hall
Legend

Actually check this SK first, as it is not immediately obvious how to modify the timeout for the routed pnote:

sk108069: "PNOTE Reporting" setting in Gaia OS causes frequent ClusterXL failovers

 

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices
Self-Guided Video Series Coming Soon
Gro_Tea
Contributor

Hi,

the mentioned checkbox for PNOTE Reporting was removed with R77.30 and the setting is disabled by default.

It seems that setting a timeout for the routed device is not easy...

Thanks

Frank

ajax
Explorer

Hi, Frank!

Did you find any solution for this failover issue after creating an interface, adding a route, and so on?

I now have the same issue, but I can't find out how to fix it.

Chris_Atkinson
Employee

Are you applying your changes first on the standby or active?

CCSM R77/R80/ELITE
ajax
Explorer

Hello! On the active member first. Is this not right?

Chris_Atkinson
Employee

Please see sk57100 for an example process

https://support.checkpoint.com/results/sk/sk57100

CCSM R77/R80/ELITE
ajax
Explorer

Hello! Thank you for the answer. I understand the procedure for adding/deleting interfaces, but we get a failover when adding/editing IP routes. If we add a static route or redistribute a static route into OSPF on the active node, we get a failover after a few seconds. You can see the messages from the active node (I attached part of the messages file).

Chris_Atkinson
Employee

For context, are you seeing any impact from the failover, or do you just observe that it happens?

Are you using graceful restart?

Are the OSPF router-ids aligned for both members?

CCSM R77/R80/ELITE
ajax
Explorer

Yes, all 4 nodes have the same router-id in the OSPF process.

About the impact from the failover: I can't say exactly, but in any case we need to understand why we get a failover after adding/editing routes and how to fix it.

ajax
Explorer

I checked in my lab with a virtual Check Point, and when I redistribute a static route or a directly connected interface into OSPF, I don't get these messages and there is no failover (but the real Check Point cluster does produce these messages, and after them we get a failover):

May 15 09:06:25 2025 cp-int-1 routed[31560]: [routed] NOTICE: task_cmd_init(145): command subsystem initialized.
May 15 09:06:25 2025 cp-int-1 routed[31560]: [routed] NOTICE: Start routed[31560] version routed-11.06.2024-17:55:19 instance 0
May 15 09:06:25 2025 cp-int-1 routed[31560]: [routed] NOTICE: mc_enabling_check_startup(131): Starting up with multicast routing enabled (see routed_messages for subsequent messages)
May 15 09:06:25 2025 cp-int-1 routed[31560]: routed_syslog_on: tracing to "/var/log/routed_messages" started

 

Does anybody know what these messages mean?

ajax
Explorer

Hello!

I found the solution for the failover issue after every routing change.

It turned out that we had a static route to a subnet used for NAT configured on the gateways, with the cluster address of one of the interfaces specified as the next-hop. This static route was then redistributed into OSPF for the neighbouring routers. The gateways complained in routed_messages that the next-hop address of these static routes belonged to a local interface, which is invalid. As soon as I deleted these static routes, the failover error disappeared, and now the node activity no longer changes after routing changes.
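
For illustration only, a minimal Gaia clish sketch of the kind of configuration that triggered it and of the fix (the prefix 10.99.99.0/24, the cluster VIP 192.168.1.1 and the redistribution line are assumed placeholders; verify the exact redistribution syntax for your Gaia version):

The problematic static route, with the cluster VIP of a local interface as next-hop:

set static-route 10.99.99.0/24 nexthop gateway address 192.168.1.1 on

Redistributing that static route into OSPF for the neighbouring routers (syntax is an assumption):

set route-redistribution to ospf2 instance default from static-route 10.99.99.0/24 on

Removing the static route again is what made the failovers on routing changes stop:

set static-route 10.99.99.0/24 off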

I reproduced this error on a test setup where I simulate the same problem (you can see it in the attachment).

Oliver_Fink
Advisor

In the logs you can see that routed gets started. I guess that is because it was stopped before – maybe to reload its configuration. But routed is a device registered with the cluster.

# cphaprob -l li

[…]

Registered Devices:

[…]

Device Name: routed
Registration number: 2
Timeout: none
Current state: OK
Time since last report: XXXX sec

If a registered device is not available, the cluster fails over.

ajax
Explorer

Thank you for the answer!

I checked, and all cluster members have routed as a registered device, with status OK.

Oliver_Fink
Advisor

Yes, the routed process gets started – as I wrote. Before that, the routed process seems to have been stopped; that is why the cluster fails over. After the (re)start of routed everything is fine again – except that the cluster now runs on the other node. The failover happens in the window between the stop and the restart of the routed process, I guess.
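
A quick way to confirm that on the box (a sketch using standard tools; /var/log/messages is the default location, as in the excerpts above):

# cphaprob state
(shows which member is ACTIVE before and after the routing change)

# grep -E "routed PNOTE|Failover|State change" /var/log/messages
(shows exactly when the routed pnote flapped and when the failover happened)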

G_W_Albrecht
Legend

Frank had the issue 5 years ago, so your question is kind of... I would suggest creating a post yourself!

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist