Solved: Policy Installation failed on GW | sk125152 | High...

KamilZet · ‎2023-09-06

Hi All,

I would like to share an ongoing issue that I have not been able to solve and for which I have not yet received useful feedback from TAC. Maybe you have seen something similar and managed to solve it.

My cluster is CP 6600 in VRRP mode, sync only, Gaia R81.10, take 110.

My problem started with a failed policy installation, and we got the following message:

"Policy installation failed on gateway. Cluster policy installation failed (see sk125152)"

After that we noticed higher CPU than normal, with some cores peaking at 100%; normally it is around 20-30%. So in my view there is a correlation between the policy installation failure and the high CPU. One action seems to confirm it: I installed the latest hotfix (take 110), and after the reboot everything looked really good, but when I tried to install policy again it failed and the high CPU recurred.

So I dug deeper. The SK indicates that it could be a problem with HA/ClusterXL. I found out that I cannot ping the second node's Sync IP address. The weird thing is that on the switches the firewalls connect to directly (access VLAN, the same one on both), neither access switch has a MAC address on the port leading to the Sync interface. The firewall output confirms it: in the ARP table, the IP I am trying to ping shows an "incomplete" MAC<->IP resolution. It is the same on both ends, on different access switches.

So the topology is as below:

fw node1 Sync port ---> access switch dc1 vlan 1000 ---> fiber between dc --> access switch dc2 vlan 1000 -->fw node2 Sync port

Do you know what I could check further?

I shut/unshut the ports on the firewalls and switches without any success. Is it possible that some HA process hung or crashed and is not sending any traffic, so the switch cannot learn a MAC on that port?

Thank you in advance for any hints.
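
The "incomplete" ARP entry described above is easy to spot automatically. The following is a minimal sketch, assuming the ARP table was captured with the Linux `arp -an` command in expert mode; the function name, the sample text, and the Mgmt entry are hypothetical, and 192.168.10.227 is used only as an example peer address.

```python
import re

# Matches Linux `arp -an` lines such as:
#   ? (192.168.10.227) at <incomplete> on Sync
#   ? (10.0.0.1) at 00:1c:7f:aa:bb:cc [ether] on Mgmt
ARP_LINE = re.compile(
    r"\((?P<ip>[\d.]+)\) at (?P<mac>\S+)(?: \[\w+\])? on (?P<iface>\S+)"
)


def incomplete_arp_entries(arp_output, iface=None):
    """Return IPs whose ARP entry is unresolved, optionally for one interface."""
    unresolved = []
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line)
        if not match:
            continue  # header or unrelated line
        if iface is not None and match["iface"] != iface:
            continue
        if match["mac"] == "<incomplete>":
            unresolved.append(match["ip"])
    return unresolved


# Hypothetical sample input; the Mgmt entry is invented for illustration.
SAMPLE = """\
? (192.168.10.227) at <incomplete> on Sync
? (10.0.0.1) at 00:1c:7f:aa:bb:cc [ether] on Mgmt
"""

if __name__ == "__main__":
    print(incomplete_arp_entries(SAMPLE, iface="Sync"))
```

An unresolved entry on the Sync interface on both members, combined with no MAC learned on the switch ports, points at layer 2 between the nodes rather than at the firewall policy.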

Timothy_Hall · ‎2023-09-06

This "Cluster policy installation failed" message no longer only means that the atomic load/commit failed or timed out on one of the cluster members; in R81+ it can also indicate that some kind of cluster sanity check failed during policy installation. You'll need to look in $FWDIR/log/cphaconf.elg on both members for clues about what is wrong. So far I've seen this message indicate:

1) One of the cluster members is set for MVC and one is not (sk179969: Policy installation fails with error "Policy installation failed on gateway. Clusterpolicy...")

2) The state of cluster enablement in cpconfig is incorrect (enabled for a non-cluster object, or disabled for a gateway that is part of a cluster object - sk180980: Policy installation failure with error message "Policy installation failed on gateway. Clu...

There are probably some other sanity checks I haven't run into yet.

The fact that you can't ARP on the sync network is a definite problem, and may be another one of the new sanity checks that are performed; namely making sure that the sync network is working, assuming state sync is enabled on the cluster object. ARP is never denied by a security policy or antispoofing so I'd look there. The high CPU is probably a symptom of the problem rather than the cause, unless it is so extreme it is causing a commit timeout on one of the gateways.
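
For the suggestion above to look in $FWDIR/log/cphaconf.elg on both members, here is a rough sketch of a keyword scan. The keyword list is an assumption (the actual wording of entries in that log is not shown in this thread), and the function names are hypothetical. It can be run anywhere Python 3 is available, for example against a copy of the file.

```python
import os
from pathlib import Path

# Assumption: these keywords are only a starting point; adjust them after
# looking at what the log on your members actually contains.
DEFAULT_KEYWORDS = ("error", "fail", "timeout")


def suspicious_lines(text, keywords=DEFAULT_KEYWORDS, last=200):
    """Return (line_number, line) pairs from the last `last` lines that contain a keyword."""
    lines = text.splitlines()
    start = max(0, len(lines) - last)
    hits = []
    for number, line in enumerate(lines[start:], start=start + 1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line))
    return hits


def scan_cphaconf_log(fwdir=None):
    """Scan <FWDIR>/log/cphaconf.elg; FWDIR defaults to the environment variable."""
    fwdir = fwdir or os.environ["FWDIR"]
    path = Path(fwdir) / "log" / "cphaconf.elg"
    return suspicious_lines(path.read_text(errors="replace"))
```

Running it on both members and comparing the hits around the time of the failed installation is the point; a hit on only one member narrows down where the sanity check failed.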


KamilZet · ‎2023-09-07

Thanks for joining the conversation. I will review the log you mentioned, cphaconf.elg. Regarding the SKs you shared, neither applies to me:

1.

[Expert@fw-de-niest-01:0]# cphaprob mvc

OFF

[Expert@fw-de-niest-02:0]# cphaprob mvc

OFF

2. I have ClusterXL in sync-only mode with VRRP, and it is configured on mgmt.

In addition, we captured packets on the switch port directly connected to the Sync port on the FW, where I am not seeing the MAC, and there is only traffic like this:

[screenshot of packet capture]

the_rock · ‎2023-09-06

Just to be sure of the cluster state, can you send the output of the commands below from both members?

Andy

cphaprob roles

cphaprob state

cphaprob list

cphaprob -a if

cphaprob syncstat

KamilZet · ‎2023-09-06

here you are:

fw1:

[Expert@fw-de-niest-01:0]# cphaprob roles

ID Role

1 (local) Master

[Expert@fw-de-niest-01:0]# cphaprob state

Cluster Mode: Sync only (OPSEC) with IGMP Membership

ID Unique Address Firewall State (*)

1 (local) 192.168.10.226 Active

(*) FW-1 monitors only the sync operation and the security policy
Use OPSEC's monitoring tool to get the cluster status
[Expert@fw-de-niest-01:0]# cphaprob list

There are no pnotes in problem state

[Expert@fw-de-niest-01:0]# cphaprob -a if

CCP mode: Manual (Multicast)
Sync sync(secured), multicast
Mgmt non sync(non secured)
eth1-04 non sync(non secured)
eth1-02 non sync(non secured)
eth1-03 non sync(non secured)
eth1-01 non sync(non secured)
eth1-02 non sync(non secured)

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 19

eth1-04 xxxxx ( x just to hide in use ip addresses )
eth1-02.2001 xxxxx
eth1-02.3507 xxxxxx
eth1-02.3503 xxxxx
eth1-02.3524 xxxxx
eth1-02.2100 xxxxx
eth1-02.3505 xxxxx
eth1-02.2030 xxxxx
eth1-03.2032 xxxxx
eth1-02.3504 xxxxx
eth1-01.2086 xxxxx
eth1-02.3508 xxxxx
eth1-02.2031 xxxxx
eth1-02.3529 xxxxx
eth1-02.3587 xxxxx
eth1-02.3588 xxxxx
eth1-02.3523 xxxxx
eth1-02.2084 xxxxx
eth1-02.3510 xxxxx

[Expert@fw-de-niest-01:0]# cphaprob syncstat

Delta Sync Statistics

Sync status: OK

Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0

Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0

Sent messages:
Total generated sync messages................ 45951141
Sent retransmission requests................. 0
Sent retransmission updates.................. 0
Peak fragments per update.................... 2

Received messages:
Total received updates....................... 0
Received retransmission requests............. 0

Sync Interface:
Name......................................... Sync
Link speed................................... 1000Mb/s
Rate......................................... 5178 [KBps]
Peak rate.................................... 7906 [KBps]
Link usage................................... 4%
Total........................................ 655036[MB]

Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50

Timers:
Delta Sync interval (ms)..................... 100

Reset on Tue Sep 5 22:14:50 2023 (triggered by fullsync).

fw2:

[Expert@fw-de-niest-02:0]# cphaprob roles

ID Role

2 (local) Non-Master

[Expert@fw-de-niest-02:0]# cphaprob state

Cluster Mode: Sync only (OPSEC) with IGMP Membership

ID Unique Address Firewall State (*)

2 (local) 192.168.10.227 Active

(*) FW-1 monitors only the sync operation and the security policy
Use OPSEC's monitoring tool to get the cluster status
[Expert@fw-de-niest-02:0]# cphaprob list

There are no pnotes in problem state

[Expert@fw-de-niest-02:0]# cphaprob -a if

CCP mode: Manual (Multicast)
Sync sync(secured), multicast
Mgmt non sync(non secured)
eth1-04 non sync(non secured)
eth1-02 non sync(non secured)
eth1-03 non sync(non secured)
eth1-01 non sync(non secured)
eth1-02 non sync(non secured)

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 19

eth1-04 xxxxxxx
eth1-02.2001 xxxxxxx
eth1-02.3507 xxxxxxx
eth1-02.3503 xxxxxxx
eth1-02.3524 xxxxxxx
eth1-02.2100 xxxxxxx
eth1-02.3505 xxxxxxx
eth1-02.2030 xxxxxxx
eth1-03.2032 xxxxxxx
eth1-02.3504 xxxxxxx
eth1-01.2086 xxxxxxx
eth1-02.3508 xxxxxxx
eth1-02.2031 xxxxxxx
eth1-02.3529 xxxxxxx
eth1-02.3587 xxxxxxx
eth1-02.3588 xxxxxxx
eth1-02.3523 xxxxxxx
eth1-02.2084 xxxxxxx
eth1-02.3510 xxxxxxx

[Expert@fw-de-niest-02:0]# cphaprob syncstat

Delta Sync Statistics

Sync status: OK

Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0

Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0

Sent messages:
Total generated sync messages................ 3349848
Sent retransmission requests................. 0
Sent retransmission updates.................. 0
Peak fragments per update.................... 2

Received messages:
Total received updates....................... 0
Received retransmission requests............. 0

Sync Interface:
Name......................................... Sync
Link speed................................... 1000Mb/s
Rate......................................... 16620 [Bps]
Peak rate.................................... 4815 [KBps]
Link usage................................... 0%
Total........................................ 4976 [MB]

Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50

Timers:
Delta Sync interval (ms)..................... 100

Reset on Tue Sep 5 21:30:04 2023 (triggered by fullsync).
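
One detail in both `cphaprob syncstat` outputs above is that each member has generated millions of sync messages but reports "Total received updates" as 0, even though "Sync status" says OK. The following is a small sketch that extracts the numeric counters and flags that pattern. The function names are hypothetical, and the parsing assumes the "Label..... value" layout shown in this thread.

```python
import re

# Counters are printed as "Label........ 123". Lines whose value carries a
# unit (for example "1000Mb/s" or "5178 [KBps]") are deliberately skipped.
COUNTER = re.compile(
    r"^(?P<label>[^.\n]+?)\.{3,}\s*(?P<value>\d+)\s*$", re.MULTILINE
)


def sync_counters(syncstat_output):
    """Return {label: int} for every plain numeric counter in the output."""
    return {
        m["label"].strip(): int(m["value"])
        for m in COUNTER.finditer(syncstat_output)
    }


def sync_is_one_way(syncstat_output):
    """True when this member sends sync messages but has received none."""
    counters = sync_counters(syncstat_output)
    sent = counters.get("Total generated sync messages", 0)
    received = counters.get("Total received updates", 0)
    return sent > 0 and received == 0


# Excerpt of the fw1 output posted in this thread.
FW1_EXCERPT = """\
Sent messages:
Total generated sync messages................ 45951141
Sent retransmission requests................. 0

Received messages:
Total received updates....................... 0
Received retransmission requests............. 0

Sync Interface:
Link speed................................... 1000Mb/s
Rate......................................... 5178 [KBps]
"""

if __name__ == "__main__":
    print(sync_is_one_way(FW1_EXCERPT))
```

If the check is true on both members, neither side is hearing the other on the sync network, which fits the unresolved ARP entries reported earlier.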

the_rock · ‎2023-09-07

This is your issue... BOTH members "think" they are active, as neither shows as backup. Can you verify in the topology that all those interfaces are configured as clustered AND that you can also run "Get Interfaces Without Topology"?
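
What the two `cphaprob state` outputs show directly is that each member lists only itself: there is no row for the peer. Below is a minimal sketch that parses the member rows and checks whether any non-local member is visible. The row layout is assumed from the sync-only output posted in this thread (other cluster modes print extra columns), and the function names are hypothetical.

```python
import re

# Member rows look like: "1 (local) 192.168.10.226 Active"
# A peer row is assumed to look the same without the "(local)" marker.
MEMBER_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<local>\(local\)\s+)?"
    r"(?P<address>\d{1,3}(?:\.\d{1,3}){3})\s+(?P<state>.+?)\s*$"
)


def cluster_members(state_output):
    """Return one dict per member row found in `cphaprob state` output."""
    members = []
    for line in state_output.splitlines():
        m = MEMBER_ROW.match(line)
        if m:
            members.append({
                "id": int(m["id"]),
                "local": m["local"] is not None,
                "address": m["address"],
                "state": m["state"],
            })
    return members


def sees_peer(state_output):
    """True if at least one non-local member is listed."""
    return any(not member["local"] for member in cluster_members(state_output))


# The fw1 output posted in this thread.
FW1_STATE = """\
Cluster Mode: Sync only (OPSEC) with IGMP Membership

ID Unique Address Firewall State (*)

1 (local) 192.168.10.226 Active

(*) FW-1 monitors only the sync operation and the security policy
"""

if __name__ == "__main__":
    print(cluster_members(FW1_STATE))
    print(sees_peer(FW1_STATE))
```

A member that cannot see its peer here is consistent with a broken sync network, whatever the cause turns out to be.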

Andy

KamilZet · ‎2023-09-07

Just managed to solve it. It was absolutely our fault: the VLAN was removed by mistake on the VTP server during a migration, which removed it from all the clients as well. So the access VLAN was still configured on the port, but in fact the VLAN no longer existed 🙂, and no one checked the simplest thing; we were all digging through logs, changes, etc.

So restoring communication on the sync link solved the high CPU (quite interesting why; maybe because VRRP was still in a proper state while ClusterXL sync had trouble?), the policy installation, etc.

Thank you all for your suggestions and help.
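
The root cause above (an access port still assigned to a VLAN that no longer exists on the switch) can be expressed as a simple consistency check. This is an illustrative sketch only: the port names and the VLAN database are invented, and how you collect them from your switches is left open. VLAN 1000 is the sync VLAN from the topology in the first post.

```python
def ports_with_missing_vlan(access_ports, vlan_database):
    """Return {port: vlan} for access ports whose VLAN is not defined on the switch."""
    return {
        port: vlan
        for port, vlan in access_ports.items()
        if vlan not in vlan_database
    }


# Hypothetical data: port names and the set of existing VLANs are made up.
access_ports = {"Gi1/0/10": 1000, "Gi1/0/11": 2001}
vlan_database = {1, 2001}

if __name__ == "__main__":
    print(ports_with_missing_vlan(access_ports, vlan_database))
```

A port flagged by this check looks correctly configured at a glance, which is why it was easy to overlook here.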

the_rock · ‎2023-09-07

Yep, that's exactly it. It's important to remember that, unlike most other major vendors, changes in a CP cluster do NOT replicate automatically from master to backup, like they do in Cisco, FGT, PAN.

Great job btw 👍✔

Cheers,

Andy

