We recently upgraded a VSX VSLS cluster from R80.20 to R81.10 HF55, and the VSs are now reported as DOWN due to an FWD pnote (FWD on the Active VSX cluster member's VSs is in Terminated state T).
Only one VS is in an Active/Standby state; the rest are Active/Down.
Has anyone faced such an issue recently?
See sample output.
Cluster name: CP-Cluster
Virtual Devices Status on each Cluster Member
=============================================
  ID   | Weight| CP-GW-1   | CP-GW-2
       |       | [local]   |
-------+-------+-----------+-----------
2 | 10 | DOWN | ACTIVE
6 | 10 | ACTIVE(!) | DOWN
7 | 10 | DOWN | ACTIVE(!)
8 | 10 | DOWN | ACTIVE(!)
9 | 10 | DOWN | ACTIVE(!)
10 | 10 | STANDBY | ACTIVE
---------------+-----------+-----------
Active | 1 | 5
Weight | 10 | 50
Weight (%) | 16 | 84
Legend: Init - Initializing, Active! - Active Attention
Down! - ClusterXL Inactive or Virtual System is Down
Hopefully you are engaged with TAC on this issue. What has already been attempted in terms of troubleshooting and recovery?
What process was followed to complete the upgrade, and are both members upgraded at this point?
@Chris_Atkinson, what is the effect of enabling dynamic_balancing after making the above changes?
Is it better to leave the defaults and enable dynamic_balancing?
--- Current CoreXL affinity & MQ settings ----------------------------------
[Expert@CP-GW2:0]# fw ctl affinity -l -a
VS_0 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_1 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_2 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_3 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_4 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_6 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_7 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_8 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_9 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_10 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Interface Mgmt: has multi queue enabled
Interface eth2-01: has multi queue enabled
Interface eth2-02: has multi queue enabled
Interface eth4-03: has multi queue enabled
Interface eth4-04: has multi queue enabled
Interface eth5-01: has multi queue enabled
Interface eth5-02: has multi queue enabled
Interface eth1-01: has multi queue enabled
Interface eth1-02: has multi queue enabled
[Expert@CP-GW2:0]#
[Expert@CP-GW2:0]# cpmq get -vv
Note: 'cpmq' is deprecated and no longer supported. For multiqueue management, please use 'mq_mng'
Current multiqueue status:
Total 48 cores. Available for MQ 4 cores
i/f driver driver mode state mode (queues) cores
actual/avail
------------------------------------------------------------------------------------------------
Mgmt igb Kernel Up Auto (2/2) 0,24
eth1-01 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth1-02 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth2-01 igb Kernel Up Auto (4/4) 0,24,1,25
eth2-02 igb Kernel Up Auto (4/4) 0,24,1,25
eth4-03 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth4-04 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth5-01 mlx5_core Kernel Up Auto (4/4) 0,24,1,25
eth5-02 mlx5_core Kernel Up Auto (4/4) 0,24,1,25
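Side note: since cpmq is flagged as deprecated in the output above, the same view should be available via mq_mng on R81.10. The flag below is from memory, so treat it as an assumption and check mq_mng's help output first:
# show the current Multi-Queue configuration (R81.10 replacement for 'cpmq get')
mq_mng --show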
A prerequisite to start Dynamic Balancing is having all FWKs set to the default FWK CPUs.
@AmitShmuel talks about it here:
I will assume default settings means the out-of-the-box configuration, with CoreXL enabled on VS0 with 40 instances. From there, Dynamic Balancing is assumed to do its magic.
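For reference, here is a minimal sketch of how I plan to check and enable Dynamic Balancing on R81.10; the command names follow the Dynamic Balancing documentation as I recall them, so verify them on your take before running anything:
# print the current Dynamic Balancing state on this member
dynamic_balancing -p
# enable Dynamic Balancing; it then manages the SND/FWK core split on its own
dynamic_balancing -o enable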
Also, on further analysis using fw monitor, we suspect that our Mellanox 2x40G NIC cards might be having issues while in a bonded state. We will also try a firmware upgrade of the cards after testing the CoreXL/MQ settings.
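For anyone checking the same thing, the installed driver and firmware versions of the Mellanox 40G ports can be read with ethtool (interface names as in the cpmq output above):
# shows driver, driver version and firmware-version per port
ethtool -i eth5-01
ethtool -i eth5-02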
Update
Seems the 40G NICs required a firmware update after moving to R81.10. Traffic was being dropped on the 40G bond interface, causing TCP SYN retransmissions and slow loading of web applications. After upgrading the firmware all was good, but we have run into a new issue, as seen below from TAC.
" It seems that you have experienced a segmentation fault that is recently common in the later takes of R81.10 - this segmentation fault is usually causing an FWK crash and initiates a failover.
We do not have an official SK about it as it is in an internal R&D investigation.
Please install the hotfix I have provided you with, as segmentation faults could harm the machine and it is very important to act quickly with this matter."
One of our VS crashed and failed over to the standby VS (still running on R80.20).
@KostasGR yes we tried to reboot.
Are you noticing high CPU load for specific VS?
I ran into a similar issue after a VSX upgrade from R80.20 -> R81.10 but in HA mode.
Our standby member was flapping between Standby and DOWN due to a missing Sync interface, and the virtual router VS was missing its interfaces.
In addition, the virtual router VS was consuming up to 200% CPU, and a lot of interface DRPs and OVRs were visible overall.
We started tuning SNDs and FWKs, but no change resolved the issue.
Thanks to the last support engineer, who found a similar case, we were able to pinpoint it to Priority Queues, and after disabling them the cluster became stable.
It would be worth checking in your case whether the spike detective is printing errors regarding the fwkX_hp process.
It is responsible for the high-priority queue within Priority Queues and clogged the specific VS in our case.
BR,
Markus
@Markus_Genser could you share the commands used to identify the issue and how to kill the priority queues?
Sure,
the spike detective reports to /var/log/messages.
We got the following messages over and over again, especially during peak times with a lot of traffic passing through. Note the fwk1_hp (according to TAC this is the high-priority queue):
Jul 20 15:28:51 2022 <GWNAME> spike_detective: spike info: type: thread, thread id: 3383, thread name: fwk1_hp, start time: 20/07/22 15:28:26, spike duration (sec): 24, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
Jul 20 15:28:57 2022 <GWNAME> spike_detective: spike info: type: cpu, cpu core: 5, top consumer: fwk1_hp, start time: 20/07/22 15:28:26, spike duration (sec): 30, initial cpu usage: 84, average cpu usage: 79, perf taken: 1
Jul 20 15:29:03 2022 <GWNAME> spike_detective: spike info: type: cpu, cpu core: 21, top consumer: fwk1_hp, start time: 20/07/22 15:28:56, spike duration (sec): 6, initial cpu usage: 85, average cpu usage: 85, perf taken: 0
Jul 20 15:29:03 2022 <GWNAME> spike_detective: spike info: type: thread, thread id: 3383, thread name: fwk1_hp, start time: 20/07/22 15:28:56, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
For the virtual router, which normally uses 5-10% CPU, this is not normal.
The rest was good old detective work with top to identify the VS causing the issue (the virtual router in our case).
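For anyone repeating the same detective work, roughly what we looked at - standard Gaia/Linux tooling, so adjust the VS ID to your own suspect VS:
# pull the spike detective history for the fwkX_hp threads
grep spike_detective /var/log/messages* | grep _hp
# per-thread CPU view to spot which fwkX_hp thread is pinned at 100%
top -H
# switch into the suspect VS context before digging further
vsenv 6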
In the VS, $FWDIR/log/fwk.elg confirms that the cluster status is caused by missing CCP packets, with messages like this:
State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
even though the interfaces were up and you could ping the neighbour; we had enough SNDs to handle all the traffic, and overall the rest of the VSs were consuming little CPU, which showed most of the cores idling.
The TAC engineer used one additional command in the remote session, which I failed to note, that showed the CPU usage of the kernel modules in percent; it also displayed fwk1_hp on top. Following all that, he suggested turning off Priority Queues.
To deactivate Priority Queues:
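From memory this is the toggle documented in sk105762 - double-check it against what TAC recommends for your version:
# interactive menu to show and change the Priority Queues mode
fw ctl multik prioq
# choose the disable option when prompted, then reboot the member (needed for the change to apply, as far as I remember)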
BR,
Markus
@Markus_Genser thanks.
My take is that the merger of the R80.X SP code base (Maestro etc.) with the regular R80.X line has brought about many issues on the R81.X platform, just like the move from SecurePlatform to Gaia OS a few years ago. Running on R81.X is like walking on eggshells in production.
Dear Edward
I am sorry to hear that you had a bad experience with R81.10. We are investigating the reason for the VSs being down. As for the network interface firmware, we are moving towards auto-updates in future jumbo and major versions.
In general we get great feedback about R81.10 quality from our partners and customers, and we highly recommend upgrading to this version - not only for management and regular gateways, but also for Maestro environments. In fact, the most widely used Maestro version is now R81.10, with a large number of Maestro customers having successfully upgraded to it.
Thank You
Today we did a fresh-install upgrade of our 2nd gateway and hit another problem.
The VSs on GW2 are complaining that the cluster interface is down and hence CCP packets are not being received. This causes the VSs to fail over in some instances. We are sure the bond interface is fine, as everything was working well in the multi-version cluster (R80.20 + R81.10); this only started with the final upgrade of the 2nd GW.
The sync bond interface within the VS is UP in one direction only, as seen in the output below.
[Expert@CP-GW2:6]# cphaprob -a if
vsid 6:
------
CCP mode: Manual (Unicast)
Required interfaces: 8
Required secured interfaces: 1
Interface Name: Status:
bond0 (S-LS) Inbound: UP
Outbound: DOWN (1245.8 secs)
[Expert@CP-GW2:6]# cphaprob stat
Active PNOTEs: LPRB, IAC
Last member state change event:
Event Code: CLUS-110305
State change: ACTIVE -> ACTIVE(!)
Reason for state change: Interface bond0 is down (Cluster Control Protocol packets are not received)
TAC have tried to analyze interface packet details to no avail. We will try a full reboot later on.
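One quick check to complement the packet analysis: CCP in these versions runs over UDP port 8116, so you can watch on each member whether CCP is actually seen in both directions on the sync bond (run from within the affected VS context):
# watch CCP in both directions on the sync bond; adjust the interface as needed
tcpdump -nni bond0 udp port 8116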
My advice is for customers to stay on R80.40 for VSX, based on the number of issues being faced. I believe that version is fairly mature and has been out there for a long time.
May I suggest that a customer success engineer from the R&D organization work with you to review the environment and help resolve the R81.10 issues that you're having?
Thanks
Gera
I will be glad to work with someone in R&D. What would you need from me?
@Edward_Waithaka please forward your contact details to @Gera_Dorfman via a personal message, or send them to me at vloukine@checkpoint.com, and I will pass it on to Gera and his team.
Done
The CCP probe issue seems to clear on its own sometimes, then comes back later on.
[Expert@CP-GW2:2]# cphaprob stat
Cluster Mode: Virtual System Load Sharing (Primary Up)
ID Unique Address Assigned Load State Name
1 10.10.100.169 0% STANDBY CP-GW-1
2 (local) 10.10.100.170 100% ACTIVE CP-GW-2
Active PNOTEs: None
Last member state change event:
Event Code: CLUS-114904
State change: ACTIVE(!) -> ACTIVE
Reason for state change: Reason for ACTIVE! alert has been resolved
Event time: Wed Jul 27 09:43:58 2022
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Available on member 1
Event time: Thu Jul 21 00:26:52 2022
Cluster failover count:
Failover counter: 1
Time of counter reset: Thu Jul 21 00:17:24 2022 (reboot)
[Expert@CP-GW2:2]# cat $FWDIR/log/fwk.elg | grep '27 Jul 9:43:'
[27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local probing has started on interface: bond0
[27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local probing has started on interface: bond2.xx
[27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local probing has stopped on interface: bond2.xx
[27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local Probing PNOTE OFF
[27 Jul 9:43:58][fw4_0];[vs_2];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Hi,
this behaviour was visible in our issue, depending on the load.
In the meantime I was able to get the command used by the TAC engineer: it was "perf top" (run in VS0).
At the top it was visible that fwk1 was consuming >100%. (The screenshot was just an example.)
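If you want to reproduce that check, it was simply run from the VS0 context:
# switch to VS0, then watch live per-symbol CPU usage; the busy fwk threads show up at the top
vsenv 0
perf top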
We also encountered the described ClusterXL state after finishing vsx_util reconfigure; that was solved with a reboot of the active member.
I would suggest you take Gera's offer for a remote session with R&D.
BR,
Markus
Noted. We have planned a reboot for later tonight while we wait to hear more from R&D.
We managed to link up with the R&D team and they offered great assistance.
It was immediately identified that we had hit an MVC bug. We disabled MVC now that all gateways were on R81.10, pushed policies, and all was good.
MVC was causing an issue with CCP unicast communication: an R81.10 GW with MVC enabled (auto unicast) and an R81.10 GW without MVC (full manual unicast) run different unicast modes for CCP communication.
MVC enabled GW (1st GW to be upgraded)
---------------------------------------
[Expert@CP-GW2:0]# fw ctl get int fwha_default_unicast_mode
fwha_default_unicast_mode = 0
[Expert@CP-GW2:0]#
MVC DISabled GW (ideal state)
------------------------------
[Expert@CP-GW1:0]# fw ctl get int fwha_default_unicast_mode
fwha_default_unicast_mode = 1
[Expert@CP-GW1:0]#
The issue with MVC caused two things: 1) perceived one-way communication in the CCP protocol in some cases, which made the sync interface appear to be down; 2) the standby VS0 gateway was unable to communicate externally, e.g. for IPS updates, TACACS, and online Anti-Bot URL checks (RAD process). This can lead to other issues, especially with the AB blades, if you don't have fail-open.
And yes, if we had gone all the way to the end of the upgrade process (disable MVC and push all policies), we wouldn't have been affected much by the bug. We decided to take precautions and figure out why things were not working as expected before completing the process.
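For completeness, disabling MVC once all members are on R81.10 is done per member and followed by a policy push; the cphaprob syntax below is from memory, so double-check it before running:
# show the current MVC state on this member
cphaprob mvc
# disable MVC (only once every member runs the target version), then install policy
cphaprob mvc off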
R&D will be releasing a GA JHF soon with our bugs fixed, including the private HF we got for the segmentation fault. I don't think we are out of the woods yet, but things are better so far, fingers crossed.
Lessons learnt so far.
I think these are actually listed in the known issues list scheduled for the next Jumbo; but at least your issue is resolved, which is great!
Did you push policies on them as well, after the upgrade?
Yes, we pushed several times.
We just did an R80.40 to R81.10 JHFA66 upgrade (in-place upgrade).
We had an odd issue with one node where some wrp interface just appeared in VS0, and that caused a problem with HA.
We ended up removing VSX from the GW and then running vsx_util reconfigure, which resolved the problem.
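For reference, vsx_util reconfigure is run on the management server (or MDS), not on the gateway; it prompts for the management details and the VSX object to rebuild:
# on the Security Management server / MDS
vsx_util reconfigure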
My main issues since the upgrade are that multicast has stopped working and we have determined that 'fw monitor' causes an fwk process crash (yes, TAC are engaged), which basically causes an outage.
What other surprises are waiting to be discovered? Only time will tell.
By the way, I am using Dynamic Balancing and so far I am not seeing an issue with it; it just works.
All the best. We just hit another issue today!
We had another outage today, so since upgrading to R81.10 JHFA66 we have experienced outages every day.
We contacted TAC and requested that R&D be involved immediately.
This is pretty bad, Check Point, considering it's the recommended release. TAC are investigating, but clearly they need some time to work through the crash logs.
It would be good if I could get some top people at Check Point to be all over my case with the TAC engineer (who is great, b.t.w.!).
@genisis__ try a fresh install upgrade maybe.
Looking like a memory leak bug.