Hello,
I'd appreciate any tips or hints on identifying the traffic that causes cluster fail-over for a few minutes.
We use R81.20 with the latest Take (105) on open servers. Every night between 3am and 4am, high CPU is observed for a few minutes, which triggers fail-over to the standby node, back and forth.
It looks like some sort of scheduled backup, but I can't figure out which IP addresses are involved.
I get the following in /var/log/messages:
Jul 24 03:46:10 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (82%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:46:41 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 03:46:43 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (85%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:46:55 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 03:47:00 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:47:19 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 19 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 03:47:40 2025 GW-N02 kernel:[fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
Jul 24 03:47:41 2025 GW-N02 kernel:[fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Jul 24 03:47:49 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (81%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:48:01 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 04:04:54 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (85%) on the local member increased above the configured threshold (80%).
Jul 24 04:05:58 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 04:06:02 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (92%) on the local member increased above the configured threshold (80%).
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Jul 24 04:06:36 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 18 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Trying to replay it with 'cpview -t' doesn't help much.
At 03:46-03:47 no high CPU is recorded (I think it didn't fit into the 1-minute interval used by 'cpview -t').
At 04:06 I do see high load, which returns to normal 2 minutes later.
I was hoping to get some stats from CPU > Top-Connections, but nothing was logged.
Thank you in advance.
Thank you @G_W_Albrecht !
As per the SK, the tool seems to be running already. I see logs generated under /var/log/spike_detective and an entry that seems to correspond to the time of the issue:
spike info: type: cpu, cpu core: 3, top consumer: unknown, start time: 24/07/25 03:01:48, spike duration (sec): 5, initial cpu usage: 82, average cpu usage: 82, perf taken: 1
I don't see any other helpful information, but perhaps I have to enable top_conns_enable and heavy_conns_enable.
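In the meantime, a minimal sketch for sweeping the spike detective output over the whole 03:40-04:10 window (assuming the logs under /var/log/spike_detective are plain text; the date pattern just mirrors the entry above):
grep -rh "spike info" /var/log/spike_detective/ | grep -E "24/07/25 (03:[45]|04:0)"   # entries between 03:40 and 04:09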
Hopefully tomorrow I will have more data.
Another possibility is to fine-tune CUL: https://support.checkpoint.com/results/sk/sk92723
Will check -- thank you!
It's not happening during policy push or during regular office hours, only at a specific time at night. I tried to correlate it with network load and there is a corresponding spike between the data centers, so it's clearly some backup job...
Let's start with an HCP health check:
Full HCP report - HCP (sk171436) from the active member.
1. # hcp --enable-product Performance
2. To generate a report, you can run "hcp -r all --include-wts yes". After running that command, you can download the following file: /var/log/hcp/last/hcp_last_report.tar.gz.
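For reference, a minimal sketch for pulling that archive off the gateway afterwards (the host name is a placeholder; assumes the admin user's shell permits scp):
scp admin@gw-n01:/var/log/hcp/last/hcp_last_report.tar.gz .   # run from your workstation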
Share anything worth sharing here that might explain the issue, like coredumps, rx errors, etc.
Hello @Lesley
Everything is green except for one error (interface errors) and two warnings:
Warning 1:
Description
Drop templates are enabled but SecureXL drops show 0 packets dropped due to templates
Suggested Solution
Solution ##
Consult with TAC for further assistance
Warning 2:
Inspecting Dmesg output
2025-07-24 10:53:59 GW-N02 kernel:[fw4_0];cptls_rl_decrypt: cptls_rl_decrypt_ex failed on line: 532. version = 0x0303, read_state: ffffc90150ad1bb8, write_state: (null). states:
Description
Detected dmesg entries, from the last 1 hour/s, which require further inspection
Suggested Solution
Solution #1
Search for the specific error in existing SKs (and follow the suggested comments/solutions)
Concerning the interface errors: we can't see corresponding CRC errors on the switch side (neither on the bond nor on the individual interfaces).
Time to check the RX buffers. Follow this:
https://support.checkpoint.com/results/sk/sk42181
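For anyone following along, a minimal sketch of the related checks from expert mode (eth2-01 is a placeholder interface name; ethtool counter names vary per driver):
netstat -ni                                       # per-interface RX-DRP / RX-ERR counters
ethtool -g eth2-01                                # current vs. maximum RX ring buffer size
ethtool -S eth2-01 | grep -iE 'drop|miss|error'   # driver-level drop/miss/error counters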
I recall troubleshooting this when we first noticed those errors after the migration from R77.30, but we never came anywhere near the 0.1% threshold from Timothy_Hall's post (I'm sorry, I can't find it anymore), hence the default RX ring size of 512 is used on all 10G interfaces.
He also recommended avoiding the ixgbe driver for terminating VPN tunnels, which is our case (not a lot of tunnels, though).
Either way, we don't really experience any performance or latency issues; I just need to find the backup job that causes the cluster failover at 3am. 🙂
Of course, any advice from legend @Timothy_Hall would be golden!
Andy
The CUL messages are merely a symptom of the problem, not the issue itself. The CUL state only matters if the active member suffers a catastrophic failover while CUL is active; then that failover will take 8-9 seconds instead of 2-3.
fw ctl multik print_heavy_conn only shows elephant flows over the last 24 hours. So, if you ran it within the proper window of time and it still shows nothing, that would suggest a large number of smaller connections saturating all CPUs simultaneously, with none of them standing out as a "spike" or elephant. This also explains why you aren't seeing a bunch of spikes logged by the spike detective. It could even be a burst of new accepted connections from a network monitoring system doing aggressive discovery at that time; check the new connections rate in cpview's history mode for that period and see if it jumps significantly, as that can also drive up the CPUs.
What I would suggest is writing a script that runs the top_conns command (sk172229: Top Connections Tool) every 30 seconds and writes the results to a file. Leave it running overnight, and that should help you figure out what is going on, as I assume the regular logs are not providing helpful information. My guess is that this burst of connections is traversing the slowpath through the gateway, which can easily saturate the CPUs. fw tab -t connections -z can help you diagnose why these connections are stuck in the slowpath. See my Be Your Own TAC Part Deux presentation for all the slowpath reasons. If the connections are not stuck F2F/slowpath, this would be a textbook case for using fast_accel for this spiky traffic, assuming the two communicating systems are reasonably trusted. sk156672 - SecureXL Fast Accelerator (fw fast_accel)
The RX-DRPs might be related, or they might just be junk traffic the firewall cannot process anyway (IPv6, etc). We need to understand the properties of these spiky connections before examining these drops/misses.
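If the flows turn out not to be stuck in F2F, a rough sketch of what the fast_accel route from sk156672 could look like (the addresses, port and protocol are placeholders, and the exact syntax should be verified against the SK before use):
fw ctl fast_accel enable                                 # enable the feature if it is not already on
fw ctl fast_accel add 10.1.1.10/32 10.2.2.20/32 22 6     # source, destination, destination port, IP protocol (6 = TCP)
fw ctl fast_accel show_table                             # confirm the rule is in place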
Thank you @Timothy_Hall ! I think using top_conns is the right direction.
Stupid question 🙂 -- I can't seem to find a way to schedule a cron job to run every XX seconds using clish since it only accepts hh:mm:
Schedule jobs to run on specified time in day. Please specify time in a format <hours>:<minutes>.
Shall I try placing a script via 'crontab -e' with */30? (Though that would be every 30 minutes, since cron doesn't go below one-minute granularity.) And the file explicitly says not to edit it directly.
Find the answer here: sk77300: How to create a scheduled job (cron job) in Gaia with frequency of less than a day
Just do this from expert mode and leave it running overnight (it will run every 30 seconds for 8 hours; increase the 960 value to run longer):
TMOUT=0; export TMOUT
for run in {1..960}; do date >>outputfile; top_conns >>outputfile; sleep 30; done
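Once the window has passed, something along these lines can pull out just the snapshots around the failover time (the time pattern and line numbers are examples; adjust them to whatever 'date' printed):
grep -nE '03:4[0-9]:|04:0[0-9]:' outputfile   # locate the snapshots taken between 03:40 and 04:09
sed -n '1234,1400p' outputfile                # then print that slice of the file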
Thank you very much @Timothy_Hall !
There were some good suggestions in the post below.
Also, can you send the output of the command below:
fw ctl multik print_heavy_conn
Andy
Something else to check; see what it gives you. You can run the commands below any time from clish.
Andy
CP-MANAGEMENT> show back
backup - Show the status of the latest backup/restore
backup-scheduled - Show the scheduling of backup defined in the system
backups - List of local backups
CP-MANAGEMENT> show backup-sch
CP-MANAGEMENT> show backup-scheduled
CP-MANAGEMENT> show backup-scheduled
Backup of the system configuration.
CP-MANAGEMENT> show backup-scheduled
CLINFR0349 Incomplete command.
CP-MANAGEMENT> show backup-scheduled jovs
NMSCRN0083 Invalid scheduled-backup name.
CP-MANAGEMENT> show cron
CP-MANAGEMENT> show cron
job - Show the scheduling information of the specific job
jobs - Show all scheduled jobs in the system
mailto - Show email address for sending jobs' results
CP-MANAGEMENT> show cron jobs
CP-MANAGEMENT>
I think he means backup traffic from other network components that goes through the firewall and is inspected 😉
But it is worth checking whether the scheduled Gaia backup matches the CPU spikes, of course.
If you know it is backup traffic, then focus on the following topics: HyperFlow, Dynamic Balancing and fast_accel.
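A minimal sketch for checking where the gateway currently stands on those (the dynamic_balancing command assumes a release where Dynamic Balancing is available; verify the exact option in the relevant SK):
fwaccel stat            # SecureXL status and accept templates
fw ctl multik stat      # CoreXL instances and their CPU assignment
dynamic_balancing -p    # prints whether Dynamic Balancing is currently enabled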
Definitely all worth checking 🙂
Yep, exactly that. 🙂
The backup of the gateway itself is scheduled at 2am, lasts ~30 seconds and has no impact on performance. This is some task happening over the network that traverses the firewall.
Hello @the_rock
'fw ctl multik print_heavy_conn' returns nothing.
Since I like to think about anything in life logically: the more I think about this, the more I'm convinced there must be some sort of backup or cron job going on in the middle of the night causing this. Just my 2 cents.
Andy