Create a Post
cancel
Showing results for 
Search instead for 
Did you mean: 
Teddy_Brewski
Advisor

Help needed: Load analysis

Hello,

I'd appreciate any tips or hints in finding out the traffic that causes cluster fail-over for some minutes.

We use R81.20 with the latest Take (105) on open servers. Every night, between 3am and 4am, there is high CPU observed for some minutes which triggers fail-over to the standby node, back and forth.

It looks like some sort of scheduled backup happening, but I can't figure out IP addresses involved.

I get the following in /var/log/messages:

Jul 24 03:46:10 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (82%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:46:41 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 03:46:43 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (85%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:46:55 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 03:47:00 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:47:19 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 19 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 03:47:40 2025 GW-N02 kernel:[fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
Jul 24 03:47:41 2025 GW-N02 kernel:[fw4_1];CLUS-114904-1: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Jul 24 03:47:49 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (81%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:48:01 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 04:04:54 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (85%) on the local member increased above the configured threshold (80%).
Jul 24 04:05:58 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 04:06:02 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (92%) on the local member increased above the configured threshold (80%).
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Jul 24 04:06:36 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 18 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

 

Trying to replay with 'cpview -t' doesn't help much.

At 03:46-03:47 no high CPU is recorded (I think it didn't fit into the 1 minute interval used with 'cpview -t'):

20250724-fw-node2-001xxx.PNG

At 04:06, I do see high load which goes to normal 2 minutes after:

20250724-fw-node2-002xxx.PNG

20250724-fw-node2-003xxx.PNG

20250724-fw-node2-004xxx.PNG

I was hoping to get some stats from CPU > Top-Connections, but nothing was logged:

20250724-fw-node2-005xxx.PNG

Thank you in advance.

0 Kudos
21 Replies
G_W_Albrecht
Legend Legend
Legend

Teddy_Brewski
Advisor

Thank you @G_W_Albrecht !

As per SK the tool seems to be already running.  I see logs generated under /var/log/spike_detective and an entry that seems to correspond to the time of the issue:

spike info: type: cpu, cpu core: 3, top consumer: unknown, start time: 24/07/25 03:01:48, spike duration (sec): 5, initial cpu usage: 82, average cpu usage: 82, perf taken: 1

 

I don't see any other helpful information, but perhaps I have to enable top_conns_enable and heavy_conns_enable.

Hopefully tomorrow I will have more data.

 

0 Kudos
G_W_Albrecht
Legend Legend
Legend

Other possibility is to fine-tune CUL https://support.checkpoint.com/results/sk/sk92723

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist
Teddy_Brewski
Advisor

Will check -- thank you!

It's not happening during the policy push nor during regular office hours, but only during a specific time at night. I tried to correlate it with the network load and there is a corresponding spike between data centers, so it's clearly some backup job...

0 Kudos
Lesley
Authority Authority
Authority

Let's start with a HCP health check:

Full HCP report - HCP (sk171436) from active member.
         1. # hcp --enable-product Performance
         2. To generate a report, you can run "hcp -r all --include-wts yes". After running that command, you can download the following file: /var/log/hcp/last/hcp_last_report.tar.gz.

Share anything worth sharing here that might explain the issue. Like coredumps, rx errors etc

-------
If you like this post please give a thumbs up(kudo)! 🙂
Teddy_Brewski
Advisor

Hello @Lesley 

All are green with one error (interface errors) and two warnings:

20250724-fw-node2-006xxx.PNG

Warning 1:

Description

Drop templates are enabled but SecureXL drops show 0 packets dropped due to templates

Suggested Solution

Solution ##

Consult with TAC for further assistance

 

Warning 2:

Dmesg analysis

Description

Inspecting Dmesg output

2025-07-24 10:53:59 GW-N02 kernel:[fw4_0];cptls_rl_decrypt: cptls_rl_decrypt_ex failed on line: 532. version = 0x0303, read_state: ffffc90150ad1bb8, write_state: (null). states:

 

Description

Detected dmesg entries, from the last 1 hour/s, which require further inspection

Suggested Solution

Solution #1

Search for the specific error in existing SKs (and follow the suggested comments/solutions)

 

Concerning interface errors: we can't see corresponding CRCs on the switch side (both for bond and individual interfaces).

0 Kudos
Lesley
Authority Authority
Authority

Time to check the rx buffers follow this

https://support.checkpoint.com/results/sk/sk42181

-------
If you like this post please give a thumbs up(kudo)! 🙂
Teddy_Brewski
Advisor

I recall troubleshooting it when we started noticing those after the migration from R77.30, but we were never nearly close to 0.1% as per the post from Timothy_Hall (I'm sorry, I can't find it anymore), hence the default rx value is used for all 10G interfaces which is 512.

He also recommended to avoid using ixgbe drivers for terminating VPN tunnels which is our case (not a lot of tunnels though).

Either way, we don't really experience any performance or latency issues, I just need to find the backup job that causes cluster failover at 3am. 🙂

the_rock
Legend
Legend

Of course, any advice from legend @Timothy_Hall  would be golden!

Andy

0 Kudos
Timothy_Hall
Legend Legend
Legend

The CUL messages are merely a symptom of the problem, not the issue itself.  The CUL state only matters if the active member suffers a catastrophic failover while CUL is active; then that failover will take 8-9 seconds instead of 2-3.

fw ctl multik print_heavy_conn only shows elephant flows over the last 24 hours.  So, if you ran it within the proper window of time and it still shows nothing, that would suggest a large number of smaller connections saturating all CPUs simultaneously, and none of them stand out as a "spike" or elephant. This also explains why you aren't seeing a bunch of spikes logged by the spike detective.  It could even be a burst of new accepted connections from a network monitoring system doing aggressive discovery at that time, check the new connections rate in cpview's history mode for that period, and see if it jumps significantly as that can also drive up the CPUs.

What I would suggest is writing a script that runs the command top_conns every 30 seconds and writes the results to a file.  (sk172229: Top Connections Tool)  Leave it running overnight, and that should help you figure out what is going on, as I assume the regular logs are not providing helpful information.  My guess is that this burst of connections is traversing the slowpath through the gateway, which can easily saturate the CPUs.  fw tab -t connections -z can help you diagnose why these connections are stuck in the slowpath.  See my Be Your Own TAC Part Deux presentation for all the slowpath reasons.  If the connections are not stuck F2F/slowpath, this would be a textbook case for using fast_accel for this spiky traffic, assuming the two communicating systems are reasonably trusted.  sk156672 - SecureXL Fast Accelerator (fw fast_accel)

The RX-DRPs might be related, or they might just be junk traffic the firewall cannot process anyway (IPv6, etc).  We need to understand the properties of these spiky connections before examining these drops/misses.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
Teddy_Brewski
Advisor

Thank you @Timothy_Hall ! I think using top_conns is the right direction.

Stupid question 🙂 -- I can't seem to find a way to schedule a cron job to run every XX seconds using clish since it only accepts hh:mm:

Schedule jobs to run on specified time in day. Please specify time in a format <hours>:<minutes>.

 

Shall I try to place a script using 'crontab -e' and put */30 (every 30 seconds)?  But then it explicitly says not to edit the file directly.

0 Kudos
G_W_Albrecht
Legend Legend
Legend

Find the answer here: sk77300: How to create a scheduled job (cron job) in Gaia with frequency of less than a day

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist
0 Kudos
Timothy_Hall
Legend Legend
Legend

Just do this from expert mode and leave it running overnight (it will run every 30 seconds for 8 hours , increase the 960 value for longer):

TMOUT=0; export TMOUT

for run in {1..960}; do date >>outputfile; top_conns >>outputfile; sleep 30; done

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
Teddy_Brewski
Advisor

Thank you very much @Timothy_Hall !

0 Kudos
the_rock
Legend
Legend

There were some good suggestions in below post

https://community.checkpoint.com/t5/Security-Gateways/IN-VAR-LOG-messages-keep-getting-Stopping-CUL-...

Also, can you send output of below:

fw ctl multik print_heavy_conn

Andy

the_rock
Legend
Legend

@Teddy_Brewski 

Something else to check, see what it gives you, you can do below any time from clish.

Andy

CP-MANAGEMENT> show back
backup - Show the status of the latest backup/restore
backup-scheduled - Show the scheduling of backup defined in the system
backups - List of local backups
CP-MANAGEMENT> show backup-sch
CP-MANAGEMENT> show backup-scheduled
CP-MANAGEMENT> show backup-scheduled
Backup of the system configuration.
CP-MANAGEMENT> show backup-scheduled
CLINFR0349 Incomplete command.
CP-MANAGEMENT> show backup-scheduled jovs
NMSCRN0083 Invalid scheduled-backup name.
CP-MANAGEMENT> show cron
CP-MANAGEMENT> show cron
job - Show the scheduling information of the specific job
jobs - Show all scheduled jobs in the system
mailto - Show email address for sending jobs' results
CP-MANAGEMENT> show cron jobs
CP-MANAGEMENT>

Lesley
Authority Authority
Authority

I think he means backup traffic from other network components going via the firewall and are inspected 😉 

But it is worth to check if scheduled gaia backup matches CPU spikes of course. 

If you know it is backup traffic then focus on following topics: hyper flow, dynamic balancing and fast accel. 

-------
If you like this post please give a thumbs up(kudo)! 🙂
the_rock
Legend
Legend

Definitely all worth checking 🙂

0 Kudos
Teddy_Brewski
Advisor

Yep, exactly that. 🙂

The backup of the gateway itself is configured at 2am, lasts ~30 seconds and has no impact on the performance. This is some task that is happening over the network and it's traversing via the firewall.

 

0 Kudos
Teddy_Brewski
Advisor

Hello @the_rock 

'fw ctl multik print_heavy_conn' returns nothing.

0 Kudos
the_rock
Legend
Legend

Since I like to think about anything in life logically, more I think about this, more Im convinced there must be some sort of backup or cron job going on in the middle of the night causing this. Just my 2 cents.

Andy

0 Kudos

Leaderboard

Epsum factorial non deposit quid pro quo hic escorol.

Upcoming Events

    CheckMates Events