Re: Help needed: Load analysis

Teddy_Brewski · ‎2025-07-24

Hello,

I'd appreciate any tips or hints in finding out the traffic that causes cluster fail-over for some minutes.

We use R81.20 with the latest Take (105) on open servers. Every night, between 3am and 4am, there is high CPU observed for some minutes which triggers fail-over to the standby node, back and forth.

It looks like some sort of scheduled backup happening, but I can't figure out IP addresses involved.

I get the following in /var/log/messages:

Jul 24 03:46:10 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (82%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:46:41 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 03:46:43 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (85%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:46:55 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 03:47:00 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:47:19 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 19 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 03:47:40 2025 GW-N02 kernel:[fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
Jul 24 03:47:41 2025 GW-N02 kernel:[fw4_1];CLUS-114904-1: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Jul 24 03:47:49 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (81%) on the remote member 2 increased above the configured threshold (80%). 
Jul 24 03:48:01 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 04:04:54 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (85%) on the local member increased above the configured threshold (80%).
Jul 24 04:05:58 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
Jul 24 04:06:02 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (92%) on the local member increased above the configured threshold (80%).
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Jul 24 04:06:36 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 18 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

Trying to replay with 'cpview -t' doesn't help much.

At 03:46-03:47 no high CPU is recorded (I think it didn't fit into the 1 minute interval used with 'cpview -t'):

At 04:06, I do see high load which goes to normal 2 minutes after:

I was hoping to get some stats from CPU > Top-Connections, but nothing was logged:

Thank you in advance.

G_W_Albrecht · ‎2025-07-24

https://sc1.checkpoint.com/documents/R81.20/WebAdminGuides/EN/CP_R81.20_PerformanceTuning_AdminGuide...

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Teddy_Brewski · ‎2025-07-24

Thank you @G_W_Albrecht !

As per SK the tool seems to be already running. I see logs generated under /var/log/spike_detective and an entry that seems to correspond to the time of the issue:

spike info: type: cpu, cpu core: 3, top consumer: unknown, start time: 24/07/25 03:01:48, spike duration (sec): 5, initial cpu usage: 82, average cpu usage: 82, perf taken: 1

I don't see any other helpful information, but perhaps I have to enable top_conns_enable and heavy_conns_enable.

Hopefully tomorrow I will have more data.

G_W_Albrecht · ‎2025-07-24

Other possibility is to fine-tune CUL https://support.checkpoint.com/results/sk/sk92723

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Teddy_Brewski · ‎2025-07-24

Will check -- thank you!

It's not happening during the policy push nor during regular office hours, but only during a specific time at night. I tried to correlate it with the network load and there is a corresponding spike between data centers, so it's clearly some backup job...

Lesley · ‎2025-07-24

Let's start with a HCP health check:

Full HCP report - HCP (sk171436) from active member.
1. # hcp --enable-product Performance
2. To generate a report, you can run "hcp -r all --include-wts yes". After running that command, you can download the following file: /var/log/hcp/last/hcp_last_report.tar.gz.

Share anything worth sharing here that might explain the issue. Like coredumps, rx errors etc

-------
Please press "Accept as Solution" if my post solved it 🙂

Teddy_Brewski · ‎2025-07-24

Hello @Lesley

All are green with one error (interface errors) and two warnings:

Warning 1:

Description

Drop templates are enabled but SecureXL drops show 0 packets dropped due to templates

Suggested Solution

Solution ##

Consult with TAC for further assistance

Warning 2:

Dmesg analysis

Description

Inspecting Dmesg output

2025-07-24 10:53:59 GW-N02 kernel:[fw4_0];cptls_rl_decrypt: cptls_rl_decrypt_ex failed on line: 532. version = 0x0303, read_state: ffffc90150ad1bb8, write_state: (null). states:

Description

Detected dmesg entries, from the last 1 hour/s, which require further inspection

Suggested Solution

Solution #1

Search for the specific error in existing SKs (and follow the suggested comments/solutions)

Concerning interface errors: we can't see corresponding CRCs on the switch side (both for bond and individual interfaces).

Lesley · ‎2025-07-24

Time to check the rx buffers follow this

https://support.checkpoint.com/results/sk/sk42181

-------
Please press "Accept as Solution" if my post solved it 🙂

Teddy_Brewski · ‎2025-07-24

I recall troubleshooting it when we started noticing those after the migration from R77.30, but we were never nearly close to 0.1% as per the post from Timothy_Hall (I'm sorry, I can't find it anymore), hence the default rx value is used for all 10G interfaces which is 512.

He also recommended to avoid using ixgbe drivers for terminating VPN tunnels which is our case (not a lot of tunnels though).

Either way, we don't really experience any performance or latency issues, I just need to find the backup job that causes cluster failover at 3am. 🙂

the_rock · ‎2025-07-24

Of course, any advice from legend @Timothy_Hall would be golden!

Andy

Best,
Andy

Timothy_Hall · ‎2025-07-25

The CUL messages are merely a symptom of the problem, not the issue itself. The CUL state only matters if the active member suffers a catastrophic failover while CUL is active; then that failover will take 8-9 seconds instead of 2-3.

fw ctl multik print_heavy_conn only shows elephant flows over the last 24 hours. So, if you ran it within the proper window of time and it still shows nothing, that would suggest a large number of smaller connections saturating all CPUs simultaneously, and none of them stand out as a "spike" or elephant. This also explains why you aren't seeing a bunch of spikes logged by the spike detective. It could even be a burst of new accepted connections from a network monitoring system doing aggressive discovery at that time, check the new connections rate in cpview's history mode for that period, and see if it jumps significantly as that can also drive up the CPUs.

What I would suggest is writing a script that runs the command top_conns every 30 seconds and writes the results to a file. (sk172229: Top Connections Tool) Leave it running overnight, and that should help you figure out what is going on, as I assume the regular logs are not providing helpful information. My guess is that this burst of connections is traversing the slowpath through the gateway, which can easily saturate the CPUs. fw tab -t connections -z can help you diagnose why these connections are stuck in the slowpath. See my Be Your Own TAC Part Deux presentation for all the slowpath reasons. If the connections are not stuck F2F/slowpath, this would be a textbook case for using fast_accel for this spiky traffic, assuming the two communicating systems are reasonably trusted. sk156672 - SecureXL Fast Accelerator (fw fast_accel)

The RX-DRPs might be related, or they might just be junk traffic the firewall cannot process anyway (IPv6, etc). We need to understand the properties of these spiky connections before examining these drops/misses.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Teddy_Brewski · ‎2025-07-28

Thank you @Timothy_Hall ! I think using top_conns is the right direction.

Stupid question 🙂 -- I can't seem to find a way to schedule a cron job to run every XX seconds using clish since it only accepts hh:mm:

Schedule jobs to run on specified time in day. Please specify time in a format <hours>:<minutes>.

Shall I try to place a script using 'crontab -e' and put */30 (every 30 seconds)? But then it explicitly says not to edit the file directly.

G_W_Albrecht · ‎2025-07-28

Find the answer here: sk77300: How to create a scheduled job (cron job) in Gaia with frequency of less than a day

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Timothy_Hall · ‎2025-07-28

Just do this from expert mode and leave it running overnight (it will run every 30 seconds for 8 hours , increase the 960 value for longer):

TMOUT=0; export TMOUT

for run in {1..960}; do date >>outputfile; top_conns >>outputfile; sleep 30; done

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Teddy_Brewski · ‎2025-07-28

Thank you very much @Timothy_Hall !

the_rock · ‎2025-07-24

There were some good suggestions in below post

https://community.checkpoint.com/t5/Security-Gateways/IN-VAR-LOG-messages-keep-getting-Stopping-CUL-...

Also, can you send output of below:

fw ctl multik print_heavy_conn

Andy

Best,
Andy

the_rock · ‎2025-07-24

@Teddy_Brewski

Something else to check, see what it gives you, you can do below any time from clish.

Andy

CP-MANAGEMENT> show back
backup - Show the status of the latest backup/restore
backup-scheduled - Show the scheduling of backup defined in the system
backups - List of local backups
CP-MANAGEMENT> show backup-sch
CP-MANAGEMENT> show backup-scheduled
CP-MANAGEMENT> show backup-scheduled
Backup of the system configuration.
CP-MANAGEMENT> show backup-scheduled
CLINFR0349 Incomplete command.
CP-MANAGEMENT> show backup-scheduled jovs
NMSCRN0083 Invalid scheduled-backup name.
CP-MANAGEMENT> show cron
CP-MANAGEMENT> show cron
job - Show the scheduling information of the specific job
jobs - Show all scheduled jobs in the system
mailto - Show email address for sending jobs' results
CP-MANAGEMENT> show cron jobs
CP-MANAGEMENT>

Best,
Andy

Lesley · ‎2025-07-24

I think he means backup traffic from other network components going via the firewall and are inspected 😉

But it is worth to check if scheduled gaia backup matches CPU spikes of course.

If you know it is backup traffic then focus on following topics: hyper flow, dynamic balancing and fast accel.

-------
Please press "Accept as Solution" if my post solved it 🙂

the_rock · ‎2025-07-24

Definitely all worth checking 🙂

Best,
Andy

Teddy_Brewski · ‎2025-07-24

Yep, exactly that. 🙂

The backup of the gateway itself is configured at 2am, lasts ~30 seconds and has no impact on the performance. This is some task that is happening over the network and it's traversing via the firewall.

Teddy_Brewski · ‎2025-07-24

Hello @the_rock

'fw ctl multik print_heavy_conn' returns nothing.

the_rock · ‎2025-07-24

Since I like to think about anything in life logically, more I think about this, more Im convinced there must be some sort of backup or cron job going on in the middle of the night causing this. Just my 2 cents.

Andy

Best,
Andy

Are you a member of CheckMates?

Help needed: Load analysis

Dmesg analysis

Description