Hi everyone,
At the moment I have an ongoing issue with a customer. The symptoms are as follows:
High CPU load:
top - 12:50:06 up 9:41, 3 users, load average: 24.71, 12.38, 6.78
Tasks: 347 total, 30 running, 317 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 81.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 15.4 si, 0.0 st
KiB Mem : 98087944 total, 69572780 free, 15522848 used, 12992316 buff/cache
KiB Swap: 67108860 total, 67108860 free, 0 used. 81122528 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20460 admin 20 0 216456 103184 25532 R 64.4 0.1 32:58.43 rad
11626 admin 20 0 0 0 0 R 55.8 0.0 72:38.48 fw_worker_7
11628 admin 20 0 0 0 0 R 54.6 0.0 70:53.74 fw_worker_9
11623 admin 20 0 0 0 0 R 49.5 0.0 74:40.62 fw_worker_4
11624 admin 20 0 0 0 0 R 44.8 0.0 71:14.27 fw_worker_5
11627 admin 20 0 0 0 0 R 41.3 0.0 71:24.60 fw_worker_8
11625 admin 20 0 0 0 0 R 40.7 0.0 70:17.07 fw_worker_6
11622 admin 20 0 0 0 0 R 38.5 0.0 72:11.25 fw_worker_3
11619 admin 20 0 0 0 0 R 37.5 0.0 74:28.74 fw_worker_0
11620 admin 20 0 0 0 0 R 37.2 0.0 73:07.65 fw_worker_1
19952 admin 20 0 940080 353524 49376 R 31.5 0.4 110:26.24 fw_full
We can see the load slowly increase on the workers, and later on the RAD daemon as well.
RAD shows no errors in the rad directory or in SmartConsole, and the CPU spike log is empty.
The only fix right now is to fail over to the other member, after which the cycle starts again. At the moment I have to fail over every 10-15 minutes.
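For reference, the failover workaround is just the standard ClusterXL commands on the problematic member (a sketch; follow your own change process):
# On the currently active (problematic) member: force it to standby
clusterXL_admin down
# Verify the cluster state and confirm the peer took over
cphaprob stat
# Once things calm down, bring this member back into the cluster
clusterXL_admin up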
A TAC case is in progress as we speak, but I wanted to reach out to the community for a second check and maybe some ideas.
We also upgraded the setup yesterday from R81.20 take 113 to take 115 with no improvement, and disabling blades (AV, IPS, etc.) gave the same result.
Please provide the output of the following commands, ideally taken while the issue is occurring, prior to failing over:
fwaccel stat
fwaccel stats -s
enabled_blades
netstat -ni
Any chance you are on a Quantum Force 3900/9XXX/19XXX/29XXX or Lightspeed appliance? UPPAK is in play there.
If URLF is enabled, this could be the URL categorization cache thrashing because you have far more than 1,000 surfing users behind the firewall. This cache is not synced between cluster members, so a failover would fix the issue temporarily. It could also be the AV anti-malware cache thrashing, but it sounds like you tried turning off AV, and the issue persisted.
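If you want a quick, low-impact way to gauge this, you could list the URL Filtering related kernel tables and their entry counts (a sketch; the exact table names vary by version, so this simply greps the table summary for anything URLF-related):
# List all kernel tables with their current/peak entry counts and keep
# only the URL Filtering related ones
fw tab -s | grep -i urlf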
The only other thing a failover does, by default, is dump all connections out of the Medium Path into the fast path, which would significantly reduce the CPU load on your firewall worker instances temporarily. This failover effect was discussed in my CPX presentation, which you may want to review.
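To actually see that effect, one could sample the path distribution before and right after a failover using only commands already mentioned in this thread (assuming watch is available on your build):
# Sample the SecureXL path distribution every 10 seconds; per the explanation above,
# the PSLXL (Medium Path) share should drop sharply right after a failover
watch -n 10 fwaccel stats -s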
Hi,
Thank you for the reply. This gateway runs kernel mode (open server).
AV is indeed off, and I suspect URL Filtering as you say. The info is below 🙂
fwaccel stat
+---------------------------------------------------------------------------------+
|Id|Name |Status |Interfaces |Features |
+---------------------------------------------------------------------------------+
|0 |KPPAK |enabled |eth,eth,et,eth,eth,|Acceleration,Cryptography |
| | | |eth,eth,eth,eth | |
| | | | |Crypto: Tunnel,UDPEncap,MD5, |
| | | | |SHA1,3DES,DES,AES-128,AES-256,|
| | | | |ESP,LinkSelection,DynamicVPN, |
| | | | |NatTraversal,AES-XCBC,SHA256, |
| | | | |SHA384,SHA512 |
+---------------------------------------------------------------------------------+
Accept Templates : enabled
Drop Templates : enabled
NAT Templates : enabled
LightSpeed Accel : disabled
fwaccel stats -s
Accelerated conns/Total conns : 56744/159331 (35%)
LightSpeed conns/Total conns : 0/159331 (0%)
Accelerated pkts/Total pkts : 7073015862/7843049171 (90%)
LightSpeed pkts/Total pkts : 0/7843049171 (0%)
F2Fed pkts/Total pkts : 770033309/7843049171 (9%)
F2V pkts/Total pkts : 46691949/7843049171 (0%)
CPASXL pkts/Total pkts : 0/7843049171 (0%)
PSLXL pkts/Total pkts : 4516731459/7843049171 (57%)
CPAS pipeline pkts/Total pkts : 0/7843049171 (0%)
PSL pipeline pkts/Total pkts : 0/7843049171 (0%)
QOS inbound pkts/Total pkts : 0/7843049171 (0%)
QOS outbound pkts/Total pkts : 0/7843049171 (0%)
Corrected pkts/Total pkts : 0/7843049171 (0%)
enabled_blades
fw urlf appi SSL_INSPECT anti_bot mon
All those outputs look OK. I'm pretty sure this is a cache-thrash issue caused by AB and/or URLF; see the last two paragraphs on the second page, which are quoted from the most recent edition of my Gateway Performance Course:
Apart from what Tim asked for, maybe send us the output of the commands below as well.
Andy
************
fw tab -t connections -s
fw ctl multik print_heavy_conn
Hi,
In the URL-filtering-blade-RAD-process-causing-high-CPU thread we discussed a few RAD issues.
In our case of high CPU on RAD we had to disable the RAD autodebug option per sk182859
Cheers!
Good call @D_W
Hi,
Thanks for the tip, autodebug is already disabled 🙂
We see loads of the following RAD error:
FlowError=RAD request exceeded maximum handing time
On the other hand, the CPU issue is still there even when there are currently 0 RAD errors. They had been gone all day and then just popped up again:
grep "FlowError=" $FWDIR/log/rad_events/Errors/* | grep -oP '(?<=FlowError=).*' | sort | uniq -c | sort -nr
483 RAD request exceeded maximum handing time
15 Failed to fetch Check Point resources. Timeout was reached
14 Failed to fetch Check Point resources. Couldn't resolve host name
1 Failed to fetch Check Point resources. Couldn't connect to server
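To check whether these errors actually line up with the CPU spikes, a simple loop like this could timestamp the error count every minute (plain shell, same log path as above):
# Print a timestamped total of RAD flow errors every 60 seconds so it can be
# correlated with the CPU graphs afterwards
while true; do
    echo "$(date '+%H:%M:%S')  $(grep -o 'FlowError=' $FWDIR/log/rad_events/Errors/* 2>/dev/null | wc -l) errors"
    sleep 60
done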
That error is about connectivity issues. What is using CPU now, RAD or the FW workers?
Hi Val,
First we see increased load on the fw_workers; shortly after, RAD joins in with high load as well.
RAD errors have been clear most of the day today; we experienced high load without any RAD errors in the relevant folder.
This is how it looks mid-issue. The customer notices problems around a load average of 25:
top - 17:36:03 up 14:27, 4 users, load average: 10.60, 7.64, 7.45
Tasks: 350 total, 19 running, 331 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.0 us, 78.1 sy, 0.0 ni, 11.4 id, 0.0 wa, 0.5 hi, 6.0 si, 0.0 st
KiB Mem : 98087944 total, 49163212 free, 15571648 used, 33353084 buff/cache
KiB Swap: 67108860 total, 67108860 free, 0 used. 81177276 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19952 admin 20 0 962644 374820 49380 R 86.5 0.4 200:43.26 fw_full
11622 admin 20 0 0 0 0 R 69.4 0.0 133:56.11 fw_worker_3
11619 admin 20 0 0 0 0 R 62.3 0.0 137:19.11 fw_worker_0
11625 admin 20 0 0 0 0 R 61.6 0.0 130:51.28 fw_worker_6
11624 admin 20 0 0 0 0 R 60.0 0.0 133:16.94 fw_worker_5
11626 admin 20 0 0 0 0 R 57.4 0.0 134:03.56 fw_worker_7
11623 admin 20 0 0 0 0 R 57.1 0.0 136:49.37 fw_worker_4
11621 admin 20 0 0 0 0 R 56.5 0.0 133:35.66 fw_worker_2
11628 admin 20 0 0 0 0 R 54.2 0.0 131:37.48 fw_worker_9
11627 admin 20 0 0 0 0 R 53.5 0.0 132:21.71 fw_worker_8
20460 admin 20 0 220916 111744 28284 R 53.2 0.1 60:11.37 rad
11620 admin 20 0 0 0 0 R 44.5 0.0 134:50.44 fw_worker_1
Do you use external DNS servers like 9.9.9.9? They will eventually block the requests due to too many requests per minute.
Or maybe you hit a limit in
$FWDIR/conf/rad_conf.C:
:max_flows (1000)
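You can check the currently configured value directly (assuming the default location and parameter name quoted above):
# Show the max_flows setting currently configured for RAD
grep -i max_flows $FWDIR/conf/rad_conf.C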
The CSV file did not show max flows being reached today, only this morning. TAC noticed that we hit the cap, but it was not necessary to increase the max_flows value. The firewalls connect to an internal Infoblox server; what it forwards to after that I'm not sure, but I could ask if it's important 🙂
Update: we suspect the customer was under attack. I noticed the following logs:
SYN Defender: activated <interface>. Number of not established connections is 5017
Above 5,000 non-established connections, SYN Defender kicks in and does the following (copied from the SK):
When the Gateway decides that a server is under attack, it switches to SYN Relay Defense. SYN Relay counters the attack by making sure that the three-way handshake is complete before sending a SYN packet to the connection's destination.
Even if the destination server is not listening on that port, the Gateway will respond with a SYN-ACK to make sure that the client completes the three-way handshake with an ACK; it does this to determine the legitimacy of the connection. After the Gateway has determined that the connection is legitimate, it forwards the packet to the firewall layer and eventually to the destination server
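To see how often SYN Defender is actually activating, the log line quoted above can simply be counted in the system log (a sketch assuming these messages land in /var/log/messages):
# Count activations per log file and show the most recent ones
grep -c "SYN Defender: activated" /var/log/messages*
grep "SYN Defender: activated" /var/log/messages* | tail -5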
--------------
So after I disabled this protection, the load went down. The customer was still under attack and the firewall still dropped the traffic, but this protection clearly carries a significant performance cost. The load dropped and the firewall stayed stable after this. We then blocked the attack upstream of the firewall and re-enabled the protection.
I see loads of host/port scans. If the firewall has to reply to all of them because of the above protection, I can imagine it struggles.
Enable the SecureXL penalty box feature, which will help a lot. It should be enabled by default, as far as I am concerned.