Lesley
MVP Gold

Ongoing issue: high CPU load fw_workers + rad

Hi everyone,

At the moment I have an ongoing issue with a customer. The symptoms are as follows:

High CPU load:

top - 12:50:06 up 9:41, 3 users, load average: 24.71, 12.38, 6.78
Tasks: 347 total, 30 running, 317 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 81.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 15.4 si, 0.0 st
KiB Mem : 98087944 total, 69572780 free, 15522848 used, 12992316 buff/cache
KiB Swap: 67108860 total, 67108860 free, 0 used. 81122528 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20460 admin 20 0 216456 103184 25532 R 64.4 0.1 32:58.43 rad
11626 admin 20 0 0 0 0 R 55.8 0.0 72:38.48 fw_worker_7
11628 admin 20 0 0 0 0 R 54.6 0.0 70:53.74 fw_worker_9
11623 admin 20 0 0 0 0 R 49.5 0.0 74:40.62 fw_worker_4
11624 admin 20 0 0 0 0 R 44.8 0.0 71:14.27 fw_worker_5
11627 admin 20 0 0 0 0 R 41.3 0.0 71:24.60 fw_worker_8
11625 admin 20 0 0 0 0 R 40.7 0.0 70:17.07 fw_worker_6
11622 admin 20 0 0 0 0 R 38.5 0.0 72:11.25 fw_worker_3
11619 admin 20 0 0 0 0 R 37.5 0.0 74:28.74 fw_worker_0
11620 admin 20 0 0 0 0 R 37.2 0.0 73:07.65 fw_worker_1
19952 admin 20 0 940080 353524 49376 R 31.5 0.4 110:26.24 fw_full

We see the load slowly increase on the fw_workers first, and later on the RAD daemon.
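To put a number on where the CPU time goes, the `%Cpu(s)` summary line from `top` can be split into its fields. The following is a minimal sketch, not a Check Point tool; the function names `parse_cpu_line` and `kernel_share` are made up for illustration. It adds up system (`sy`), hardware interrupt (`hi`) and software interrupt (`si`) time, which is where kernel threads such as the fw_workers would be expected to show up on a kernel-mode gateway.

```python
import re

def parse_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into {field: percent}."""
    # Each entry looks like '81.9 sy': a number followed by a two-letter tag.
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s+([a-z]{2})\b", line)
    return {tag: float(value) for value, tag in pairs}

def kernel_share(cpu):
    """Sum of system, hardware-interrupt and software-interrupt time."""
    return cpu["sy"] + cpu["hi"] + cpu["si"]

# Summary line copied from the top output above.
line = "%Cpu(s): 2.4 us, 81.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 15.4 si, 0.0 st"
cpu = parse_cpu_line(line)
print(f"user={cpu['us']:.1f}%  kernel+interrupts={kernel_share(cpu):.1f}%  idle={cpu['id']:.1f}%")
```

With almost no user time and no idle time, the line above suggests the box is saturated on the kernel side rather than by a user-space process alone.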

RAD shows no errors in the rad directory or in SmartConsole. The CPU spike log is empty.

The only fix for now is to fail over to the other member, after which the cycle starts over. At the moment I have to fail over every 10-15 minutes.

A TAC case is in progress as we speak. I wanted to reach out to the community for a second check and maybe some shared ideas.

We also upgraded the setup yesterday from R81.20 Take 113 to Take 115, with no improvement. Disabling blades (AV, IPS, etc.) gave the same result.

 

-------
If you like this post please give a thumbs up(kudo)! 🙂
13 Replies
Timothy_Hall
MVP Gold

Please provide the output of the following commands, ideally taken while the issue is occurring, prior to failing over:

fwaccel stat

fwaccel stats -s

enabled_blades

netstat -ni

Any chance you are on a Quantum Force 3900/9XXX/19XXX/29XXX or Lightspeed appliance?  UPPAK is in play there.

If URLF is enabled, this could be the URL categorization cache thrashing because you have far more than 1,000 surfing users behind the firewall.  This cache is not synced between cluster members, so a failover would fix the issue temporarily.  It could also be the AV anti-malware cache thrashing, but it sounds like you tried turning off AV, and the issue persisted. 

The only other thing a failover would do is dump all connections out of the Medium Path into the fastpath upon failover by default, which would significantly reduce the CPU load on your firewall worker instances temporarily.  This effect upon failover was discussed in my CPX Presentation which you may want to review.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
Lesley
MVP Gold

Hi,

Thank you for the reply. This gateway runs kernel mode (open server).

AV is indeed off, and I suspect URL filtering, as you suggest. The requested output is below 🙂

fwaccel stat
+---------------------------------------------------------------------------------+
|Id|Name     |Status     |Interfaces               |Features                      |
+---------------------------------------------------------------------------------+
|0 |KPPAK    |enabled    |eth,eth,et,eth,eth,      |Acceleration,Cryptography     |
|  |         |           |eth,eth,eth,eth          |                              |
|  |         |           |                         |Crypto: Tunnel,UDPEncap,MD5,  |
|  |         |           |                         |SHA1,3DES,DES,AES-128,AES-256,|
|  |         |           |                         |ESP,LinkSelection,DynamicVPN, |
|  |         |           |                         |NatTraversal,AES-XCBC,SHA256, |
|  |         |           |                         |SHA384,SHA512                 |
+---------------------------------------------------------------------------------+

Accept Templates : enabled
Drop Templates : enabled
NAT Templates : enabled
LightSpeed Accel : disabled

fwaccel stats -s
Accelerated conns/Total conns : 56744/159331 (35%)
LightSpeed conns/Total conns : 0/159331 (0%)
Accelerated pkts/Total pkts : 7073015862/7843049171 (90%)
LightSpeed pkts/Total pkts : 0/7843049171 (0%)
F2Fed pkts/Total pkts : 770033309/7843049171 (9%)
F2V pkts/Total pkts : 46691949/7843049171 (0%)
CPASXL pkts/Total pkts : 0/7843049171 (0%)
PSLXL pkts/Total pkts : 4516731459/7843049171 (57%)
CPAS pipeline pkts/Total pkts : 0/7843049171 (0%)
PSL pipeline pkts/Total pkts : 0/7843049171 (0%)
QOS inbound pkts/Total pkts : 0/7843049171 (0%)
QOS outbound pkts/Total pkts : 0/7843049171 (0%)
Corrected pkts/Total pkts : 0/7843049171 (0%)
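For comparing several captures of this output over time (for example before and after a failover), the counters can be parsed rather than read by eye. This is a minimal sketch that only assumes the `name/Total x : a/b (p%)` line layout visible above; `parse_fwaccel_stats` is a hypothetical helper name, not part of any Check Point tooling.

```python
import re

# Matches lines such as 'PSLXL pkts/Total pkts : 4516731459/7843049171 (57%)'.
LINE_RE = re.compile(r"^(?P<name>.+?)/Total \w+\s*:\s*(?P<num>\d+)/(?P<den>\d+)")

def parse_fwaccel_stats(text):
    """Return {counter name: (count, total, percent)} from 'fwaccel stats -s' text."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            num, den = int(m["num"]), int(m["den"])
            stats[m["name"].strip()] = (num, den, 100.0 * num / den if den else 0.0)
    return stats

# A few lines copied from the output above.
SAMPLE = """\
Accelerated conns/Total conns : 56744/159331 (35%)
Accelerated pkts/Total pkts : 7073015862/7843049171 (90%)
F2Fed pkts/Total pkts : 770033309/7843049171 (9%)
PSLXL pkts/Total pkts : 4516731459/7843049171 (57%)
"""

for name, (count, total, pct) in parse_fwaccel_stats(SAMPLE).items():
    print(f"{name:<20} {pct:6.2f}%  ({count}/{total})")
```

Running the same parser on two captures and subtracting the counts gives the packet mix for just that interval, which is more telling than totals accumulated since boot.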

enabled_blades
fw urlf appi SSL_INSPECT anti_bot mon

netstat -ni
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0 1500 0 5062275688 0 40121615 0 2126175119 0 0 0 BMmRU
bond1 1500 0 2152513947 0 56495 0 4760528963 0 0 0 BMmRU
bond1. 1500 0 1677398087 0 0 0 3790216944 0 0 0 BMRU
bond1. 1500 0 78383194 0 0 0 57714744 0 0 0 BMRU
bond1. 1500 0 396567854 0 0 0 912479637 0 0 0 BMRU
bond1. 1500 0 162360 0 0 0 121385 0 0 0 BMRU
bond 1500 0 614958571 0 9384 0 292962839 0 0 0 BMmRU
bond. 1500 0 122325 0 0 0 92217 0 0 0 BMRU
bond. 1500 0 148628996 0 0 0 149600414 0 0 0 BMRU
bond. 1500 0 139513 0 0 0 64590 0 0 0 BMRU
bond. 1500 0 2370242 0 0 0 645866 0 0 0 BMRU
bond. 1500 0 892143 0 0 0 658440 0 0 0 BMRU
bond. 1500 0 337976 0 0 0 283213 0 0 0 BMRU
bond. 1500 0 2201586 0 0 0 3536011 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 136321 0 0 0 71281 0 0 0 BMRU
bond. 1500 0 57390 0 0 0 1128 0 0 0 BMRU
bond. 1500 0 2297341 0 0 0 2832088 0 0 0 BMRU
bond. 1500 0 37220 0 0 0 916 0 0 0 BMRU
bond. 1500 0 16648876 0 0 0 6172715 0 0 0 BMRU
bond. 1500 0 405374 0 0 0 182786 0 0 0 BMRU
bond. 1500 0 217043683 0 0 0 24524361 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 220033 0 0 0 461297 0 0 0 BMRU
bond. 1500 0 79231 0 0 0 26952 0 0 0 BMRU
bond. 1500 0 137994 0 0 0 67416 0 0 0 BMRU
bond. 1500 0 1668621 0 0 0 756463 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37037 0 0 0 493 0 0 0 BMRU
bond. 1500 0 401068 0 0 0 1324006 0 0 0 BMRU
bond. 1500 0 10312912 0 0 0 6584546 0 0 0 BMRU
bond. 1500 0 98212 0 0 0 40187 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37427 0 0 0 1357 0 0 0 BMRU
bond. 1500 0 236101 0 0 0 123601 0 0 0 BMRU
bond. 1500 0 530658 0 0 0 166934 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 228858 0 0 0 134749 0 0 0 BMRU
bond. 1500 0 37100 0 0 0 640 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 175716 0 0 0 106860 0 0 0 BMRU
bond. 1500 0 16603109 0 0 0 15064450 0 0 0 BMRU
bond. 1500 0 59945 0 0 0 12711 0 0 0 BMRU
bond. 1500 0 6823638 0 0 0 7417717 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37037 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 82447 0 0 0 28684 0 0 0 BMRU
bond. 1500 0 130923 0 0 0 63626 0 0 0 BMRU
bond. 1500 0 37607 0 0 0 1531 0 0 0 BMRU
bond. 1500 0 37223 0 0 0 949 0 0 0 BMRU
bond. 1500 0 155458468 0 0 0 41252481 0 0 0 BMRU
bond. 1500 0 55405 0 0 0 973 0 0 0 BMRU
bond. 1500 0 299659 0 0 0 251380 0 0 0 BMRU
bond. 1500 0 142547 0 0 0 68426 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 200022 0 0 0 134207 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 114065 0 0 0 74458 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 138096 0 0 0 76497 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 121315 0 0 0 52825 0 0 0 BMRU
bond. 1500 0 697260 0 0 0 299303 0 0 0 BMRU
bond. 1500 0 439440 0 0 0 424124 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 157129 0 0 0 89365 0 0 0 BMRU
bond. 1500 0 37037 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37037 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37130 0 0 0 790 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 374450 0 0 0 358326 0 0 0 BMRU
bond. 1500 0 361316 0 0 0 505759 0 0 0 BMRU
bond. 1500 0 212428 0 0 0 141488 0 0 0 BMRU
bond. 1500 0 197000 0 0 0 88029 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 51423 0 0 0 32713 0 0 0 BMRU
bond. 1500 0 37037 0 0 0 469 0 0 0 BMRU
bond. 1500 0 101164 0 0 0 7108444 0 0 0 BMRU
bond. 1500 0 2732141 0 0 0 2648311 0 0 0 BMRU
bond. 1500 0 130902 0 0 0 80956 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 11641791 0 0 0 5568622 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 158895 0 0 0 105245 0 0 0 BMRU
bond. 1500 0 157470 0 0 0 106759 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 37034 0 0 0 469 0 0 0 BMRU
bond. 1500 0 10197392 0 0 0 12422460 0 0 0 BMRU
bond3 1500 0 19429548 0 0 0 34003635 0 0 0 BMmRU
eth 1500 0 1585842395 0 42069 0 2466758424 0 0 0 BMsRU
eth 1500 0 566670978 0 14426 0 2293763935 0 0 0 BMsRU
eth 1500 0 314549649 0 5632 0 138402512 0 0 0 BMsRU
eth 1500 0 300402150 0 3752 0 154560077 0 0 0 BMsRU
eth 1500 0 2605246905 0 20795131 0 1111371041 0 0 0 BMsRU
eth 1500 0 2457032749 0 19326484 0 1014805116 0 0 0 BMsRU
eth 1500 0 17752963 0 0 0 14078407 0 0 0 BMRU
eth 1500 0 11090635 0 0 0 16358918 0 0 0 BMsRU
eth 1500 0 8338913 0 0 0 17644717 0 0 0 BMsRU
lo 65536 0 690832 0 0 0 690832 0 0 0 ALdRU
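With this many interfaces, the RX-DRP column is easier to judge as a percentage of RX-OK than as a raw count. The sketch below assumes the 12-column `netstat -ni` layout shown in the header row above; `rx_drop_ratios` is an illustrative name. It returns a list rather than a dict because the interface names in this paste are truncated and therefore not unique.

```python
def rx_drop_ratios(text):
    """Return [(iface, rx_ok, rx_drp, drop_percent)] from 'netstat -ni' output."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        # Data rows: Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
        if len(cols) != 12 or not cols[3].isdigit():
            continue  # skips the title and header rows
        rx_ok, rx_drp = int(cols[3]), int(cols[5])
        pct = 100.0 * rx_drp / rx_ok if rx_ok else 0.0
        rows.append((cols[0], rx_ok, rx_drp, pct))
    return rows

# Rows copied from the output above.
SAMPLE = """\
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0 1500 0 5062275688 0 40121615 0 2126175119 0 0 0 BMmRU
bond1 1500 0 2152513947 0 56495 0 4760528963 0 0 0 BMmRU
lo 65536 0 690832 0 0 0 690832 0 0 0 ALdRU
"""

for iface, rx_ok, rx_drp, pct in rx_drop_ratios(SAMPLE):
    print(f"{iface:<8} RX-DRP {rx_drp:>10} of {rx_ok:>12} = {pct:.4f}%")
```

These counters are cumulative, so the ratio describes the whole period since the interface came up, not only the time the issue was occurring.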
Timothy_Hall
MVP Gold

All those outputs look OK. I'm pretty sure this is a cache thrashing issue caused by AB and/or URLF. See the last two paragraphs on the second page below, quoted from the most recent edition of my Gateway Performance Course:

[Attached images: rad11.png, rad12.png]

the_rock
MVP Gold

Apart from what Tim asked for, maybe send us the output of the commands below as well.

Andy

************

fw tab -t connections -s

fw ctl multik print_heavy_conn

D_W
Advisor

Hi,

In the thread URL-filtering-blade-RAD-process-causing-high-CPU-tip we discussed a few RAD issues.
In our case of high CPU caused by RAD, we had to disable the RAD autodebug option as described in sk182859.

Cheers!

the_rock
MVP Gold

Good call @D_W 

Lesley
MVP Gold

Hi,

Thanks for the tip, autodebug is already disabled 🙂

We see loads of the following RAD error. 

FlowError=RAD request exceeded maximum handing time

On the other hand, the CPU issue can be present while there are 0 RAD errors. The errors had been gone all day and just popped up again:

grep "FlowError=" $FWDIR/log/rad_events/Errors/* | grep -oP '(?<=FlowError=).*' | sort | uniq -c | sort -nr
483 RAD request exceeded maximum handing time
15 Failed to fetch Check Point resources. Timeout was reached
14 Failed to fetch Check Point resources. Couldn't resolve host name
1 Failed to fetch Check Point resources. Couldn't connect to server

_Val_
Admin

That error is about connectivity issues. What is using CPU now, RAD or the FW workers?

Lesley
MVP Gold

Hi Val,

First we see increased load on the fw_workers; shortly afterwards RAD joins in with high load as well.

RAD errors have been absent most of the day today. We experienced high load without any RAD errors in the relevant folder.

This is how it looks mid-issue. The customer notices problems at a load average of around 25:

top - 17:36:03 up 14:27, 4 users, load average: 10.60, 7.64, 7.45
Tasks: 350 total, 19 running, 331 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.0 us, 78.1 sy, 0.0 ni, 11.4 id, 0.0 wa, 0.5 hi, 6.0 si, 0.0 st
KiB Mem : 98087944 total, 49163212 free, 15571648 used, 33353084 buff/cache
KiB Swap: 67108860 total, 67108860 free, 0 used. 81177276 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19952 admin 20 0 962644 374820 49380 R 86.5 0.4 200:43.26 fw_full
11622 admin 20 0 0 0 0 R 69.4 0.0 133:56.11 fw_worker_3
11619 admin 20 0 0 0 0 R 62.3 0.0 137:19.11 fw_worker_0
11625 admin 20 0 0 0 0 R 61.6 0.0 130:51.28 fw_worker_6
11624 admin 20 0 0 0 0 R 60.0 0.0 133:16.94 fw_worker_5
11626 admin 20 0 0 0 0 R 57.4 0.0 134:03.56 fw_worker_7
11623 admin 20 0 0 0 0 R 57.1 0.0 136:49.37 fw_worker_4
11621 admin 20 0 0 0 0 R 56.5 0.0 133:35.66 fw_worker_2
11628 admin 20 0 0 0 0 R 54.2 0.0 131:37.48 fw_worker_9
11627 admin 20 0 0 0 0 R 53.5 0.0 132:21.71 fw_worker_8
20460 admin 20 0 220916 111744 28284 R 53.2 0.1 60:11.37 rad
11620 admin 20 0 0 0 0 R 44.5 0.0 134:50.44 fw_worker_1

D_W
Advisor

Do you use external DNS servers like 9.9.9.9? They will eventually block you for sending too many requests per minute.
Or maybe you are hitting the limit in

$FWDIR/conf/rad_conf.C:

:max_flows (1000)

Lesley
MVP Gold

The CSV file did not show max flows being reached today, only this morning. TAC noticed that we had reached the cap, but said it was not necessary to increase the max_flows value. The firewalls send their DNS queries to an internal Infoblox server. Where it forwards them after that I don't know; I could ask if it is important 🙂

Lesley
MVP Gold

Update: we suspect the customer was under attack. I noticed the following log entries:

SYN Defender: activated <interface>. Number of not established connections is 5017

Once there are more than 5000 not-established connections, SYN Defender kicks in and does the following (copied from the SK):

When the Gateway decides that a server is under attack, it switches to SYN Relay Defense. SYN Relay counters the attack by making sure that the three-way handshake is complete before sending a SYN packet to the connection's destination.

Even if the destination server is not listening on that port, the Gateway will respond with a SYN-ACK to make sure that the client completes the three-way handshake with an ACK; it does this to determine the legitimacy of the connection. After the Gateway has determined that the connection is legitimate, it forwards the packet to the firewall layer and eventually to the destination server

--------------

So after I disabled this protection, the load went down. The customer was still under attack and the firewall was still dropping traffic, but this protection is rated as having a critical performance impact. With it disabled, the load dropped and the firewall became stable. We then blocked the attack upstream (before the firewall) and enabled the protection again.

I see loads of host and port scans. If the firewall has to reply to all of them because of the above protection, I can imagine it struggling.
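The mechanism described in the SK excerpt can be pictured with a toy model. This is an illustration only and not Check Point's implementation: the class `SynRelayModel`, its counters, and the absence of any deactivation or timeout logic are all simplifying assumptions; only the threshold of 5000 comes from the thread. The sketch shows why relay mode moves work onto the gateway: once the threshold is crossed, every SYN is answered by the gateway itself, and only connections that complete the handshake reach the destination.

```python
class SynRelayModel:
    """Toy model of a threshold-triggered SYN relay (illustrative only)."""

    def __init__(self, threshold=5000):
        self.threshold = threshold   # activation level mentioned in the thread
        self.half_open = set()       # SYN seen, handshake not yet completed
        self.forwarded = set()       # connections passed on to the destination
        self.relay_active = False
        self.gateway_synacks = 0     # SYN-ACKs the gateway had to send itself

    def on_syn(self, conn):
        self.half_open.add(conn)
        if len(self.half_open) > self.threshold:
            self.relay_active = True
        if self.relay_active:
            # Relay mode: the gateway answers and holds the SYN back.
            self.gateway_synacks += 1
        else:
            # Normal mode: the SYN goes straight to the destination.
            self.forwarded.add(conn)

    def on_ack(self, conn):
        if conn in self.half_open:
            self.half_open.discard(conn)
            # Handshake completed, so the connection is treated as legitimate.
            self.forwarded.add(conn)

# A scan: 6000 SYNs that never complete the handshake.
model = SynRelayModel(threshold=5000)
for i in range(6000):
    model.on_syn(("scanner", i))
print(model.relay_active, model.gateway_synacks, len(model.forwarded))
```

In this model every scan packet beyond the threshold costs the gateway a reply, which matches the intuition above that heavy host and port scanning plus this protection can be expensive.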

Timothy_Hall
MVP Gold

Enable the SecureXL penalty box feature which will help a lot.  It should be enabled by default as far as I am concerned.
