Re: High CPU VSX PSLXL R80.20 - how to debug fw_wo...

Martin_Oles · ‎2020-11-23

Hi all,

a customer is facing high cpu on VSX, but even with TAC assisting, I am struggling to figure out, how to debug which blade or which part of fw_worker is causing it.

System is R80.20 Take 183. On gateway are running some virtual systems, affected only virtual system 1. There are blades activated, And a week ago, suddenly CPUs started to peak from 40% to 100%. Reboot did not help, traffic switchover also had no effect. Turning off some of "not so important" blades had minimal impact as well.

I am suspecting, that there might be issue with DNS, ActiveDirectory or something seemingly unrelated, but currently I am having no evidence.

Could you please just give me any hint, where to look what might causing fw_worker to use so much CPU time?

Thank you.

Please find output from some commands:

[Expert@gateway:0]# cpinfo -y all

This is Check Point CPinfo Build 914000202 for GAIA
[IDA]
No hotfixes..

[CPFC]
HOTFIX_R80_20_JUMBO_HF_MAIN Take: 183

[MGMT]
HOTFIX_R80_20_JUMBO_HF_MAIN Take: 183

[FW1]
HOTFIX_R80_20_JUMBO_HF_MAIN Take: 183

FW1 build number:
This is Check Point's software version R80.20 - Build 256
kernel: R80.20 - Build 255

[SecurePlatform]
HOTFIX_R80_20_JUMBO_HF_MAIN Take: 183

[CPinfo]
No hotfixes..

Only this particular virtual system affected, over all, not so much traffic there:

[Expert@gateway:0]# vsx stat -l
...
VSID: 1
VRID: 1
Type: Virtual System
Name: virtualsystem1
Security Policy: policy
Installed at: 18Nov2020 8:38:19
SIC Status: Trust
Connections number: 69145
Connections peak: 72530
Connections limit: 99900

We have tried to switch off AntiVirus, AntiBot blades so far, as system is in production:

[Expert@gateway:1]# enabled_blades
fw urlf av appi identityServer anti_bot mon

Affinity looks fine, we have added additional CPU to virtual system, but given in account sudden increase of CPU time with the same amount of traffic, I would prefer more sophisticated solution than only "more power".

[Expert@gateway:0]# fw ctl affinity -l
Mgmt: CPU 0
Sync: CPU 1
eth1-01: CPU 1
eth1-02: CPU 0
eth1-03: CPU 0
eth1-04: CPU 1
eth1-05: CPU 1
eth3-01: CPU 0
eth3-02: CPU 0
eth3-03: CPU 1
eth3-04: CPU 1
VS_0 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_1 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_2 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_3 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_4 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_5 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_6 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
VS_7 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

[Expert@gateway:0]# top
Tasks: 415 total, 1 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 6.0%us, 1.6%sy, 0.0%ni, 91.2%id, 0.1%wa, 0.1%hi, 1.1%si, 0.0%st
Mem: 32778976k total, 15861368k used, 16917608k free, 389068k buffers
Swap: 18892344k total, 0k used, 18892344k free, 6372636k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20156 admin 0 -20 3560m 2.8g 300m S 594 8.9 5069:25 fwk1_dev_0
8156 admin 0 -20 772m 167m 57m S 6 0.5 57:17.01 fwk6_dev_0
20605 admin 16 0 299m 74m 39m S 6 0.2 42:36.06 cpd
25484 admin 15 0 751m 221m 40m S 6 0.7 151:10.55 fw_full
3683 admin 15 0 256m 144m 53m S 4 0.5 482:48.48 rad
9099 admin 16 0 296m 67m 39m S 4 0.2 19:33.27 cpd
...
8456 admin 3 -20 3175m 2.5g 236m S 87 7.8 432:49.48 fwk1_3
8457 admin 0 -20 3175m 2.5g 236m R 84 7.8 429:57.89 fwk1_4
8458 admin 0 -20 3175m 2.5g 236m R 80 7.8 422:10.97 fwk1_5
8454 admin 0 -20 3175m 2.5g 236m S 79 7.8 457:55.30 fwk1_1
8455 admin 0 -20 3175m 2.5g 236m R 72 7.8 429:56.26 fwk1_2
8453 admin 0 -20 3175m 2.5g 236m R 72 7.8 434:19.71 fwk1_0

_Val_ · ‎2020-11-23

Which blades are enabled? How long are the spikes? Is it an Internet facing VS?

I would start by checking if SXL/CoreXL statistics are bending during the spikes, and rule out the usual suspects: heavy connections & deep policy drops (cleanup rule). Do you have drop optimization enabled?

Since those are FWKs, could be either F2F or PXL. I would look for former at the beginning.

I have seen quite a few cases where internet scans were doing quite an impact on Internet facing VS. Could be consistent with your symptoms:

seem random and unrelated to production traffic
cannot be cured by failover or reboot

How many cores do you run on that VS?

Martin_Oles · ‎2020-11-23

Enabled blades:

fw urlf av appi identityServer anti_bot mon

Problem is, that those are not typical spikes, it is continually 80% CPU consumption with 8 cores during office hours, two weeks ago it was around 50% with 5 cores presumably with the same traffic pattern.

Yes, it is external firewall, so it can get traffic directly from the internet.

Reboot did not helped. Switchover to other member had the same result - high CPU.

No known configuration change on firewall prior issue started.

_Val_ · ‎2020-11-23

Most probably not AI. Was the situation changing gradually? Can you put a finger on a moment in time it changed?

Any chance to turn off, or simplify policy for the rest of the blades? Also, acceleration and corexl statistics would help, plus fwaccel stat output

Martin_Oles · ‎2020-11-23

Customer is claiming, that issue has started November 18th. With CPviwer I identified November 17th at 11.00 , when I can see on graphs sudden spike in CPU interrupts from 1Million To 1,1 Billion, the same time counter Interface errors started showing error rate 50k, even output from netstat shows 0. Further more Interfaces Throughput dropped just prior events from 715Mbps to 0 for moment. Nothing relevant found from local logs, traffic logs does not show anything unusual, at least not clearly visible.

Output from fwaccel looks as I would expected, majority of traffic goes via PSLXL.

[Expert@gateway:1]# fwaccel stats -s
Accelerated conns/Total conns : 1298/73821 (1%)
Accelerated pkts/Total pkts   : 12358235182/12405674200 (99%)
F2Fed pkts/Total pkts         : 47439018/12405674200 (0%)
F2V pkts/Total pkts           : 47430970/12405674200 (0%)
CPASXL pkts/Total pkts        : 0/12405674200 (0%)
PSLXL pkts/Total pkts         : 11758122745/12405674200 (94%)
CPAS inline pkts/Total pkts   : 0/12405674200 (0%)
PSL inline pkts/Total pkts    : 0/12405674200 (0%)
QOS inbound pkts/Total pkts   : 0/12405674200 (0%)
QOS outbound pkts/Total pkts  : 0/12405674200 (0%)
Corrected pkts/Total pkts     : 0/12405674200 (0%)

Drop templating is also activated:

[Expert@fwcluster01:1]# fwaccel stats -d
Reason                Value              Reason                Value
--------------------  ---------------    --------------------  ---------------
General                           967    CPASXL Decision                     0
PSLXL Decision                1285558    Clear Packet on VPN                 0
Encryption Failed                   0    Drop Template                  286770
Decryption Failed                   0    Interface Down                      0
Cluster Error                       0    XMT Error                           0
Anti-Spoofing                  194825    Local Spoofing                   8506
Sanity Error                      209    Monitored Spoofed                   0
QXL Decision                        0    C2S Violation                       0
S2C Violation                       0    Loop Prevention                     0
DOS Fragments                       0    DOS IP Options                      0
DOS Blacklists                      0    DOS Penalty Box                     0
DOS Rate Limiting                   0    Syn Attack                          0
Reorder                          1025    Virt Defrag Timeout              7368

Errors on interfaces:

[Expert@gateway:0]# netstat -ni
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
Mgmt       1500   0 24089392      0      0      0 30078984      0      0      0 BMRU
Sync       1500   0 106256569      0    384      0 155747864      0      0      0 BMRU
bond1      1500   0 2814106924      0      0      0 4647754685      0      0      0 BMmRU
bond2      1500   0 7756011045      0      0      0 5639537713      0      0      0 BMmRU
eth1-01    1500   0 123220626      0      0      0 143293137      0      0      0 BMRU
eth1-02    1500   0        0      0      0      0        0      0      0      0 BMsU
eth1-03    1500   0 5013412227      3  73523  73523 4622912696      0      0      0 BMRU
eth1-04    1500   0 32681606      0      0      0 31840976      0      0      0 BMRU
eth1-05    1500   0 418773297      0      0      0 1003323082      0      0      0 BMRU
eth3-01    1500   0 1797471464      0      0      0 4498085216      0      0      0 BMsRU
eth3-02    1500   0 1016637167      0      0      0 149672804      0      0      0 BMsRU
eth3-03    1500   0  6412175      0      0      0 5634342714      0      0      0 BMsRU
eth3-04    1500   0 7749603797      0      0      0  5198377      0      0      0 BMsRU
lo        16436   0  2284471      0      0      0  2284471      0      0      0 LRU

_Val_ · ‎2020-11-23

I see here practically no accelerated connections. Matching the first packet through rulebase might also cause performance degradation, especially with 60-70k concurrent connections.

fwaccel stat should show if templates are disabled. Also, any policy changes around November 17 at 11? Audit logs should show some

Martin_Oles · ‎2020-11-23

That is correct, there are activated blades fw urlf av appi identityServer anti_bot mon, then I would say it is correct, that traffic goes through PSLXL.

Templates are enabled.

I am suspecting, that one of blades might causing high cpu, but could not figure out, how to proof it. I have tried to disable some of them, which I have considered safe to do without maintenance window, but no effect. CPUs are busy with fwk_1_ processes.

CheckPointerXL · ‎2023-01-20

Hello Martin, did you find the culprit?

Timothy_Hall · ‎2020-11-23

As Val said the CPU jump could be caused by excessive rulebase lookup overhead due to lack of SecureXL Accept Templates. Another situation that can manifest itself as perpetually high CPU is issues with the state sync network in HA, usually driven by a dramatic jump in the new connections/sec rate. Does cpview reveal a jump in this rate around the time this started?

What is the size of your Internet-routable addressing space on this firewall? /24? Larger? Drop Optimization/Templates is not the greatest dealing with spikes in "trash" hack/scanning traffic coming in from the Internet which can suddenly increase without warning, and is exacerbated by a large Internet-routable footprint size. A SecureXL penalty box setup does a much better job.

Attend my Gateway Performance Optimization R81.20 course
CET (Europe) Timezone Course Scheduled for July 1-2

Are you a member of CheckMates?

High CPU VSX PSLXL R80.20 - how to debug fw_worker?