Martin_Oles
Contributor

Unusually high CPU after migration of VSX from R77.30 to R80.30 Jumbo Take 215

Hi,

I am just wondering if you have observed similar behavior. We are running a Check Point 12200 VSX cluster: 4 CPUs, 8 GB of memory, CPAC-4-10 line card.

Prior to the migration from R77.30 to R80.30 Jumbo Take 215, CPU utilization looked like this:

[Expert@FW01:2]#top
top - 15:08:42 up 541 days, 13:07,  2 users,  load average: 0.96, 1.04, 0.95
Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.6%us,  2.3%sy,  0.0%ni, 87.2%id,  0.1%wa,  0.2%hi,  5.5%si,  0.0%st
Mem:   8029532k total,  7977484k used,    52048k free,   419772k buffers
Swap: 18908408k total,      544k used, 18907864k free,  2963120k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                        
16043 admin      0 -20  818m 278m  21m S   52  3.6 120937:59 fwk2_dev                                                       
 9718 admin     15   0  448m  73m  25m S   16  0.9   9965:41 fw_full                                                        
16026 admin      0 -20  649m 109m  21m S    6  1.4  31184:29 fwk1_dev      

The virtual system had only one virtual instance (CPU) and was working just fine.

But with the very same rulebase and configuration, CPU usage is now through the roof.

[Expert@FW01:2]# top
top - 09:45:29 up 2 days, 6:48, 5 users, load average: 4.04, 4.22, 4.39
Tasks: 163 total, 1 running, 162 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 21.3%id, 0.0%wa, 0.3%hi, 78.3%si, 0.0%st
Cpu1 : 56.1%us, 10.0%sy, 0.0%ni, 27.6%id, 0.0%wa, 0.0%hi, 6.3%si, 0.0%st
Cpu2 : 64.3%us, 10.3%sy, 0.0%ni, 18.3%id, 0.0%wa, 0.0%hi, 7.0%si, 0.0%st
Cpu3 : 62.1%us, 12.0%sy, 0.0%ni, 19.9%id, 0.0%wa, 0.0%hi, 6.0%si, 0.0%st
Mem: 8029492k total, 4010952k used, 4018540k free, 284872k buffers
Swap: 18908408k total, 0k used, 18908408k free, 1106088k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26699 admin 0 -20 1741m 1.0g 110m S 203 13.6 653:26.79 fwk2_dev_0
27524 admin 15 0 595m 95m 39m S 12 1.2 32:44.97 fw_full
10245 admin 15 0 0 0 0 S 4 0.0 11:56.26 cphwd_q_init_ke
18641 admin 0 -20 809m 194m 46m S 3 2.5 32:04.54 fwk1_dev_0

I have added three virtual instances (CPUs) to the virtual system, but it is still not enough. I am also observing quite high load on CPU 0, where the dispatcher runs.
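For reference, a quick way to see how the load spreads across the CoreXL instances of one virtual system is the per-VS multik view (a sketch using standard Gaia commands; verify on your own version):

```shell
# Switch the shell context to VS 2, then list the firewall instances.
# `fw ctl multik stat` shows each instance with its CPU assignment and
# connection counts, which makes an overloaded instance easy to spot.
vsenv 2
fw ctl multik stat
```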

I have checked SIM affinity, and it looks fine to me:

[Expert@FW01:0]# sim affinity -l -r -v
eth1-02 : 0
eth1-03 : 0
eth1-01 : 0
Mgmt : 0

[Expert@FW01:0]# fw ctl affinity -l -a -v
Interface eth1 (irq 234): CPU 0
Interface eth7 (irq 115): CPU 0
Interface Mgmt (irq 99): CPU 0
Interface eth1-01 (irq 226): CPU 0
Interface eth1-02 (irq 234): CPU 0
Interface eth1-03 (irq 67): CPU 0
VS_0 fwk: CPU 1 2 3
VS_1 fwk: CPU 1 2 3
VS_2 fwk: CPU 1 2 3

In the affected virtual system I can observe a surprisingly high amount of PSLXL traffic, but I have nothing from before the upgrade to compare it with.

[Expert@FW01:2]# fwaccel stats -s
Accelerated conns/Total conns : 26754/115454 (23%)
Accelerated pkts/Total pkts : 5344525154/10554405196 (50%)
F2Fed pkts/Total pkts : 5629452/10554405196 (0%)
F2V pkts/Total pkts : 21331262/10554405196 (0%)
CPASXL pkts/Total pkts : 0/10554405196 (0%)
PSLXL pkts/Total pkts : 5204250590/10554405196 (49%)
QOS inbound pkts/Total pkts : 0/10554405196 (0%)
QOS outbound pkts/Total pkts : 0/10554405196 (0%)
Corrected pkts/Total pkts : 0/10554405196 (0%)

I also tried turning off IPS, which did not help either:

[Expert@FW01:2]# ips stat
IPS Status: Manually disabled
IPS Update Version: 635158746
Global Detect: Off
Bypass Under Load: Off


[Expert@FW01:2]# enabled_blades
fw ips

[Expert@FW01:0]# vsx stat -l

VSID: 2
VRID: 2
Type: Virtual System
Name: ntra
Security Policy: Standard
Installed at: 14Sep2020 20:03:55
SIC Status: Trust
Connections number: 118972
Connections peak: 119651
Connections limit: 549900

I have also observed the following in "fw ctl zdebug + drop" logs; it disappeared after adding the virtual instances.

@;166488350;[kern];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=6 10.20.30.40:50057 -> 15.114.24.198:443 dropped by cphwd_pslglue_handle_packet_cb_do Reason: F2P: Instance 0 is currently fully utilized;
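If you want to watch for these particular drops without wading through the full debug stream, a filtered capture along these lines can help (a sketch; zdebug adds load of its own, so keep the capture short on an already busy box):

```shell
# Stream kernel drop messages and keep only the instance-utilization
# ones seen above; stop with Ctrl-C. Run this only briefly, since
# zdebug itself consumes CPU on a loaded gateway.
fw ctl zdebug + drop 2>&1 | grep -i "fully utilized"
```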

I have opened a support case for this, but so far nothing really helpful has come of it. Am I missing something?

Thank you for your opinions and advice.

4 Replies
Martin_Oles
Contributor

Output from fwaccel stats -s prior to the upgrade:

[Expert@FW01:2]# fwaccel stats -s
Accelerated conns/Total conns : 146685/233205 (62%)
Delayed conns/(Accelerated conns + PXL conns) : 3527/151585 (2%)
Accelerated pkts/Total pkts : 15054685273/15274541182 (98%)
F2Fed pkts/Total pkts : 81148412/15274541182 (0%)
PXL pkts/Total pkts : 138707497/15274541182 (0%)
QXL pkts/Total pkts : 0/15274541182 (0%)

and after upgrade:

[Expert@FW01:2]# fwaccel stats -s
Accelerated conns/Total conns : 26476/118159 (22%)
Accelerated pkts/Total pkts : 6035857883/11882865662 (50%)
F2Fed pkts/Total pkts : 6149406/11882865662 (0%)
F2V pkts/Total pkts : 23685230/11882865662 (0%)
CPASXL pkts/Total pkts : 0/11882865662 (0%)
PSLXL pkts/Total pkts : 5840858373/11882865662 (49%)
QOS inbound pkts/Total pkts : 0/11882865662 (0%)
QOS outbound pkts/Total pkts : 0/11882865662 (0%)
Corrected pkts/Total pkts : 0/11882865662 (0%)

Timothy_Hall
Champion

Almost certainly this is the TLS parser being invoked inappropriately, causing high PSLXL: sk166700: High CPU after upgrade from R77.x to R80.x when running only Firewall and Monitoring blade.... This was also mentioned in the R80.40 addendum for my book.

You can try manually disabling the TLS parser as mentioned in the SK just to verify this is indeed your issue, but the best way to deal with it is to load a Jumbo HFA that contains the fix. For your release, R80.30, the fix was added just two days ago in R80.30 Jumbo HotFix - Ongoing Take 219.
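Once the hotfix is loaded, the installed take can be confirmed from expert mode (a sketch; the `installed_jumbo_take` helper is not present on every version, so `cpinfo -y all` is the fallback):

```shell
# Show the currently installed Jumbo Hotfix take, if the helper exists
# on this build.
installed_jumbo_take
# Fallback: list all installed hotfixes and their take numbers.
cpinfo -y all
```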

"Max Capture: Know Your Packets" Video Series
now available at http://www.maxpowerfirewalls.com
Martin_Oles
Contributor

I patched that cluster on Saturday, Sept 12, and the new Jumbo HotFix was released on Sunday 😒


[Expert@FW01:2]# enabled_blades
fw

The Monitoring blade is not enabled on the firewall (or on the Multi-Domain management).

When checking the traffic that is sent to PSLXL (fwaccel conns -f S), I found around 10,000 HTTPS connections going to PSLXL. So, very likely the TLS parser issue.
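To break such traffic down by service, one can tally destination ports from the `fwaccel conns` output. The snippet below runs on an embedded sample for illustration; the column layout (destination port in field 4) is an assumption that should be checked against the real output before piping `fwaccel conns -f S` into the awk stage.

```shell
# Count connections per destination port. The sample mimics
# `fwaccel conns` lines: src sport dst dport proto flags.
sample='10.0.0.1 50057 15.114.24.198 443 6 ...S...
10.0.0.2 50058 15.114.24.198 443 6 ...S...
10.0.0.3 51000 10.20.30.40 445 6 ...S...'
printf '%s\n' "$sample" \
  | awk '{count[$4]++} END {for (p in count) print count[p], p}' \
  | sort -rn
# prints: 2 443
#         1 445
```

Against a live box, replace the sample with the real command output to see which services dominate the PSLXL path.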

Thank you!

Martin_Oles
Contributor

So, update about this issue.

You were right: the TLS parser was causing traffic to go via PSLXL, and the issue disappeared after installation of R80.30 Jumbo HotFix - Ongoing Take 219.

[Expert@FW01:2]# fwaccel stats -s
Accelerated conns/Total conns : 116807/120869 (96%)
Accelerated pkts/Total pkts : 4853922830/5213520449 (93%)
F2Fed pkts/Total pkts : 6874512/5213520449 (0%)
F2V pkts/Total pkts : 18827387/5213520449 (0%)
CPASXL pkts/Total pkts : 0/5213520449 (0%)
PSLXL pkts/Total pkts : 352723107/5213520449 (6%)
QOS inbound pkts/Total pkts : 0/5213520449 (0%)
QOS outbound pkts/Total pkts : 0/5213520449 (0%)
Corrected pkts/Total pkts : 0/5213520449 (0%)

Only a bit of NetBIOS traffic on port 445 still goes via PSLXL.

I have also adjusted affinity, since the majority of traffic goes via eth1-01 and eth1-02:

[Expert@FW01:0]# fw ctl affinity -l -a -v
Interface eth7 (irq 115): CPU 0
Interface Mgmt (irq 99): CPU 0
Interface eth1-01 (irq 226): CPU 0
Interface eth1-02 (irq 234): CPU 1
Interface eth1-03 (irq 67): CPU 0
VS_0 fwk: CPU 1 2 3
VS_1 fwk: CPU 1 2 3
VS_2 fwk: CPU 1 2 3

 

[Expert@FW01:0]# cat $FWDIR/conf/fwaffinity.conf
#
i eth1-02 1
i default auto
[Expert@FW01:0]# top
top - 10:06:44 up 13:02, 4 users, load average: 2.34, 2.56, 2.44
Tasks: 156 total, 2 running, 154 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 57.2%id, 0.0%wa, 0.7%hi, 41.8%si, 0.0%st
Cpu1 : 28.9%us, 4.4%sy, 0.0%ni, 14.1%id, 0.0%wa, 0.0%hi, 52.7%si, 0.0%st
Cpu2 : 18.8%us, 9.4%sy, 0.0%ni, 71.1%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu3 : 16.4%us, 5.7%sy, 0.0%ni, 76.6%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st
Mem: 8029492k total, 4256544k used, 3772948k free, 160128k buffers
Swap: 18908408k total, 0k used, 18908408k free, 1440248k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19167 admin 0 -20 1465m 787m 142m S 97 10.0 233:15.40 fwk2_dev_0
3422 admin 15 0 604m 92m 39m R 11 1.2 31:37.59 fw_full

 

Now it is much better, but CPU 1 is still above 80% usage (50% SND + 30% FW worker of virtual system 2). So I wonder: how exactly do I set affinity safely, and preferably on the fly, so that virtual system 2 uses only CPU 2 and CPU 3?

I might be wrong, but turning on Multi-Queue should not have much of an effect in this case.
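From what I can tell, the per-VSID form of `fw ctl affinity` should do this at runtime, but I would appreciate confirmation before touching a production cluster (a sketch based on the VSX affinity documentation; double-check the syntax with `fw ctl affinity -h` on your build first):

```shell
# Pin the fwk instances of VS 2 to CPUs 2 and 3 only, on the fly.
fw ctl affinity -s -d -vsid 2 -cpu 2 3
# Verify the resulting layout afterwards.
fw ctl affinity -l -a -v
```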
