Re: Upgrading to R80.30 has caused one fw_worker t...

Tom_Cripps · ‎2020-01-16

Hi,

Since our upgrade to R80.30, the standby member of our cluster has had an fw_worker stuck at 100% CPU. It isn't one particular fw_worker; it can change. When one drops, another takes its place.

We're also now seeing that when we attempt a policy installation we effectively lose "GAiA": we are presented with the raw Bash shell, as you would see if booted into maintenance mode.

Does anything obvious stick out to anyone?

Tom

HeikoAnkenbrand · ‎2020-01-16

With this one-liner you can view the process load on each core. This can help you locate the process.

CORE=3; ps -e -o pid,psr,%cpu,%mem,cmd | grep -E "^[[:space:]]*[[:digit:]]+[[:space:]]+${CORE}[[:space:]]"

Read more here: ONELINER - process utilization per core
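
A variation on the one-liner above, as a minimal sketch only: it compares the PSR column as a whole field, so asking for core 1 does not also pick up cores 10, 11 and so on. The function name procs_on_core is made up for illustration, and it reads ps output on stdin so it can also be tried on saved output.

```shell
# Keep the header line plus every process whose PSR (last-used core) column
# equals the requested core. Input is `ps -e -o pid,psr,%cpu,%mem,cmd` output.
procs_on_core() {
    awk -v core="$1" 'NR == 1 || $2 == core'
}

# Usage on a live system:
#   ps -e -o pid,psr,%cpu,%mem,cmd | procs_on_core 3
```

Because the match is on the second column rather than on a line prefix, it does not depend on how wide ps pads the PID column.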

➜ CCSM Elite, CCME, CCTE

Tom_Cripps · ‎2020-01-16

Hi Heiko,

Thank you for this. It turned out that CCP wasn't being allowed. As a temporary measure we have added a rule to allow CCP; QA feel this should be implied, though.

Tom

KernelGordon · ‎2020-01-16

In R80.30 CCP Encryption was introduced. I would recommend checking the R80.30 ClusterXL Admin Guide and reading up on the new feature to make sure that it is not causing you any issues.

To test, you can run `cphaconf ccp_encrypt off` from expert mode on both cluster members. If you stop seeing issues, then this was the problem.

Timothy_Hall · ‎2020-01-16

This is covered in the third edition of my book; it is probably one of these two things:

1) Cluster Members inappropriately attempting to inspect CCP traffic from other clusters: sk132672: High CPU on ClusterXL due to inspection of CCP packets

2) As KernelGordon said, it could be the overhead of CCP encryption. Quoted from my book:

CCP Encryption

Starting in version R80.30 with Gaia kernel version 3.10, all CCP traffic is automatically
encrypted to protect it against tampering. You can confirm the configuration state of
CCP encryption with the expert mode command cphaprob ccp_encrypt. As long
as your firewall has AES New Instructions (AES-NI – covered in Chapter 9) as part of its
processor architecture, the additional load incurred by this CCP encryption is expected to
be negligible, and my lab testing has seemed to confirm this. I’d recommend leaving
CCP encryption enabled due to the security benefits it provides.

However let’s suppose you just upgraded your cluster to R80.30 or later with Gaia
kernel 3.10, and you are noticing increased CPU utilization that you can’t seem to pin
down. If you suspect it is the new CCP encryption feature causing the unexplained CPU
load (keep in mind this is more likely on firewall hardware that does not support AES-NI
– see Chapter 9), try these steps to confirm:

1. Baseline the firewall’s CPU usage
2. From expert mode, on all cluster members execute command cphaconf ccp_encrypt off
3. Examine the firewall’s CPU usage, if it drops substantially consider leaving CCP encryption disabled, but be mindful of the security ramifications
4. From expert mode, on all cluster members execute command cphaconf ccp_encrypt on
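
For the "baseline" and "examine" steps above, here is a rough sketch of one way to get comparable numbers: take two samples of the aggregate "cpu" line from /proc/stat and work out the busy percentage between them. The helper names sample_cpu and cpu_busy_pct are made up for illustration, and the 60-second window is an arbitrary choice; cpview or top will serve just as well.

```shell
# Print the aggregate "cpu ..." line from /proc/stat (Linux only).
sample_cpu() { grep '^cpu ' /proc/stat; }

# Busy percentage between two such lines (first sample, second sample).
# Columns: user nice system idle iowait irq softirq steal.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6      # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Usage, run once before and once after `cphaconf ccp_encrypt off`:
#   a=$(sample_cpu); sleep 60; b=$(sample_cpu); cpu_busy_pct "$a" "$b"
```

This gives a single figure for the whole box; to watch one fw_worker, the same arithmetic can be applied to the matching cpuN line instead.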

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Tom_Cripps · ‎2020-01-17

Hi Tim,

Just purchased the book, so I will take a look over the new material. I would suggest, though, that it may be due to the SK linked. We're not encrypting CCP, as we must still be running the old kernel.

The output of cphaprob ccp_encrypt is "off", since we are still on the 2.6 kernel.

We're looking into the possibility of the issue being related to that SK. Unless you have anything else to check?

Timothy_Hall · ‎2020-01-17

I guess the question is what the heck that fw_worker is doing that is making it so busy: is it processing traffic, or performing some internal function like state sync, CCP processing, etc.?

I suppose it could be related to an elephant flow (which I'll be speaking about at CPX) that got assigned to that worker. Try identifying the Firewall Worker instance number (this is usually different from the core number; use fw ctl affinity -l -r), then in cpview visit these two screens for that particular instance:

Advanced...CoreXL...Instances...FW-Instance#...Top FW-Lock consumers
CPU...Top-Connections...Instance#...Top Connections

If these screens don't show anything unusual for the busy instance, that tells me it is hung up on some kind of internal function and not something directly related to processing traffic.
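
To go from the busy core to the instance number, a small filter over the fw ctl affinity -l -r output can help. This is only a sketch: it assumes the "CPU N: fw_X" line layout seen in output pasted in threads like this one, and the function name instance_on_cpu is made up for illustration.

```shell
# Read `fw ctl affinity -l -r` output on stdin and print the firewall
# instance number(s) bound to the given CPU core.
instance_on_cpu() {
    awk -v cpu="$1" '$1 == "CPU" && $2 == (cpu ":") {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^fw_[0-9]+$/) print substr($i, 4)
    }'
}

# Usage on a gateway:
#   fw ctl affinity -l -r | instance_on_cpu 3
```

The number it prints is the one to select under FW-Instance# and Instance# in the cpview screens above.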

Tom_Cripps · ‎2020-01-17

We're seeing roughly 500,000 inbound packets handled by that worker in 20 seconds; bear in mind this is the standby member. The top connection is from 0.0.0.0:8116 to an interface we use for DMZ management, not Sync.

Tom

Tom_Cripps · ‎2020-01-20

Hi Tim,

Just an update for you. We were told that in R80.20 a feature was introduced which allows CCP to set its mode automatically. If I remember right, we were using Multicast before, and it had changed to Unicast. Setting this to Broadcast has now fixed the problem.

Thanks for the pointers.

Tom

Timothy_Hall · ‎2020-01-20

Got it, thanks!

abihsot__ · ‎2020-02-24

Upgrading from R80.20 to R80.30 led to all fw_workers running at 100% CPU. Is there anything obvious I should tune from the output below? Meanwhile I fixed rule #270, as fwaccel complained about it.

Regular gateway cpu load is 20-30%.

Accept Templates : disabled by Firewall
Layer fw1c Security disables template offloads from rule #270
Throughput acceleration still enabled.
Template offloads disabled by IPS protections: network quota
Drop Templates : disabled
NAT Templates : disabled by Firewall
Layer fw1c Security disables template offloads from rule #270
Throughput acceleration still enabled.
Template offloads disabled by IPS protections: network quota

fwaccel stats -s
Accelerated conns/Total conns : 11791/20967 (56%)
Accelerated pkts/Total pkts : 7876604/14176070 (55%)
F2Fed pkts/Total pkts : 754520/14176070 (5%)
F2V pkts/Total pkts : 139728/14176070 (0%)
CPASXL pkts/Total pkts : 0/14176070 (0%)
PSLXL pkts/Total pkts : 5544946/14176070 (39%)
QOS inbound pkts/Total pkts : 0/14176070 (0%)
QOS outbound pkts/Total pkts : 0/14176070 (0%)
Corrected pkts/Total pkts : 0/14176070 (0%)

grep -c ^processor /proc/cpuinfo
4

/sbin/cpuinfo
HyperThreading=disabled

fw ctl affinity -l -r
CPU 0: eth1 eth5 eth2 eth6 eth3 eth4 Sync Mgmt eth1-01 eth1-02
CPU 1: fw_2
in.asessiond fwd dtpsd rtmd vpnd dtlsd mpdaemon lpd wsdnsd cpd cprid
CPU 2: fw_1
in.asessiond fwd dtpsd rtmd vpnd dtlsd mpdaemon lpd wsdnsd cpd cprid
CPU 3: fw_0
in.asessiond fwd dtpsd rtmd vpnd dtlsd mpdaemon lpd wsdnsd cpd cprid
All:

sim affinity -l
eth1-01 : 0
eth1-02 : 0
eth2 : 0
eth6 : 0
eth3 : 0
eth4 : 0
Sync : 0
Mgmt : 0
eth1 : 0
eth5 : 0

fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 3 | 21044 | 21092
1 | Yes | 2 | 20262 | 20756
2 | Yes | 1 | 21122 | 21425

free -m
total used free shared buffers cached
Mem: 15850 3068 12781 0 77 711
-/+ buffers/cache: 2279 13570
Swap: 18394 0 18394

enabled_blades
fw vpn ips mon vpn

cpstat os -f multi_cpu -o 1

Processors load
---------------------------------------------------------------------------------
|CPU#|User Time(%)|System Time(%)|Idle Time(%)|Usage(%)|Run queue|Interrupts/sec|
---------------------------------------------------------------------------------
| 1| 0| 99| 1| 99| ?| 6365|
| 2| 8| 76| 16| 84| ?| 6365|
| 3| 1| 99| 0| 100| ?| 6365|
| 4| 23| 34| 43| 57| ?| 6365|
---------------------------------------------------------------------------------

Timothy_Hall · ‎2020-02-24

Yes, disable the IPS signature Network Quota immediately. If you still need quotas, use the fw samp functionality (fwaccel dos rate in R80.40+); see here: sk112454: How to configure Rate Limiting rules for DoS Mitigation (R80.20 and newer)

Beyond that you probably need to do some IPS tuning, try this quick test to confirm:

1) Baseline current CPU usage

2) Execute command ips off

3) Wait 120 seconds

4) Baseline new CPU usage

5) Execute command ips on

If CPU usage dropped significantly when IPS was disabled, you have received a preview of the gains you might be able to achieve with some IPS tuning. Once tuned up, you might try changing the CoreXL split to 2/2 since your single SND/IRQ core is so busy, but at the moment all CPUs look pretty utilized, so I'd advise against that until after IPS is tuned.
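
For the two baseline steps, one option is to reduce the cpstat os -f multi_cpu -o 1 table to "CPU usage" pairs so the before and after runs can be compared side by side. A minimal sketch, assuming the pipe-delimited table layout shown in the output earlier in this thread; the function name cpu_usage_from_cpstat is made up for illustration.

```shell
# Read the "Processors load" table on stdin and print "<CPU#> <Usage%>"
# for each data row. Header and separator rows are skipped.
cpu_usage_from_cpstat() {
    awk -F'|' 'NF >= 8 && $2 ~ /^ *[0-9]+ *$/ { printf "%d %d\n", $2, $6 }'
}

# Usage on a gateway (interrupt cpstat once a table has been printed):
#   cpstat os -f multi_cpu -o 1 | cpu_usage_from_cpstat
```

Run against the table posted above, this prints 99, 84, 100 and 57 for CPUs 1 to 4, which is the baseline to compare against after ips off.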

Edit: In future I'd advise starting a new discussion thread for questions like this.

abihsot__ · ‎2020-02-24

Sorry for not creating a separate thread for this. I was under the impression I was replying to the correct thread; however, I cannot find it anymore. The subject was very similar.

The baseline CPU usage is from running on R80.20, where it is always around 20-30%. After switching over to the upgraded R80.30 node, I literally have 1 minute before the gateway starts slowing down to the point where the business starts feeling the consequences.

I am happy to work on tuning the IPS policy; it is just that I wasn't expecting such a big performance degradation when moving to R80.30 with the same IPS and other gateway settings. By the way, during the issue I was able to see that IPS protections were bypassed due to the high system load. Probably not the same result as doing "ips off" anyway...

ips bypass stat
IPS Bypass Under Load: Enabled
Currently under load: No
Currently in bypass: No
CPU Usage thresholds: Low: 70, High: 90
Memory Usage thresholds: Low: 70, High: 90

Timothy_Hall · ‎2020-02-24

Upgrading to R80.30 shouldn't cause significantly higher CPU overhead, but based on the blades you have enabled, IPS is the logical place to begin. 90% of effective troubleshooting is knowing where to look.

Khalid_Aftas · ‎2020-07-02

We saw the same behavior when upgrading multiple VSX gateways to R80.30, where 40% of the traffic goes to PSLXL, and that is what is causing the CPU/fwk issue.

Disabling IPS did not help at all.

It has to do with how traffic for TCP 443, 445, etc. is handled by the R80.30 code. The only workaround is to use the fw ctl fast_accel feature to specifically make this traffic go through SXL and be processed by the SND.

At the moment we have a TAC case open to find the root cause.
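
To put a number on how much traffic is taking the PSLXL (medium) path, the ratio can be recomputed from fwaccel stats -s output. A rough sketch, assuming the "label : count/total (pct%)" line layout seen in output posted earlier in this thread; the function name path_share is made up for illustration.

```shell
# Read `fwaccel stats -s` output on stdin and print the share, as a
# percentage with one decimal, for the line containing the given label.
path_share() {
    awk -F'[:/]' -v label="$1" '
        index($0, label) > 0 && $4 + 0 > 0 { printf "%.1f\n", 100 * $3 / $4 }'
}

# Usage on a gateway:
#   fwaccel stats -s | path_share "PSLXL pkts"
```

Tracking that figure before and after adding fast_accel rules shows whether the targeted traffic has actually moved off the medium path.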

abihsot__ · ‎2020-07-02

Indeed, that is a good indication. In fact I suspected the same, as the only gateway which is stuck on R80.20 and was not migrated is the external one, which handles lots of HTTP/HTTPS traffic.

I tried all the suggestions, like optimizing IPS and SecureXL. None of them helped. TAC - got some useless engineer...

I had an idea to go directly to R80.40 and hope something was re-engineered there, but I haven't tried it yet.

Please do let us know if you manage to solve the issue.

Khalid_Aftas · ‎2020-07-02

Same experience with TAC ... a lot of irrelevant suggestions.

Now it is escalated and hopefully in the right hands, with a focus on this specific lead.

I have heard of some really bad experiences with R80.40 and the latest JHF, so you might want to hold your horses 😛

Your best course of action now is to find connections with fwaccel conns, pick out the ones using the medium path ("s/S"), and use fast accel to forcibly accelerate them. You lose some security posture doing that, but if the traffic is trusted, that is better than impacting operations.
