Solved: Re: High CPU after upgrade from 77.30 to 80.10

Kurt_Abela · ‎2017-11-23

Yesterday I upgraded from a distributed R77.30 cluster + mgmt to R80.10, on two 5600 appliances and a Smart-1 210 for mgmt.

Today we are seeing 100% CPU usage on 3 cores of the gateway, while the remaining core (the 5600 has 4 cores in total), which is used for the dynamic dispatcher, is idle. The setup was running fine on R77.30. The fw_worker_0, fw_worker_1, and fw_worker_2 processes are the culprits.

I am also noticing the error below in /var/log/messages:

Nov 23 17:53:44 2017 GW1 kernel: [fw4_2]^[ERROR]: fw_up_limit_new_conn: fwpslglue_newconn() failed

Any ideas please?

PhoneBoy · ‎2017-11-23

Are you just seeing issues with the CPU spiking or are there other traffic issues as well?

Timothy_Hall · ‎2017-11-23

Did you apply the latest GA HFA for R80.10?

After running top and hitting 1, what main type of CPU load are you observing on the 3 cores allocated as workers? (us/sy/wa/hi) Also please provide output of the following commands from the active cluster member:

enabled_blades

fwaccel stat

fwaccel stats -s
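
As a rough illustration of where the us/sy/wa/hi figures come from: top derives them from the per-core counters in /proc/stat, sampled twice. The sketch below does the same arithmetic in Python. The two snapshots are made-up numbers for illustration only (not taken from this gateway), and the helper names parse_proc_stat and cpu_breakdown are hypothetical.

```python
# Per-core CPU breakdown from two /proc/stat snapshots (illustrative sketch).
# Column order in /proc/stat: user nice system idle iowait irq softirq steal
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")

# Made-up sample data, NOT captured from a real gateway.
BEFORE = """\
cpu0 1000 0 500 8000 0 100 400 0
cpu1 2000 0 3000 5000 0 0 0 0
"""
AFTER = """\
cpu0 1010 0 510 8970 0 102 408 0
cpu1 2050 0 3900 5010 0 0 40 0
"""

def parse_proc_stat(text):
    """Map 'cpuN' -> tuple of the first eight cumulative counters."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line and anything that is not a per-core line.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cpus[parts[0]] = tuple(int(v) for v in parts[1:1 + len(FIELDS)])
    return cpus

def cpu_breakdown(before, after):
    """Percentage of time per category between the two snapshots."""
    result = {}
    for cpu, new in after.items():
        deltas = [n - o for n, o in zip(new, before[cpu])]
        total = sum(deltas)
        result[cpu] = {
            name: (round(100.0 * d / total, 1) if total else 0.0)
            for name, d in zip(FIELDS, deltas)
        }
    return result

usage = cpu_breakdown(parse_proc_stat(BEFORE), parse_proc_stat(AFTER))
for cpu, pct in sorted(usage.items()):
    print(cpu, pct)
```

In this invented sample cpu1 spends almost all of its time in sy, while cpu0 is nearly idle.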

--
My book "Max Power: Check Point Firewall Performance Optimization"
now available via http://maxpowerfirewalls.com.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Kurt_Abela · ‎2017-11-23

CPU is a constant 100% on all 3 cores. This morning it is at 60%, as users are not in the office yet.

Yes, the fw_workers are consuming most of the CPU on all 3 cores.

[Expert@GW1:0]# enabled_blades
fw vpn urlf av aspm appi identityServer SSL_INSPECT anti_bot ThreatEmulation mon vpn

[Expert@GW1:0]# fwaccel stat
Accelerator Status : on
Accept Templates : disabled by Firewall
Layer ---Drop Templates : enabled
NAT Templates : disabled by user
NMR Templates : enabled
NMT Templates : enabled

Accelerator Features : Accounting, NAT, Cryptography, Routing,
HasClock, Templates, Synchronous, IdleDetection,
Sequencing, TcpStateDetect, AutoExpire,
DelayedNotif, TcpStateDetectV2, CPLS, McastRouting,
WireMode, DropTemplates, NatTemplates,
Streaming, MultiFW, AntiSpoofing, Nac,
ViolationStats, AsychronicNotif, ERDOS,
McastRoutingV2, NMR, NMT, NAT64, GTPAcceleration,
SCTPAcceleration
Cryptography Features : Tunnel, UDPEncapsulation, MD5, SHA1, NULL,
3DES, DES, CAST, CAST-40, AES-128, AES-256,
ESP, LinkSelection, DynamicVPN, NatTraversal,
EncRouting, AES-XCBC, SHA256

[Expert@GW1:0]# fwaccel stats -s
Accelerated conns/Total conns : 91/6590 (1%)
Accelerated pkts/Total pkts : 76830/359923 (21%)
F2Fed pkts/Total pkts : 106640/359923 (29%)
PXL pkts/Total pkts : 176453/359923 (49%)
QXL pkts/Total pkts : 0/359923 (0%)
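
To put the summary above into per-path terms, a small Python sketch can parse the `fwaccel stats -s` text and compute each path's share of packets. It assumes the output format shown in this post; the function names and the SXL/F2F/PXL short labels are illustrative and not part of any Check Point tool.

```python
import re

# Output pasted from the post above.
SAMPLE = """\
Accelerated conns/Total conns : 91/6590 (1%)
Accelerated pkts/Total pkts : 76830/359923 (21%)
F2Fed pkts/Total pkts : 106640/359923 (29%)
PXL pkts/Total pkts : 176453/359923 (49%)
QXL pkts/Total pkts : 0/359923 (0%)
"""

# Matches lines of the form "<label> : <count>/<total> (<pct>%)".
LINE_RE = re.compile(r"^\s*(.+?)\s*:\s*(\d+)/(\d+)\s*\(\d+%\)\s*$")

def parse_fwaccel_summary(text):
    """Return {label: (count, total)} for every 'x/y (z%)' line."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            stats[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return stats

def packet_path_shares(stats):
    """Percentage of packets per processing path, one decimal place."""
    names = {
        "Accelerated pkts/Total pkts": "SXL",
        "F2Fed pkts/Total pkts": "F2F",
        "PXL pkts/Total pkts": "PXL",
    }
    shares = {}
    for label, short in names.items():
        count, total = stats[label]
        shares[short] = round(100.0 * count / total, 1) if total else 0.0
    return shares

stats = parse_fwaccel_summary(SAMPLE)
shares = packet_path_shares(stats)
print(shares)
```

With the numbers above, roughly 49% of packets take the Medium Path (PXL) and about 30% go F2F, so only around a fifth are fully accelerated.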

Timothy_Hall · ‎2017-11-24

OK, a few things:

0) I'll ask again, have you applied the latest R80.10 GA jumbo HFA?

1) With that many blades enabled on a 5600 w/ 8 GB of RAM, wondering if you are running short of memory. Please provide output of commands free -m and uname -a

2) I'm trying to make sense of that error message you are seeing in the syslog. On the firewall cluster object, do you have "Automatically" set under "Optimizations"?

3) Looks like you have Optimized Drops enabled, which is not very common. Try turning it off and see if the situation improves.

4) Suspecting a possibly unhealthy sync network as well, please provide output of fw ctl pstat

5) Final thing to try: power off the standby member and see what happens to CPU load on the remaining member. If it drops back to normal, that is highly indicative of some kind of ClusterXL issue (including possibly #4 above).

Kurt_Abela · ‎2017-11-26

Many thanks for your assistance.

The issue was "resolved" after disabling most of the rules in application control. We are now creating a new, more optimized rulebase to mitigate the issue. Having said that, there were no particular issues or problems with the rulebase itself, but we are not yet using inline layers as we simply migrated from R77.30 to R80.10 at this stage.

0) Yes, Take 42.

1) RAM was fine at the time, but I do not have any output captured during the issue.

2) It was set to automatic. Peak connections were less than 20k, and most of the time the count was around 6-7k.

3) This was turned on as advised by TAC support to optimise drops and maybe improve CPU usage. It did not make any difference.

4) I do not have any output captured during the issue.

5) We did this during the issue and it did not make a difference.

Timothy_Hall · ‎2017-11-27

> The issue was "resolved" after disabling most of the rules in application control.

Right, this is really common. Accidentally using "Any" in an APCL/URLF policy, or using the dynamic object "Internet" when your firewall's topology is not completely and correctly defined, causes large amounts of LAN-speed traffic to get sucked into the Medium Path.

Alastair_Haddix · ‎2018-04-11

Tim,

How would we define our topology and not use "Any" or "Internet" for the app/URL policy? We need this type of functionality to define which zones we want to apply the policy to, and we also use these for logging traffic. We are experiencing high CPU at the moment, with the fw_worker processes being the top CPU consumers.

Timothy_Hall · ‎2018-04-12

Assuming you are in an APCL/URLF policy layer, from an optimization perspective using object "Internet" is just fine (assuming that it is properly defined in your firewall's topology settings). "Any" is what you want to avoid to keep traffic from needlessly getting pulled into the Medium Path (PXL). Perhaps an example will help.

Let's assume that you are using an ordered APCL/URLF policy layer for just those features. I'll use an ordered layer here since most Check Point admins have a fairly easy time understanding how ordered layers work, because R77.30 and earlier gateways operated this way. In addition, right after a R77.30 to R80+ SMS upgrade ordered layers will be the default.

An optimized APCL/URLF policy that maximizes the amount of high-speed LAN traffic that can be accelerated is generally constructed as follows. If you have an R80.10 gateway, Security Zones will make this much easier. Let's assume a firewall with four interfaces, each with a single Security Zone associated with it: Inside1, Inside2, DMZ, Outside. Let's also assume we are taking a blacklist approach for applications, so the Implicit Cleanup Action for this layer is Accept:

Name: Access Exceptions for certain users/groups

Source: Access Role(s) specifying users/groups. Set Networks on ALL access roles here to a list of all internal subnets, and do not include DMZs. Unfortunately, Security Zones cannot be specified on the Network tab of an Access Role.

Destination: Outside zone

Application: Facebook, etc.

Action: Accept

Track: Detailed Log

Name: Block Bad Stuff for all users

Source: Inside1, Inside2 zones

Destination: Outside zone

Application: Group of prohibited applications

Action: Drop

Track: Log

Name: Separately log unknown applications (optional)

Source: Inside1, Inside2 zones

Destination: Outside zone

Application: Unknown Traffic

Action: Accept

Track: Detailed Log

Name: Log all else for reporting purposes (optional)

Source: Inside1, Inside2 zones

Destination: Outside zone

Application: Any ("Any Recognized" in R77.30)

Action: Accept

Track: Detailed Log

(Missing cleanup rule - Unmatched traffic will be accepted and not logged)

Notice that traffic flowing in the following directions through the firewall won't match any rule in this policy layer at all and will "fall off" the end of this policy layer and hit the Implicit Cleanup Action of Accept:

Inside1,Inside2 -> DMZ

DMZ -> Inside1,Inside2

Inside1 -> Inside2

Inside2 -> Inside1

This is the desired effect: the high-speed LAN traffic blazing between these zones will not be evaluated by APCL/URLF at all, and is eligible to be fully accelerated by SecureXL in the SXL path. This assumes, of course, that the policy associated with another blade such as IPS or Threat Prevention does not need to pull that same traffic up into PXL for inspection. Using the tricks shown in my CPX presentation, IPS/TP can be switched off on the gateway "on the fly" to see if this is indeed the case.
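
The "falls off the end" behaviour of the layer described above can be sketched as a toy Python model. It looks only at the Source and Destination columns of the four explicit rules and lists which zone-to-zone directions match none of them. This is a simplification for illustration, not how the gateway evaluates policy: the Access Role networks of the first rule are assumed to equal the two Inside zones, and all names in the code are hypothetical.

```python
from itertools import permutations

ZONES = ("Inside1", "Inside2", "DMZ", "Outside")
INSIDE = {"Inside1", "Inside2"}

# (rule name, source zones, destination zones) for the explicit rules.
# Assumption: the Access Roles in the first rule cover only the Inside subnets.
RULES = [
    ("Access Exceptions for certain users/groups", INSIDE, {"Outside"}),
    ("Block Bad Stuff for all users", INSIDE, {"Outside"}),
    ("Separately log unknown applications", INSIDE, {"Outside"}),
    ("Log all else for reporting purposes", INSIDE, {"Outside"}),
]

def first_match(src, dst, rules=RULES):
    """Name of the first rule whose Source/Destination cover the pair, else None."""
    for name, sources, destinations in rules:
        if src in sources and dst in destinations:
            return name
    return None

def falls_off(rules=RULES):
    """Zone pairs that match no explicit rule and hit the implicit cleanup."""
    return [
        (src, dst)
        for src, dst in permutations(ZONES, 2)
        if first_match(src, dst, rules) is None
    ]

unmatched = falls_off()
for src, dst in unmatched:
    print(f"{src} -> {dst}: implicit cleanup (Accept), no explicit APCL/URLF rule")
```

In this model only Inside1/Inside2 -> Outside is matched by an explicit rule. Every other direction, including the LAN-to-LAN and LAN-to-DMZ flows listed above, falls through to the implicit cleanup action.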

In the TP policy, using these same techniques (along with the so-called "null" TP profiles covered in my book, NOT a TP Exception) can ensure that high-speed LAN traffic does not get unnecessarily dragged into PXL, which is a classic cause of the high Firewall Worker CPU utilization you are seeing.

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Kurt_Abela · ‎2018-08-29

Hi,

So if, instead of relying on the implicit cleanup rule, you add an explicit rule with Source internal and Destination 'Any', this will cause LAN traffic to be inspected?

thanks

Timothy_Hall · ‎2018-08-30

Yes it will, and you don't typically want to do that. Only traffic that "falls off" the end of the APCL/URLF layer and hits the implicit cleanup rule will not be inspected by APCL/URLF in PXL. There is no way to define an explicit APCL/URLF rule that basically says "don't inspect this"; the traffic just has to fall off. For an APCL/URLF layer, there is no equivalent of the "null" Threat Prevention profile trick, detailed in my book, for avoiding PXL inspection.

Kurt_Abela · ‎2018-09-05

Thanks for the explanation.

It is a bit counter-intuitive, considering that you get the below "warning" on management when you remove the explicit cleanup rule:

SmartConsole R80.10 Help

"Important - Always add an explicit Cleanup Rule at the end of each Layer, and make sure that its Action is the same as the Action of the Implicit Cleanup Rule. If there is no explicit Cleanup Rule, one of these messages will show under the last rule of the Layer:

Missing cleanup rule – Unmatched traffic will be dropped and not logged.
Missing cleanup rule – Unmatched traffic will be accepted and not logged."

Timothy_Hall · ‎2018-09-07

I think that recommendation about always adding an explicit cleanup rule is based more on clarity of policy than performance considerations. That help verbiage may also be a holdover from R77.30 management, where there was no warning message stating what would happen with no explicit cleanup rule present; kudos to Check Point for including that warning in R80+ management to clarify exactly what will happen with no explicit cleanup rule in a policy layer.

PhoneBoy · ‎2018-09-05

I'm pretty sure if you specify a track of Log (versus Detailed or Extended) and don't specify any applications in a rule, then you shouldn't take a Medium Path hit.

That assumes all rules that match are this way.

Timothy_Hall · ‎2018-09-05

I don't think that is the case in R80.10 and earlier unless some kind of subtle change has been slipped into the gateway code along the way; remember that in R77.30 an Application column set for Any would display "Any Recognized" which is a bit more accurate. "Any Recognized" oddly enough also includes any unknown applications, since "Unknown Traffic" is actually its own application that is "recognized". Even if the Application column is set to "Any" in R80.10, APCL still needs to identify the application and that can only happen in the Medium Path.

Here is what happens at policy install time on the gateway: once the atomic load completes into the INSPECT driver, SecureXL is automatically restarted (this restart will no longer happen on an R80.20 gateway). At that time SecureXL must determine, based on IP addresses and port numbers *only*, what types of connections will require "deep inspection" by APCL/URLF (and Threat Prevention, among other things) and must be sent to PXL. In the case of APCL/URLF, it scans through the source, destination, and service columns of all APCL/URLF rules. It essentially calculates ranges for these three columns that DO NOT match any explicit APCL/URLF rules whatsoever; SecureXL will attempt to handle connections falling into those non-matching ranges completely in the SXL path. Of course there could still be a future violation for those non-matching connections, such as the packet being fragmented, that forces inspection up into F2F anyway (violation counters are viewed with fwaccel stats -p). The SecureXL calculated ranges can actually be viewed with the sim ranges command.

In the case of Threat Prevention, SecureXL also scans the TP policy and calculates the ranges of source, destination & service values that DO NOT match any Threat Prevention rules. However SecureXL will also look at the TP profile and which TP blades are actually being invoked in the TP profile specified in the Action column. If there is a "null" TP profile that has all five threat prevention blades unchecked, the source/destination/service of that rule is automatically added to the non-matching ranges that will attempt to be fully accelerated by SecureXL in the SXL path.

There is simply no way to do this "null profile" trick with an explicit APCL/URLF rule. Essentially unless the traffic "falls off" the end of an APCL/URLF ordered layer (or an Application/Category object is not specified in any matching inline layers that are invoking APCL/URLF) it will go PXL. It doesn't matter what the Action or Track setting is for a rule invoking APCL/URLF.
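
The decision described above can be sketched as a toy Python model. To be clear, this is not the actual SecureXL code, just an illustration of the logic as explained in this post: a connection needs PXL if its source, destination, and port are covered by any explicit APCL/URLF rule, or by a Threat Prevention rule whose profile is not a "null" profile. The subnets, the Rule class, the dst_exclude field (standing in for an object like the Outside zone), and the first-match handling of TP rules are all assumptions made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

ANY_NET = ip_network("0.0.0.0/0")
# Hypothetical internal subnets: Inside1, Inside2, DMZ.
INTERNAL = (
    ip_network("10.1.0.0/16"),
    ip_network("10.2.0.0/16"),
    ip_network("192.168.50.0/24"),
)

@dataclass(frozen=True)
class Rule:
    src: tuple                    # networks in the Source column
    dst: tuple                    # networks in the Destination column
    dst_exclude: tuple = ()       # carve-outs, to model "Outside" as Any minus internal
    ports: frozenset = frozenset()  # empty means any service
    null_profile: bool = False    # only meaningful for Threat Prevention rules

def covers(rule, src_ip, dst_ip, port):
    """True if the rule's source/destination/service columns cover the connection."""
    in_src = any(src_ip in net for net in rule.src)
    in_dst = any(dst_ip in net for net in rule.dst) and not any(
        dst_ip in net for net in rule.dst_exclude
    )
    in_svc = not rule.ports or port in rule.ports
    return in_src and in_dst and in_svc

def needs_pxl(src, dst, port, apcl_rules, tp_rules):
    """Toy decision: does this connection have to leave the fully accelerated path?"""
    s, d = ip_address(src), ip_address(dst)
    # Any explicit APCL/URLF rule covering the connection forces PXL,
    # regardless of its Action or Track setting.
    if any(covers(r, s, d, port) for r in apcl_rules):
        return True
    # Assumption: the first covering TP rule decides; a null profile exempts it.
    for r in tp_rules:
        if covers(r, s, d, port):
            return not r.null_profile
    return False

apcl_any = [Rule(src=INTERNAL[:2], dst=(ANY_NET,))]
apcl_outside = [Rule(src=INTERNAL[:2], dst=(ANY_NET,), dst_exclude=INTERNAL)]
tp_rules = [
    Rule(src=INTERNAL, dst=INTERNAL, null_profile=True),  # "null" profile for LAN traffic
    Rule(src=(ANY_NET,), dst=(ANY_NET,)),                 # everything else gets inspected
]

print(needs_pxl("10.1.0.5", "10.2.0.9", 445, apcl_any, []))            # Destination Any
print(needs_pxl("10.1.0.5", "10.2.0.9", 445, apcl_outside, tp_rules))  # falls off + null TP
print(needs_pxl("10.1.0.5", "8.8.8.8", 443, apcl_outside, tp_rules))   # Internet-bound
```

In this model the LAN-to-LAN connection is forced into PXL as soon as the APCL/URLF rule uses a Destination of Any, and stays eligible for full acceleration only when it matches no APCL/URLF rule and the covering TP rule uses a null profile.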

I figured all this out when researching my book but didn't include the above because I couldn't find a way to clearly explain it; hopefully the above didn't get too muddled. Everything in this post is my personal opinion based on my own experience and research, if anything is incorrect I'd love to hear from those inside Check Point with deep knowledge of the actual SecureXL code. 🙂

Matt_Taber · ‎2018-09-06

Thanks for the thorough explanation, great stuff.

On an R77.30 cluster w/ R80.10 management, an APP/URL rule like this:

(screenshot of the rule not preserved)

Would be the likely culprit of this:

[Expert@fw1:0]# fwaccel stats -s
Accelerated conns/Total conns : 121/117288 (0%)
Accelerated pkts/Total pkts : 2305602/347562163 (0%)
F2Fed pkts/Total pkts : 21349491/347562163 (6%)
PXL pkts/Total pkts : 323907070/347562163 (93%)
QXL pkts/Total pkts : 0/347562163 (0%)

Timothy_Hall · ‎2018-09-07

Yep exactly. A primary tuning goal is to make as much traffic as possible eligible for the SXL path and your rule completely defeats that goal. If PXL cannot be avoided for most traffic due to the blades enabled on the firewall, a secondary goal is to save as much CPU overhead as possible in PXL by not having unnecessary blades inspecting the traffic via policy optimizations, using TP exceptions, or employing many other techniques described in my book.
