Your firewall is on fire

ED · ‎2018-06-13

So there you sit in your comfy chair, drinking your morning coffee. The sun is shining, and then suddenly: boom.

You put away your coffee and start investigating. What on earth is happening? Why are your CPU cores suddenly spiking so high? Are you under attack? Is one user causing this, or many? Where do you start investigating? What commands, tools or views do you use? Can we have a discussion where people share what they do when a situation like this suddenly hits? Something like your top 3 CLI commands, or your top 3 investigation steps.

You thought your firewall was tuned, didn't you?

Petr_Hantak · ‎2018-06-13

Yeah, this can happen easily. I believe you'll see many different things you can check.

My personal checks are:

top - to check whether some stuck process is consuming CPU. From time to time even a CLISH instance can freeze and start eating CPU.
fwaccel stat - is acceleration fine? Do you have drop templates enabled? In my experience SecureXL can turn itself off because of an internal error counter. We have already hit this twice in production (always with big impact), and no general fix exists yet.
fw ctl pstat - see counters, watermarks and connection limits
fw tab -t connections -s - check the number of connections again and see, for example, whether it has reached the limit
fwaccel conns | awk '{printf "%-16s %-16s %-10s\n", $1,$3,$4}' | sort | uniq -c | sort -n -r | head -n 50 - to see the top 50 connections when acceleration is running
fw tab -u -t connections |awk '{ print $2 }'|sort -n |uniq -c|sort -nr|head -50 - to see the top 50 connections according to the connections table
check /var/log/messages and the core dump folder - just to be sure
check interface counters and the utilization of the related switch interfaces
try cpview, Smart View Monitor or another monitoring tool - to see whether the spike is connected to interface utilization

Of course there could be much more, but it depends on what the first findings show. I hope others will share more interesting commands and hints here.
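A note on the fw tab pipeline above: the values it counts come out as raw hex, which is hard to read under pressure. The sketch below is a hypothetical helper (decode_hex_ips is not a Check Point tool) that assumes each input line looks like "count hexvalue", as produced by uniq -c, and that an 8-digit hex value is an IPv4 address. Anything else is passed through unchanged.

```shell
# decode_hex_ips: hypothetical helper, not part of any Check Point package.
# Reads "count value" lines on stdin and rewrites 8-digit hex values
# (assumed to be IPv4 addresses) as dotted quads.
decode_hex_ips() {
    awk '
        # Convert a two-character hex string to its decimal value.
        function hexbyte(s,    hi, lo) {
            hi = index("0123456789abcdef", substr(s, 1, 1)) - 1
            lo = index("0123456789abcdef", substr(s, 2, 1)) - 1
            return hi * 16 + lo
        }
        NF < 2 { next }                      # skip empty or short lines
        {
            h = tolower($2)
            gsub(/[^0-9a-f]/, "", h)         # drop commas, brackets and similar
            if (length(h) == 8)
                ip = hexbyte(substr(h, 1, 2)) "." hexbyte(substr(h, 3, 2)) "." \
                     hexbyte(substr(h, 5, 2)) "." hexbyte(substr(h, 7, 2))
            else
                ip = $2                      # not IPv4-sized: leave as is
            printf "%d %s\n", $1, ip
        }'
}
```

Appended to the end of the pipeline it would read: fw tab -u -t connections | awk '{ print $2 }' | sort -n | uniq -c | sort -nr | head -50 | decode_hex_ips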

Timothy_Hall · ‎2018-06-13

Petr Hantak had some excellent suggestions. To dig in a little deeper, you need to determine which specific type of CPU execution is tying up the CPU; this will give you some important clues about where to focus your efforts. The best tool for this is running top in real time while the event is occurring. sar can also be used in historical mode, but it rolls up the sy/si/hi/st values shown in top into a single figure (%system), which can obscure where the issue is occurring. top can also be run in batch mode to catch intermittent spikes, which is covered in my book.

So if you run top look at the us/sy/ni/id/wa/hi/si/st values which are listed below along with hints about how to proceed if that particular value is the high one:

us - Consumption by processes, should be fairly low on a gateway unless there are features enabled such as HTTPS Inspection which cause "process space trips" on the firewall; this effect and what you can do about it is extensively covered in the second edition of my book. fwd or its buddies can definitely be a culprit here if the gateway logging rate is extremely high as well. Note that fw_worker_X CPU execution is NOT counted here, even though they look like processes, see sy below.

sy - CPU consumption processing traffic in the Firewall (F2F) and Medium (PXL) paths; fw_worker_X CPU usage is usually counted here. The fw_worker_X "processes" shown in top are simply representations of the firewall workers down in the kernel and not really processes in the traditional sense. In some cases CPU usage by fw_worker_X "processes" will appear under si, see below.

ni - Execution by processes that have had their process CPU priority lowered (nice'd), irrelevant on a gateway but important on an SMS.

id - Idle time, hopefully self explanatory.

wa - Percentage of time a CPU was blocked (unable to do anything) waiting for an I/O event to occur (usually hard drive access). Anything higher than 5% here (unless policy is currently being installed) is probably a low free memory situation on a gateway, use free -m to investigate further. Any nonzero swap usage may indicate the need for more RAM or the presence of a runaway process consuming excessive amounts of memory.

hi - Percentage of CPU time processing hardware interrupts, on a gateway this is almost all the transfer of packets from the NIC hardware buffers into RAM memory (ring buffer). An excessive value here could indicate extremely high packet rates traversing the firewall or possibly a NIC hardware/driver issue.

si - Soft Interrupts, SoftIRQ processing (i.e. emptying the ring buffer and sending the packets up for inspection) AND the handling of fully-accelerated traffic in the Accelerated path (SXL). If this value is high and your cores allocated to SND/IRQ functions are getting slammed, you may need to reduce the number of Firewall Worker cores so that more SND/IRQ cores can be allocated.

st - Steal - Percentage of CPU cycles requested but denied by the Hypervisor. On a bare-metal firewall (i.e. non VSec/VE) this should always be zero.
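As a rough illustration of the triage above, the following sketch picks the largest non-idle value out of top's CPU summary line so you know which of the hints to follow first. busiest_cpu_state is a hypothetical helper name, and the two line layouts it handles ("5.0%us," and "5.0 us,") are assumptions based on common procps versions; check what your own top prints.

```shell
# busiest_cpu_state: hypothetical helper that reads top output on stdin and,
# for every CPU summary line, prints the non-idle state with the highest value.
busiest_cpu_state() {
    awk '
        /^%?Cpu/ {
            gsub(/[%,]/, " ")                # "5.0%us," -> "5.0 us "
            best = ""; max = -1
            for (i = 2; i <= NF; i++) {
                # A label we care about, preceded by its numeric value.
                if ($i ~ /^(us|sy|ni|wa|hi|si|st)$/ && ($(i - 1) + 0) > max) {
                    max = $(i - 1) + 0
                    best = $i
                }
            }
            name = $1
            sub(/:$/, "", name)
            if (best != "") printf "%s %s %.1f\n", name, best, max
        }'
}
```

One possible use is top -b -d 5 -n 2 | busiest_cpu_state | tail -n 1, which takes the second sample because the first batch sample from top typically reflects averages since boot rather than the current load.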

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Attend my Gateway Performance Optimization R81.20 course
CET (Europe) Timezone Course Scheduled for July 1-2

ED · ‎2018-06-13

Appreciate the thorough explanation of the top command output as it relates to gateway performance. While I didn't catch a screenshot of the top output while the cores were at peak, here is the rest of the screenshot shown above:

It looks to me like a lot of Windows updates caused it, probably all at the same time. According to the Smartview high bandwidth application view, today's traffic was 234 GB to the internal WSUS and 106 GB towards the Internet.

EDA_IT_Security · ‎2018-06-15

And depending on what your investigation turns up when following Petr Hantak's and Tim Hall's pointers, and on whether your environment has evolved, you might end up running cpsizeme to check whether your firewalls are still suitable for that environment.

Gaurav_Pandya · ‎2018-06-15

Hi Tim,

Thanks for sharing the detailed explanation of the top command.

Sven_Glock · ‎2018-06-15

If you don't want to dig too deep, the following tools are also pretty helpful for getting quick pointers to possible root causes:

Healthcheck-Tool: How to perform an automated health check of a Gaia based system
CP-Monitor: Traffic analysis using the 'CPMonitor' tool

Vincent_Bacher · ‎2018-06-15

In addition to the top command, using pstree is very useful as well, to see which process is called by which parent.

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite

Huseyin_Rencber · ‎2018-06-16

I would start the investigation with SecureXL (sxl), top connections, counters and limits, and messages.

fwaccel stats - displays SecureXL acceleration statistics

cat /proc/ppk/stats - displays the total number of packets that passed through the interface

cat /proc/ppk/drop_statistics - displays SecureXL drop statistics

cpview - displays the CPU utilization (and many other counters)

cat /proc/interrupts - displays the number of interrupts on each CPU core from each IRQ

fw ctl pstat - displays FireWall internal statistics about memory and traffic

netstat -ni - displays a table of all network interfaces

sar [-u] [-P { <cpu> | ALL }] [interval_in_sec [number_of_samples]] - displays information about CPU activity, network devices, memory, paging, block IO, etc.
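The raw /proc/interrupts table gets wide and hard to read on a box with many cores. As a minimal sketch, the hypothetical helper below (irq_per_cpu is an illustrative name) sums the counters per CPU, optionally only for rows matching a pattern such as an interface name. It assumes the usual Linux layout: a header row of CPU names, then one row per IRQ with one counter per CPU.

```shell
# irq_per_cpu: hypothetical helper that sums /proc/interrupts counters per CPU.
# Optional argument: a regex selecting rows (for example "eth"); default is all.
irq_per_cpu() {
    awk -v pat="${1:-.}" '
        NR == 1 { ncpu = NF; next }          # header row: CPU0 CPU1 ...
        $1 == "ERR:" || $1 == "MIS:" { next } # single-value rows, not per CPU
        $0 ~ pat {
            for (i = 1; i <= ncpu; i++)
                if ($(i + 1) ~ /^[0-9]+$/) total[i - 1] += $(i + 1)
        }
        END {
            for (c = 0; c < ncpu; c++) printf "CPU%d %.0f\n", c, total[c]
        }'
}
```

For example, irq_per_cpu eth < /proc/interrupts shows how interrupts from interfaces whose names contain "eth" are spread across cores. Running it twice a few seconds apart and comparing the numbers gives a rough idea of which cores are handling the interrupt load right now.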

You can also check sk109236.

PhoneBoy · ‎2018-06-20

I am going to put this in General Product Topics where it belongs.

Love the thread, keep it going!

ED · ‎2018-06-20

Dameon, what would your approach to a situation like this be? I think that would be interesting for all of us to hear.

PhoneBoy · ‎2018-06-20

Y'all have covered most of the things I'd try

Thomas_Allen · ‎2018-06-21

I'm curious what version you are running. We are running R77.30, and just recently turned on CoreXL Dynamic Dispatcher (sk105261). It is on by default in R80.10.

We went through some of these steps trying to figure out what was causing the spike. It turned out to be one of our partners uploading/downloading content, consuming 100% of a CPU core. The good thing that came from this incident was the discovery of CoreXL Dynamic Dispatcher and the Priority Queuing that comes with it (sk105762). Since enabling these two features, CPU utilization on an individual core does still reach 100%, but it does not stay there. Traffic is sent to other cores that are not as busy, spreading the load out.

Martin_Raska · ‎2018-06-26

Hi,

I am adding what was not covered yet: look at historical utilization with sar or cpview -t (history mode), try to find the spikes, and look at the traffic on each interface, or use some other monitoring tool.

In cpview, check Top-Connections in the Network tab and also the CPU tab to see how much CPU time each one consumes.

In the Advanced, Network tab you can see how much traffic is processed by SXL, PXL and F2F; this should give you a hint about which blades are causing it.

If it's IPS, look at sk110737 to evaluate the impact of the signatures. After that it's all about properly tuning SecureXL and CoreXL.
