IPS bypass

Luis_Miguel_Mig · ‎2021-02-18

I have started to get IPS bypass alerts since I upgraded to R80.40 Take 91. I didn't get IPS bypass events on Take 87.
There is almost no traffic in my environment (20 concurrent TCP sessions coming from one host I use for testing/browsing) and the CPU is idle most of the time.

I have 6 cores, 3 of them workers. The average CPU usage is 2% and occasionally goes to 20%, but looking at cpview I have noticed spikes that match the IPS bypass alerts (see below).

I am fairly certain the issue has something to do with Take 91, but I was wondering if there is a way to get more verbose logging to see what is going on when the CPU usage goes over the threshold.

I have URL Filtering, Anti-Bot, Anti-Virus and IPS enabled. I disabled HTTPS Inspection recently.
I am getting about 90% of traffic through the slow path.

CPU Spikes

Overview (last minute):
  Total Spikes:                  3
  Average Spike Duration (Sec):  11
  Average Spike Usage:           95%

Top Spikes (last minute):
  Start Time          CPU   Spike Duration (Sec)   Average Usage
  18Feb2021 9:07:36   5     25                     100%
  18Feb2021 9:08:41   5     5                      93%
  18Feb2021 9:08:51   2     5                      92%

shais · ‎2021-02-18

Hi,
There are several ways to investigate this change. I would appreciate a remote session with you to understand the issue; I will contact you directly to arrange it.

Please open an SR as well.

Luis_Miguel_Mig · ‎2021-02-18

Thanks.
I would say SecureXL is okay. The traffic that goes through the slow path is expected, right?

[Expert@fw1:0]# fwaccel stats -p
F2F packets:
--------------
Violation                     Packets   Violation                     Packets
--------------------  ---------------   --------------------  ---------------
pkt has IP options                  0   ICMP miss conn                      0
TCP-SYN miss conn                   0   TCP-other miss conn                50
UDP miss conn                     197   other miss conn                     0
VPN returned F2F                    0   uni-directional viol                0
possible spoof viol                 0   TCP state viol                      0
SCTP state affecting                0   out if not def/accl                 0
bridge, src=dst                     0   routing decision err                0
sanity checks failed                0   fwd to non-pivot                    0
broadcast/multicast                 0   cluster message                 10289
cluster forward                     0   chain forwarding                    0
F2V conn match pkts                 0   general reason                      0

shais · ‎2021-02-18

You mentioned that most of your traffic takes the slow path. This can trigger the IPS bypass, as it causes a high load on the CPU.

The statistics you showed above mean you don't have any violations in SecureXL, which is good, but they are unrelated to the slow path.
You can see the slow path rate with "fwaccel stats -s".

Luis_Miguel_Mig · ‎2021-02-18

When my testing VM is down, there is only management traffic (gateway cluster messages, NTP, DNS, SNMP, syslog, HTTP requests to the Check Point cloud through the proxy, etc.) and almost 100% of the traffic is not accelerated. Is this behavior expected? Should any of this traffic be accelerated?

When I browse a bit with my test VM, I see the accelerated packets increase. See below.

testing vm down

[Expert@fw1:0]# fwaccel stats -s
Accelerated conns/Total conns : 0/0 (0%)
Accelerated pkts/Total pkts : 0/3302 (0%)
F2Fed pkts/Total pkts : 3302/3302 (100%)
F2V pkts/Total pkts : 0/3302 (0%)
CPASXL pkts/Total pkts : 0/3302 (0%)
PSLXL pkts/Total pkts : 0/3302 (0%)
CPAS pipeline pkts/Total pkts : 0/3302 (0%)
PSL pipeline pkts/Total pkts : 0/3302 (0%)
CPAS inline pkts/Total pkts : 0/3302 (0%)
PSL inline pkts/Total pkts : 0/3302 (0%)
QOS inbound pkts/Total pkts : 0/3302 (0%)
QOS outbound pkts/Total pkts : 0/3302 (0%)
Corrected pkts/Total pkts : 0/3302 (0%)
[Expert@hqfw2b:0]# fwaccel stat
+---------------------------------------------------------------------------------+
|Id|Name |Status |Interfaces |Features |
+---------------------------------------------------------------------------------+
|0 |SND |enabled |eth0,eth2,eth3,eth5,eth6 |Acceleration,Cryptography |
| | | | |Crypto: Tunnel,UDPEncap,MD5, |
| | | | |SHA1,NULL,3DES,DES,AES-128, |
| | | | |AES-256,ESP,LinkSelection, |
| | | | |DynamicVPN,NatTraversal, |
| | | | |AES-XCBC,SHA256,SHA384 |
+---------------------------------------------------------------------------------+

Accept Templates : enabled
Drop Templates : disabled
NAT Templates : enabled

testing vm is up

[Expert@hqfw2b:0]# fwaccel stats -s
Accelerated conns/Total conns : 0/97 (0%)
Accelerated pkts/Total pkts : 4215/9061 (46%)
F2Fed pkts/Total pkts : 4846/9061 (53%)
F2V pkts/Total pkts : 110/9061 (1%)
CPASXL pkts/Total pkts : 0/9061 (0%)
PSLXL pkts/Total pkts : 4215/9061 (46%)
CPAS pipeline pkts/Total pkts : 0/9061 (0%)
PSL pipeline pkts/Total pkts : 0/9061 (0%)
CPAS inline pkts/Total pkts : 0/9061 (0%)
PSL inline pkts/Total pkts : 0/9061 (0%)
QOS inbound pkts/Total pkts : 0/9061 (0%)
QOS outbound pkts/Total pkts : 0/9061 (0%)
Corrected pkts/Total pkts : 0/9061 (0%)
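
For comparing the two runs above, the per-path counters can be pulled out of the "fwaccel stats -s" text with a few lines of Python. This is only a sketch: the function names are made up for illustration, the regular expression assumes the "name/Total x : n/m (p%)" layout seen in the pasted output, and it assumes the output has been saved to a string first.

```python
import re

# Three lines copied from the "testing vm is up" output above.
SAMPLE = """\
Accelerated pkts/Total pkts : 4215/9061 (46%)
F2Fed pkts/Total pkts : 4846/9061 (53%)
PSLXL pkts/Total pkts : 4215/9061 (46%)
"""

# Assumed line layout: "<name>/Total <unit> : <count>/<total> (<pct>%)"
LINE_RE = re.compile(
    r"^\s*(?P<name>.+?)/Total \w+\s*:\s*(?P<count>\d+)/(?P<total>\d+)"
)


def parse_fwaccel_summary(text):
    """Return {counter name: (count, total)} for every matching line."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group("name").strip()] = (
                int(match.group("count")),
                int(match.group("total")),
            )
    return counters


def share_percent(counters, name):
    """Percentage of the total for one counter (0.0 if the total is zero)."""
    count, total = counters[name]
    return 100.0 * count / total if total else 0.0


counters = parse_fwaccel_summary(SAMPLE)
for name in counters:
    print(f"{name}: {share_percent(counters, name):.1f}%")
```

In this sample the accelerated packet count and the PSLXL count are the same number (4215), and accelerated plus F2Fed packets add up to the 9061 total.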

Luis_Miguel_Mig · ‎2021-02-18

Just reading sk32578:
When SecureXL is enabled, all packets should be accelerated, except packets that match the following conditions:
All packets that match a rule, whose source or destination is the Security Gateway itself.

So I guess in my environment, with only one user establishing connections, the percentage of accelerated traffic is expected to be low.
And if this user is down, then pretty much 100% of the packets should be non-accelerated.
I guess it would still be interesting to double-check this, if that is possible.

shais · ‎2021-02-18

When your testing VM is down, the only traffic you have is local connections, which take the slow path by design.

So it looks like you don't have any issue related to SecureXL here, but something does trigger a high load, which causes IPS to enter bypass. We will continue offline to analyze it.

Timothy_Hall · ‎2021-02-18

Generally enabling the IPS bypass feature is not a good idea. When monitoring the CPUs if even one of them hits the CPU % threshold, IPS functions on ALL CPUs are bypassed. This was fine when firewalls only had a few cores, but not really appropriate with many cores. Really the IPS Bypass feature should average the CPU utilization of all the workers when making the decision of whether to bypass. See here:

sk107334: IPS Bypass is triggered even when CPU utilization is not over the defined threshold
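
To illustrate the difference between the two decision rules described above (bypass when any single core crosses the threshold versus bypass when the average across cores does), here is a small sketch. It is not the gateway's actual implementation: the function names, the 90% threshold and the ">=" comparison are assumptions for illustration, and the per-core numbers are modeled on the original post (6 cores, one spiking to 100%, the rest around 2%).

```python
def bypass_if_any_core(core_usage, threshold):
    """Rule as described in the post: one busy core is enough to bypass."""
    return any(usage >= threshold for usage in core_usage)


def bypass_if_average(core_usage, threshold):
    """Alternative rule suggested in the post: compare the mean instead."""
    return sum(core_usage) / len(core_usage) >= threshold


# Hypothetical snapshot: one core at 100%, five cores at 2%.
usage = [2, 2, 100, 2, 2, 2]
THRESHOLD = 90  # assumed value, not taken from the thread

print(bypass_if_any_core(usage, THRESHOLD))   # True
print(bypass_if_average(usage, THRESHOLD))    # False
```

With these numbers the average is about 18%, so an averaging rule would not bypass, while the any-core rule does.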

As @shais said it looks like something in T91 is causing occasional high CPU and triggering the IPS bypass; so the IPS bypass activating is just a symptom of your problem but not the cause. Normally the next step is to figure out in what mode the CPU spikes are (kernel vs. process space - us/sy/si/hi in top); you can use sar for that but it looks like the spikes are too short for sar to reliably pick up. You'll have to catch whatever it is "in the act" with top, or look in the spike detective logs here: /var/log/spike_detective/spike_detective.log

Attend my Gateway Performance Optimization R81.20 course
CET (Europe) Timezone Course Scheduled for July 1-2

Luis_Miguel_Mig · ‎2021-02-18

Yeah absolutely.
/var/log/spike_detective/spike_detective.log doesn't say too much though. Just the duration of the spike and the core id.
Sar seems to capture stats only every 10 mins and the spikes last between 10 and 20 secs.
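
Since the spikes last 10 to 20 seconds and the sar samples are 10 minutes apart, one option is to sample per-core usage every second and log only the readings above a threshold. The sketch below does that from /proc/stat on Linux. It assumes a Python 3 interpreter is available on the box, and the function names, the 90% threshold and the 1-second interval are illustrative choices, not anything from Check Point.

```python
import time


def parse_proc_stat(text):
    """Return {cpu name: (busy jiffies, total jiffies)} for per-core lines."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Columns: user nice system idle iowait irq softirq steal ...
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values[:8])
        result[fields[0]] = (total - idle, total)
    return result


def busy_percent(prev, curr):
    """Per-core busy percentage between two parse_proc_stat() snapshots."""
    percents = {}
    for cpu, (busy, total) in curr.items():
        prev_busy, prev_total = prev.get(cpu, (busy, total))
        elapsed = total - prev_total
        percents[cpu] = 100.0 * (busy - prev_busy) / elapsed if elapsed > 0 else 0.0
    return percents


def watch(threshold=90.0, interval=1.0):
    """Print a timestamped line whenever a core is at or above threshold."""
    with open("/proc/stat") as stat_file:
        prev = parse_proc_stat(stat_file.read())
    while True:
        time.sleep(interval)
        with open("/proc/stat") as stat_file:
            curr = parse_proc_stat(stat_file.read())
        for cpu, pct in sorted(busy_percent(prev, curr).items()):
            if pct >= threshold:
                print(time.strftime("%Y-%m-%d %H:%M:%S"), cpu, f"{pct:.0f}%")
        prev = curr

# Call watch() to start sampling; stop it with Ctrl+C.
```

This only shows which core is busy and when. To see which process is responsible, the timestamps would still have to be matched against top or the spike detective log.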

Zolo · ‎2021-11-13

@Luis_Miguel_Mig I have the same problem. Did you find the root cause?

Luis_Miguel_Mig · ‎2021-11-15

It happens when the gateway loads the Anti-Bot/Anti-Virus signatures at the times scheduled in the SmartConsole configuration. You can reproduce it with fw load_sigs.

Timothy_Hall · ‎2021-11-15

This is expected behavior but it only spikes a single core, so the chances of affecting traffic handling are pretty low: sk174347: Software blade updates may cause single CPU spikes


Luis_Miguel_Mig · ‎2021-11-16

Yeah, it is a known issue now, and it can affect only the traffic going through the firewall instance/CPU core where the signature loading process is running.
So if you have, for example, 4 cores/firewall instances, 25% of the traffic can be affected.
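
The 25% figure is simply one instance out of four. As a quick sketch of that arithmetic for other instance counts, under the assumption that traffic is spread evenly across firewall instances (the function name is made up for illustration):

```python
def affected_share_percent(fw_instances):
    """Percent of traffic handled by one instance, assuming an even spread."""
    if fw_instances < 1:
        raise ValueError("need at least one firewall instance")
    return 100.0 / fw_instances


print(affected_share_percent(4))  # the 4-instance example above
print(affected_share_percent(3))  # the 3 workers from the original post
```

In practice the spread is not necessarily even, so this is an upper-level estimate rather than a measurement.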

Zolo · ‎2021-11-16

@Timothy_Hall: Thank you for the SK. But the customer's problem is that IPS goes into bypass and traffic is not inspected by IPS for 1 minute because of a little signature update, on a 72-core appliance with low network traffic. BTW the Anti-Bot/Anti-Virus blades are off; only the IPS blade is on (and FW, of course).

Is this expected behavior that we can't change, or can I turn off signature updates somehow?

Zolo · ‎2021-11-16

@Luis_Miguel_Mig: Thank you!!! That was my guess, but I could not find how to reproduce it by hand. Thanks again 😊

Luis_Miguel_Mig · ‎2021-11-22

I was wondering if we could use affinity settings to make this process run on a specific CPU core. I have more CPU cores than firewall workers.

Timothy_Hall · ‎2021-11-22

You could use affinity to do that, but it won't matter to the IPS Bypass feature, as all it takes is one saturated core (regardless of type) for IPS to get disabled. The IPS Bypass feature was a good idea in the days when firewalls only had 1-2 cores, not so much in today's world.


Luis_Miguel_Mig · ‎2021-11-22

That would be okay with me. I don't mind getting an IPS bypass. I may disable the IPS bypass feature altogether.
But how could I set the affinity of the fw process to a specific core, so that fw load_sigs runs on a core free of fw workers?
