We have (or rather had) some serious issues with our gateways: the latency through the FW modules grew and grew over time. We saw latencies *within* our network climb from the usual 1-2 ms up to 130 ms whenever traffic passed the firewall. Once the latency reached a certain point, it all collapsed ( ! ): the modules spontaneously stopped forwarding any traffic at all. Unfortunately, without any HA failover. The cluster just kept running without passing new traffic, which is a rather bad condition for a central firewall.
So after weeks of debugging and testing we found a hint pointing to a kernel bug in the underlying Linux: when the firewall had to issue ICMP unreachable packets (which in our case apparently was an unusual traffic pattern), the kernel leaked memory. This led to growing memory consumption, which 1) caused the growing latency, and 2) eventually caused the collapse: once memory was full, all CPUs went to 100%. We also found that ongoing connections were not affected when it all collapsed, which was most probably the reason the HA mechanisms never triggered: the cluster members just exchanged their traffic as before.
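For anyone wanting to watch for this on their own boxes: on the 2.6-era kernels in use at the time, the dst/route cache entry count and its limit are exposed under /proc, so a small watcher can show the leak building up long before the collapse. A minimal Python sketch; the paths match older kernels, and the 90% warning threshold is my own assumption, not anything Check Point ships:

#!/usr/bin/env python3
"""Sketch: watch the kernel's dst/route cache fill up.

Assumes an older 2.6-era kernel that still exposes
/proc/net/stat/rt_cache; on such kernels the "dst cache overflow"
message fires once the entry count exceeds net.ipv4.route.max_size.
"""
import time

RT_CACHE_STAT = "/proc/net/stat/rt_cache"       # per-CPU stats, hex columns
ROUTE_MAX_SIZE = "/proc/sys/net/ipv4/route/max_size"

def dst_entries():
    with open(RT_CACHE_STAT) as f:
        lines = f.read().splitlines()
    # First column ("entries") is the global total, repeated on every CPU row.
    return int(lines[1].split()[0], 16)

def max_size():
    with open(ROUTE_MAX_SIZE) as f:
        return int(f.read())

if __name__ == "__main__":
    limit = max_size()
    while True:
        used = dst_entries()
        pct = 100.0 * used / limit
        print(f"dst cache: {used}/{limit} entries ({pct:.1f}%)")
        if pct > 90:   # illustrative threshold
            print("WARNING: approaching 'dst cache overflow' condition")
        time.sleep(60)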
We then had to convince CP Support/R&D that this really was our problem. We now have a bugfix for it, running on one module, and the memory leak seems to be gone.
How we discovered the latency issue: we use a graphing tool called SmokePing, where latency is shown as "smoke" in the graphs. So we could see the growing latency once we knew where to look. Our quick fix for the issue was simply to reboot the cluster nodes every 5 days and fail over to the freshly rebooted node.
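For illustration, here is roughly what such a probe does, as a minimal Python sketch: repeated pings through the firewall, so a slowly rising baseline and growing spread become visible. The target address and probe counts are placeholders, and this is of course no replacement for SmokePing itself:

#!/usr/bin/env python3
"""Sketch of a SmokePing-style probe: several pings per round,
logging median RTT, spread ("smoke") and loss."""
import re
import statistics
import subprocess
import time

TARGET = "10.0.0.1"   # placeholder: a host on the far side of the firewall
PROBES = 10           # pings per measurement round

def ping_rtt_ms(host):
    """Return one RTT in ms, or None on loss/timeout."""
    out = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    return float(m.group(1)) if m else None

while True:
    rtts = [r for r in (ping_rtt_ms(TARGET) for _ in range(PROBES)) if r is not None]
    if rtts:
        print(f"median {statistics.median(rtts):.1f} ms, "
              f"spread {max(rtts) - min(rtts):.1f} ms, "
              f"loss {PROBES - len(rtts)}/{PROBES}")
    time.sleep(300)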
We have an R80.10 FW cluster, both on the modules and the management; we upgraded in November 2017, and the error first occurred in January 2018. The error line logged when it collapsed was
Feb 1 23:45:12 2018 fwm-name1 kernel: dst cache overflow
The underlying problem in the Linux kernel was CVE-2009-0778.
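In hindsight, even a trivial log watcher would have alerted us on that kernel message much earlier. A minimal sketch, assuming syslog lands in /var/log/messages as it does on our boxes:

#!/usr/bin/env python3
"""Sketch: tail the syslog and alert on the kernel message above.
Adjust LOGFILE if your syslog is configured differently."""
import time

LOGFILE = "/var/log/messages"
PATTERN = "dst cache overflow"

with open(LOGFILE) as f:
    f.seek(0, 2)                 # start at end of file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if PATTERN in line:
            # Hook your real alerting in here (mail, SNMP trap, ...).
            print("ALERT:", line.strip())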
Maybe this helps someone to identify such errors.