Solved: Re: 15600 gw high memory utilization

AigarsK · ‎2023-03-28

Hi All,

This post looks much like earlier ones I have seen, but I am posting because I believe (as we all do) that the previous cases do not address the condition I am seeing.

We have two 15600 gateways in an Active/Standby cluster. The Active gateway quite often runs at memory utilization above 80%. We are currently on R81.10 with Jumbo Hotfix 87 deployed.

free -mt
              total        used        free      shared  buff/cache   available
Mem:          15692       11703        1623           4        2365        2204
Swap:         18449         713       17735
Total:        34141       12417       19359

enabled_blades: fw urlf av appi ips identityServer anti_bot mon

fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 15 | 926 | 1991
1 | Yes | 30 | 920 | 1909
2 | Yes | 14 | 941 | 1926
3 | Yes | 29 | 916 | 1781
4 | Yes | 13 | 957 | 1847
5 | Yes | 28 | 902 | 1970
6 | Yes | 12 | 898 | 1900
7 | Yes | 27 | 902 | 1881
8 | Yes | 11 | 906 | 1925
9 | Yes | 26 | 865 | 1876
10 | Yes | 10 | 906 | 1937
11 | Yes | 25 | 962 | 1904
12 | Yes | 9 | 963 | 1936
13 | Yes | 24 | 920 | 1891
14 | Yes | 8 | 918 | 1910
15 | No | - | 269 | 1915
16 | No | - | 267 | 1877
17 | Yes | 22 | 990 | 1899
18 | Yes | 6 | 961 | 1928
19 | Yes | 21 | 930 | 1964
20 | Yes | 5 | 1007 | 1889
21 | Yes | 20 | 937 | 1889
22 | Yes | 4 | 950 | 1943
23 | Yes | 19 | 878 | 1868
24 | Yes | 3 | 975 | 1960
25 | Yes | 18 | 905 | 1890
26 | Yes | 2 | 899 | 1915

At the time of writing I am seeing 86% memory utilization and a reported concurrent connection count of around 23000.
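As a cross-check on that connection count, the per-instance figures from the `fw ctl multik stat` output can be added up. The sketch below is only an illustration: the function name is hypothetical, and it assumes the pipe-separated layout exactly as pasted in this post.

```python
def sum_multik_connections(table_text):
    """Sum the Connections column of pasted `fw ctl multik stat` output."""
    total = 0
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows have five cells and a numeric instance ID; skip the rest.
        if len(cells) != 5 or not cells[0].isdigit():
            continue
        total += int(cells[3])
    return total


# Three rows copied from the table above, just to show the parsing.
sample = """\
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 15 | 926 | 1991
1 | Yes | 30 | 920 | 1909
15 | No | - | 269 | 1915
"""
print(sum_multik_connections(sample))  # 2115
```

Adding up all 27 rows of the table above (including the two inactive instances) gives 23770, which is in line with the roughly 23000 reported.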

We are also using our Check Point gateways as a proxy for client traffic; the proxy setting is pushed to client browsers using GPO and a PAC file. This is our poor man's way of ensuring that remote workers' traffic passes through HQ before an allow or deny decision is made (all clients run Always On VPN to reach the Check Point gateways). P.S. We are migrating away from this solution to a cloud proxy.

It is worth mentioning that the Check Point is our internal firewall, sitting between various VRFs hosting user, guest, and server networks. All traffic at the Internet Edge is handled by a pair of Cisco Firepower boxes, with user traffic passing from the Check Point outside interface to what is considered the Firepower inside interface.

Our Internet Edge naturally sees twice as many concurrent connections when a large number of remote workers are out and about or working from home. We have noticed that when memory utilization on the Check Point is high, Internet Edge connections also grow, but by a factor of 3. We see lots of "First packet not TCP-SYN", yet there is no asymmetric routing as such, since there is evidence that a connection on that particular port existed on the Check Point firewall a couple of minutes earlier. I suspect that when memory utilization is high the gateway kills off connections without notifying the client or the server, and the client just keeps chatting away, expecting the same session to still be active.

User browser sessions to external resources sometimes fail to load content and display the message Err_SSL_Protocol_Error. Sometimes this is displayed only briefly and the browser then proceeds to load the web page.

We logged a case with our Check Point value-added reseller's support, who eventually raised it with Check Point TAC. That led to the Check Point account manager calling me to discuss the case and proposing to close it, after we advised that we are sticking with this firewall and are not interested in additional memory. Later I received an email stating that they want to close the ticket because we are pushing the firewall beyond its expected load, so the fix is to buy more memory or upgrade.

This is interesting in itself, as we recently ran the cpsizeme script, which determined that we are underutilizing our firewall and advised that the way forward would be a 6700. Granted, this was done while we were still running R81.

So I am sort of left in purgatory: nothing further has progressed on the ticket with Check Point TAC, the only option offered is buying more RAM, and I still do not understand why cpsizeme indicated that a lower-tier new gateway would suffice.

This is after the last email I received, which stated the following:

---------------------------

Check Point have come back to us and confirmed the following:

There is a memory leak and it is in the Hash Kernel Memory (HMEM) as confirmed by the TAC team and R&D
R&D have confirmed this issue can occur in the HMEM and the solution is as follows:

Move the firewall from running in Kernel Space to run in User Space, with the motivation being to improve memory utilization - SK167052

---------------------------

We implemented SK167052 and faced some issues after the firewall reboot: sync was not taking place, though it resolved itself after about an hour. Now there are issues failing over the active firewall (separate case logged), but I have a feeling that the ongoing memory utilization issue is still the culprit behind what we are seeing.

I would appreciate any help in determining which process is consuming all the memory and pushing the gateway into swap territory.

Is there potential that gateways are suffering by not having enough PAT translation space, considering that all Proxy clients would originate traffic from Check Point gateway outside interface?
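For the question of which process is consuming memory and swap, one generic Linux approach is to read the `VmRSS` and `VmSwap` fields from `/proc/<pid>/status`. The sketch below is an illustration only: it assumes a Python 3 interpreter is available on the host (not something this thread confirms), and the function names are hypothetical. It also only sees user-space processes, so memory held by the kernel itself would not show up against any process in this listing.

```python
import os


def parse_status(status_text):
    """Return (name, rss_kb, swap_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()

    def kb(key):
        # Values look like "123456 kB"; kernel threads have no Vm* lines.
        parts = fields.get(key, "").split()
        return int(parts[0]) if parts else 0

    return fields.get("Name", "?"), kb("VmRSS"), kb("VmSwap")


def top_processes(limit=10, proc_root="/proc"):
    """List the processes with the most swapped-out memory, largest first."""
    if not os.path.isdir(proc_root):
        return []
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, rss_kb, swap_kb = parse_status(handle.read())
        except OSError:
            continue  # process exited between listdir() and open()
        rows.append((swap_kb, rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for swap_kb, rss_kb, pid, name in top_processes():
        print(f"{pid:>7} {name:<20} rss={rss_kb // 1024} MB swap={swap_kb // 1024} MB")
```

Sorting by swap first and resident memory second puts the processes that were paged out at the top, which is the part of the question that `free` alone cannot answer.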

PhoneBoy · ‎2023-03-28

Am I reading correctly that you’re using the Check Point as an explicit proxy for your Remote Access clients?
Explicit proxy mode has very different performance characteristics:
https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...
It’s also not a configuration we generally recommend.

AigarsK · ‎2023-03-29

Many thanks for sharing the link. As mentioned, we are looking to move to a cloud proxy and are just waiting on the last signature to get access to the tool.

I appreciate that this is a performance-taxing solution. It did well during the lockdown period and had been running since 2019 without ever having this level of performance issues, which is why it comes as a surprise that it cannot cope now.

PhoneBoy · ‎2023-03-29

One of the things about proxy mode is that you basically double the number of connections the gateway has to process.
NAT doubles it again.
Even a modest increase in the number of users or connections could put you in a situation where you're running out of kernel memory.
What's the connection table look like numbers wise when the problem is occurring?

cpsizeme and other similar tools do not take into account usage as an explicit proxy.
If this configuration would remain long-term, I would recommend a conversation with your Check Point SE.
Given this is effectively a short-term problem, I can see why you want to avoid buying more memory if you have to.
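Taking the rule of thumb above at face value (explicit proxy doubles the connections to track, and NAT doubles them again), the arithmetic can be sketched as follows. The function name and the session count of 6000 are made up purely for illustration; they are not measurements from this cluster.

```python
def estimated_tracked_connections(client_sessions, explicit_proxy=True, nat=True):
    """Apply the thread's rule of thumb: proxy doubles, NAT doubles again."""
    factor = 1
    if explicit_proxy:
        factor *= 2
    if nat:
        factor *= 2
    return client_sessions * factor


# 6000 is a made-up session count, used only to show the arithmetic.
for proxy, nat in [(False, False), (True, False), (True, True)]:
    print(proxy, nat, estimated_tracked_connections(6000, proxy, nat))
# False False 6000
# True False 12000
# True True 24000
```

Under that rule of thumb, a few thousand proxied client sessions is already enough to produce a connection count in the tens of thousands.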

AigarsK · ‎2023-03-29

Thanks for the advice.

AigarsK · ‎2023-03-30

Hi PhoneBoy,

Just wanted to check something with you. I encountered another high memory utilization event. It looks like the Active gateway was already using swap, and I pushed a policy install, which took utilization even further, to 89%.

I checked and noticed that Aggressive Aging was active. I also had some connection drops, but they would count as new connections, where the web browser again gave me "Err_SSL_Protocol_Error".

I ended up taking this gateway down with cpstop, because the last time I performed a priority change and push from SmartConsole it resulted in an Active/Active firewall state (case logged).

The part I do not understand is why swap is still in use:

free -mt
              total        used        free      shared  buff/cache   available
Mem:          15692        5216        8127           4        2348        8639
Swap:         18449         247       18202
Total:        34141        5463       26329

Also, looking at both gateways, when the Active one has its memory at or above 80%, the Standby gateway has memory at 72 to 76% and also reports that swap is in use. I accept that they sync connections, but if the Standby is not actively forwarding traffic, why would its swap be used?

Timothy_Hall · ‎2023-03-30

In my experience allocated swap memory is never freed until reboot. Once it is allocated it will stay that way even if there is now plenty of available RAM. I believe this is because swap space allocation is a somewhat expensive operation that Linux doesn't want to do more than it needs to.

When you say active is 80% memory utilized and standby is 72-76% utilized how are you computing that? Ignore the value reported for "free"; for the Mem line what is (available/total)*100? Your free -mt output indicates to me you are using the "free" value for this calculation which does not mean what you think it means.
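The calculation asked about above can be sketched in a few lines. This is only an illustration, with a hypothetical function name: it assumes `free` output that has the "available" column as the last number on the Mem line, as in the outputs pasted in this thread, and it reports utilization as 100 minus the available percentage.

```python
def mem_utilization_percent(free_output):
    """Percent of RAM in use, based on the 'available' column of `free -m`."""
    for line in free_output.splitlines():
        line = line.strip()
        if line.startswith("Mem:"):
            numbers = [int(token) for token in line.split()[1:]]
            # Assumption: first number is total, last number is available.
            total, available = numbers[0], numbers[-1]
            return round(100 * (total - available) / total, 1)
    raise ValueError("no 'Mem:' line found")


# Mem lines copied from the two free -mt outputs earlier in the thread.
high = "Mem: 15692 11703 1623 4 2365 2204"
after_cpstop = "Mem: 15692 5216 8127 4 2348 8639"
print(mem_utilization_percent(high))          # 86.0
print(mem_utilization_percent(after_cpstop))  # 44.9
```

For the first output this comes to about 86%, which matches the figure quoted at the top of the thread; for the output taken after cpstop it is about 45%.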


AigarsK · ‎2023-03-31

Thanks for your explanation of swap utilization. As for memory utilization, in the first instance I am looking at SmartConsole; usually when I see it over 80% I check the Aggressive Aging state, and when that is active I know I am in trouble.

I have reverted the change that moved the firewall from Kernel Space to User Space, because since making that change I have seen more days in the red than ever before.

Friday traffic is not usually one to benchmark against.

I appreciate everyone's input on this post. I will continue to work with our support provider and will prioritize migrating away from the proxy.

Chris_Atkinson · ‎2023-03-28

What blades if any are to be disabled on this system with the proxy migration and how much NAT is it performing otherwise?

From solely a memory expansion perspective the 6700 would constrain you to 32GB max.
15600 Max memory population: 64GB
6900 Max memory population: 64GB

Some helpful resources may include:

sk60768: How to reject out of state packets

https://community.checkpoint.com/t5/General-Topics/R80-x-Performance-Tuning-Tip-Connection-Table/td-... (Tip 7 can be a quick win in some cases).


AigarsK · ‎2023-03-29

Thanks Chris,

I did implement Tip 7. Good resources in general; I have made a note of some of the commands.

Sorin_Gogean · ‎2023-03-30

Hello @AigarsK ,

We've also had memory "issues" on our 15600 clusters (6 appliances in total). After a bunch of tweaks we went from 16 GB to 32 GB and all was good for a while; recently (a couple of months ago) we went to the maximum of 64 GB.

All I can tell you is that even with 32 GB, memory was close to 80% (around 24-25 GB used), so we decided to go to 64 GB. Compared with you, we have SSL Inspection, which I guess would be about the same memory-wise as proxy.

[Expert@AxxA-FW02:0]# enabled_blades
fw urlf av appi ips identityServer SSL_INSPECT anti_bot ThreatEmulation mon
[Expert@AxxA-FW02:0]#

[Expert@AxxA-FW02:0]# free -mt
              total        used        free      shared  buff/cache   available
Mem:          63684       23704       17799           7       22180       38157
Swap:         32143           0       32143
Total:        95828       23704       49943
[Expert@AxxA-FW02:0]#

Currently we're at 64 GB, like I said, and you can see the utilization above.

Thank you,

PhoneBoy · ‎2023-03-30

HTTPS Inspection causes a similar impact to explicit proxy in that it doubles the number of connections to track.
Therefore, the memory impact is probably similar.

HTTPS Inspection should perform better than explicit proxy mode as:

HTTPS Inspection is performed in Medium Path
Explicit Proxy support occurs via Slow (F2F) Path
