Hi All,
This is a fairly typical post judging by the earlier ones I have seen, but I am making it because I believe (as we all do) that the previous cases do not address the condition I am seeing.
We have two 15600 gateways in an Active/Standby cluster. The active gateway quite often runs at memory utilization above 80%. We are currently on R81.10 with Jumbo Hotfix 87 deployed.
free -mt
              total        used        free      shared  buff/cache   available
Mem:          15692       11703        1623           4        2365        2204
Swap:         18449         713       17735
Total:        34141       12417       19359
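As a cross-check on these numbers, here is a minimal sketch that turns `free -m` output into a utilization percentage. It assumes utilization is measured as (total - available) / total, which gives roughly 86% for the sample; dividing the `used` column by total gives only about 75%, so the definition matters when comparing against what monitoring reports. The function name `mem_used_pct` is just an illustration.

```shell
# Percentage of RAM in use, computed as (total - available) / total
# from the "Mem:" row of `free -m` output read on stdin.
mem_used_pct() {
    awk '$1 == "Mem:" { printf "%.0f\n", ($2 - $7) * 100 / $2 }'
}

# Sample row copied from the output in this post.
printf 'Mem: 15692 11703 1623 4 2365 2204\n' | mem_used_pct   # prints 86
```

On a live gateway the same function can be fed directly with `free -m | mem_used_pct`.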
enabled_blades: fw urlf av appi ips identityServer anti_bot mon
fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 15 | 926 | 1991
1 | Yes | 30 | 920 | 1909
2 | Yes | 14 | 941 | 1926
3 | Yes | 29 | 916 | 1781
4 | Yes | 13 | 957 | 1847
5 | Yes | 28 | 902 | 1970
6 | Yes | 12 | 898 | 1900
7 | Yes | 27 | 902 | 1881
8 | Yes | 11 | 906 | 1925
9 | Yes | 26 | 865 | 1876
10 | Yes | 10 | 906 | 1937
11 | Yes | 25 | 962 | 1904
12 | Yes | 9 | 963 | 1936
13 | Yes | 24 | 920 | 1891
14 | Yes | 8 | 918 | 1910
15 | No | - | 269 | 1915
16 | No | - | 267 | 1877
17 | Yes | 22 | 990 | 1899
18 | Yes | 6 | 961 | 1928
19 | Yes | 21 | 930 | 1964
20 | Yes | 5 | 1007 | 1889
21 | Yes | 20 | 937 | 1889
22 | Yes | 4 | 950 | 1943
23 | Yes | 19 | 878 | 1868
24 | Yes | 3 | 975 | 1960
25 | Yes | 18 | 905 | 1890
26 | Yes | 2 | 899 | 1915
At the time of writing I am seeing 86% memory utilization and a reported concurrent connection count of around 23,000.
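To compare the reported connection count with the per-instance table, the `Connections` column of `fw ctl multik stat` can be totalled. The sketch below assumes the pipe-separated layout shown in this post; the function name `multik_summary` is hypothetical. It also lists instances whose `Active` column is not `Yes`.

```shell
# Sum the "Connections" column of `fw ctl multik stat` output (stdin)
# and list the IDs of instances that are not marked active.
multik_summary() {
    awk -F'|' '
        $1 ~ /^[ ]*[0-9]+[ ]*$/ {                 # data rows start with a numeric ID
            id = $1; active = $2
            gsub(/ /, "", id); gsub(/ /, "", active)
            total += $4                           # Connections column
            if (active != "Yes")
                inactive = inactive (inactive ? "," : "") id
        }
        END { printf "total=%d inactive=%s\n", total, inactive }
    '
}

# Usage on the gateway:
#   fw ctl multik stat | multik_summary
```

For the table posted here this gives `total=23770 inactive=15,16`, which is in line with the roughly 23,000 connections reported and shows that two instances are not active.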
We are also using our Check Point gateways as a proxy for client traffic; the proxy settings are pushed to client browsers using GPO and a PAC file. This is our poor man's way of ensuring that remote workers' traffic passes through HQ before the allow or deny decision is made (all clients run Always On VPN to reach the Check Point gateways). P.S. We are migrating away from this solution to a Cloud Proxy.
It is worth mentioning that the Check Point is our internal firewall, sitting between the various VRFs hosting user, guest, and server networks, and all traffic at the Internet Edge is handled by a pair of Cisco Firepower boxes. User traffic is passed from the Check Point outside interface to what is considered the Firepower inside interface.
Our Internet Edge would of course see twice as many concurrent connections when a large number of remote workers are out and about or working from home. We have noticed that when memory utilization is high, Internet Edge connections also grow, but by a factor of 3. We see lots of "First packet not TCP-SYN", but there is no asymmetric routing as such, because there is evidence that a connection on the particular port existed on the Check Point firewall a couple of minutes earlier. I suspect that when memory utilization is high the gateway kills off connections without notifying either the client or the server, and the client just keeps chatting away expecting the same session to still be active.
User browser sessions to external resources sometimes fail to load content and display the message Err_SSL_Protocol_Error. Sometimes this is displayed only briefly and the browser then proceeds to load the webpage.
We logged a case with our Check Point value-added reseller's support, which was eventually raised with Check Point TAC. That led to the Check Point account manager calling me to discuss the case and proposing to close it, after we advised that we are sticking with this firewall and are not interested in additional memory. I later received an email stating that they want to close the ticket because we are pushing the firewall beyond its expected load, so the fix is to buy more memory or upgrade.
That is interesting in itself, as we recently ran the cpsizeme script, which determined that we are underutilizing our firewall (granted, this was done while we were still running R81) and advised that the way forward would be a 6700.
So I am left in a sort of purgatory: nothing further has progressed on the ticket with Check Point TAC, and the only option offered is buying more RAM, which I am not convinced by given that cpsizeme indicated a lower-tier new gateway would suffice.
This comes after the last email I received, which stated the following:
---------------------------
Check Point have come back to us and confirmed the following:
- There is a memory leak and it is in the Hash Kernel Memory (HMEM) as confirmed by the TAC team and R&D
- R&D have confirmed this issue can occur in the HMEM and the solution is as follows:
Move the firewall from running in Kernel Space to run in User Space, with the motivation being to improve memory utilization - SK167052
---------------------------
We implemented SK167052 and faced some issues after the firewall reboot, where sync was not taking place; it resolved itself after about an hour. Now there are issues failing over the active firewall (separate case logged), but I have a feeling that the ongoing memory utilization issue is still the culprit behind the problems we are seeing.
I would appreciate any help in determining which process is consuming all the memory and pushing the gateway into swap territory.
Is it possible that the gateways are suffering from insufficient PAT translation space, considering that all proxy clients originate traffic from the Check Point gateway's outside interface?
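As a starting point for that, here is a minimal sketch that ranks processes by resident memory or by swap usage using the standard Linux `/proc/<pid>/status` fields `VmRSS` and `VmSwap`. The function name `top_by` and its arguments are illustrative only, and it only covers user-space processes: memory held by the kernel itself does not appear in these per-process figures.

```shell
# Rank processes by a memory field from /proc/<pid>/status.
# $1 = field name (VmRSS or VmSwap), $2 = rows to show (default 10),
# $3 = proc root (default /proc; overridable for testing).
top_by() {
    field="$1"; rows="${2:-10}"; root="${3:-/proc}"
    for f in "$root"/[0-9]*/status; do
        # Print "<kB> <process name>"; kernel threads have no Vm* lines and are skipped.
        awk -v key="$field:" '
            $1 == "Name:" { name = $2 }
            $1 == key     { print $2, name }
        ' "$f" 2>/dev/null
    done | sort -rn | head -n "$rows"
}
```

Running `top_by VmRSS 10` and then `top_by VmSwap 10` on the active member shows which processes hold the most RAM and which have been pushed furthest into swap; values are in kB.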