Solved: High memory usage on 1570/1590

Morten_O · ‎2024-06-18

Hi,

we have lately hardware-refreshed a lot of 1400-appliance to mainly 1570 and 1590 models.

All are now running R81.10.10 build 996002945

A few times, we have had reports, that the appliances becomes unresponsive (not even answering ping, ssh or webui) and has to be power-cycled to start working again.

So I checked at multiple customers, and I can see, that they are all running with very high memory utilization - above 80%.

All are centrally managed, and I have seen this at multiple customers - so very different policies etc.

One of the customers is not even running IPS, which is known for intense utilization (at least on the 1400-appliances).

Are others seeing the same? Can it be a memoryleak or....?

I already opened a SR (where first recommendation was to upgrade......), but I was just interested in hearing if I'm the only one seeing this picture.

belgur · ‎2024-06-28

Hi

i have the same problem. TAC report to me the problem is hcp script in management server.
Next week will be release a nev version as can be found in HealthCheck Point (HCP) Release Updates

View solution in original post

Amir_Ayalon · ‎2024-07-03

Hi All

High memory usage on Centrally managed SMB is caused by HCP (healthcheck point) in management server.

HCP tries to run some python commands on SMB appliance which doesn't support Python. Due to this, multiple sfwd instances were created and memory was not released.

Customers will start to get the automatic HCP update in the next few days. Meanwhile, you can also update the version manually using below steps,

Download HCP TAR

https://support.checkpoint.com/results/download/134058

In Expert mode,

Run:

# autoupdatercli install <Full Path to the TAR Package

run this command to verify hcp version hcp -v

you should see or higher build:

HCP Take: 58
HCP RPM Build: hcp-1-592320.i386
hcp-1-592021.i386

Reboot the problematic SMB appliance to free the allocated memory.
Once the SMB gw is up, push policy again.

View solution in original post

Lesley · ‎2024-06-18

High memory load could be normal on Linux based systems. Maybe share a top output. You should focus on swap. High swap could be an indication of highly loaded systems.

Tasks: 280 total, 3 running, 175 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.1%us, 7.7%sy, 0.0%ni, 79.8%id, 0.0%wa, 0.2%hi, 1.2%si, 0.0%st
Mem: 8024384k total, 7229184k used, 795200k free, 208448k buffers
Swap: 0k total, 0k used, 0k free, 1125312k cached

Here an example. As you can see swap is 0k and memory itself looks loaded but it is not due low swap.

Second, the way to check if there is a memory leak is to monitor this with a monitoring tool. I use this to see the memory in a graph from longer period of time. If you see a small increase every day for example it is an indication of a memory leak.

Third, maybe pay some attention to the SFWD daemon. If have see people with problems related to this daemon, that it is restarting and causing issues.

cat $FWDIR/log/sfwd.elg

cat $FWDIR/log/sfwd.elg | grep SFWD

cat $FWDIR/log/sfwd.elg | grep -i '360 MB' -A 1

cpwd_admin list (check here if the SFWD is restarting

4 point maybe try to run a doctor spark on a problematic gateway, maybe you can find something in this health check.

https://sc1.checkpoint.com/documents/SMB_R81.10.X/AdminGuides_Locally_Managed/EN/Content/Topics/DrSp...

-------
Please press "Accept as Solution" if my post solved it 🙂

Morten_O · ‎2024-06-18

Hi,

I searched through a sfwd.elg yesterday, and didn't find anything of interest, and no 360 MB entries, and not watchdog-restarts.

The one I searched is the same as this output is from - it's from a 1570 yesterday, and this morning it had to be power-cycled because it was unresponsive. As you can see, it's not doing much from a connection-perspective, but it sure uses some memory and a little swap.....

[Expert]#free
total used free shared buff/cache available
Mem: 1958900 1756676 72792 54644 129432 79332
Swap: 524284 524284 0

[Expert]# fw ctl pstat

System Capacity Summary:
Memory used: 39% (527 MB out of 1339 MB) - below watermark
Concurrent Connections: 1% (2476 out of 149900) - below watermark
Aggressive Aging is disabled

Hash kernel memory (hmem) statistics:
Total memory allocated: 343760896 bytes in 83926 (4096 bytes) blocks using 3 pools
Initial memory allocated: 276824064 bytes (Hash memory extended by 66936832 bytes)
Memory allocation limit: 629145600 bytes using 512 pools
Total memory bytes used: 0 unused: 343760896 (100.00%) peak: 316881572
Total memory blocks used: 0 unused: 83926 (100%) peak: 80954
Allocations: 3937903068 alloc, 0 failed alloc, 3936443903 free

System kernel memory (smem) statistics:
Total memory bytes used: 647431900 peak: 681652464
Total memory bytes wasted: 5449467
Blocking memory bytes used: 3048884 peak: 3275656
Non-Blocking memory bytes used: 644383016 peak: 678376808
Allocations: 8919614 alloc, 1 failed alloc, 8915235 free, 0 failed free
vmalloc bytes used: 640774312 expensive: no

Kernel memory (kmem) statistics:
Total memory bytes used: 473496748 peak: 604930896
Allocations: 3946820926 alloc, 1 failed alloc
3945359055 free, 0 failed free
External Allocations:
Packets: 286680, SXL: 3917423, Reorder: 0
Zeco: 0, SHMEM: 4320, Resctrl: 0
ADPDRV: 0, PPK_CI: 1272416, PPK_CORR: 0

Lesley · ‎2024-06-18

Is this output from BEFORE or AFTER reboot. Because if it is unresponsive I can imagine you are not able to can get this type of output. It is important to get this information before the reboot if there are memory related issues.

-------
Please press "Accept as Solution" if my post solved it 🙂

Morten_O · ‎2024-06-18

Output is from yesterday, where the mem-usage was high - I use the info in the SR. The appliances was unresponsive this morning.

Taki183 · ‎2024-06-19

We have a similar problem. Model 1590, running R81.10
It started a week ago. Memory usage 90-98%

Output from "free -m"

Click to Expand

total used free shared buff/cache available
Mem: 1816588 1671316 50120 61384 95152 33180
Swap: 0 0 0

Output from "top" sorted by Memory

Click to Expand

20594 root 20 0 409m 201m 16m S 8 11.3 66:05.64 fw
8331 root 20 0 343m 113m 1076 S 0 6.4 0:00.01 fw
24360 root 20 0 332m 101m 1096 S 0 5.7 0:00.01 fw
7077 root 20 0 331m 101m 1096 S 0 5.7 0:00.00 fw
22315 root 20 0 330m 100m 1032 S 0 5.6 0:00.01 fw
5150 root 20 0 329m 99m 1096 S 0 5.6 0:00.01 fw
20462 root 20 0 329m 98m 1096 S 0 5.5 0:00.01 fw
21845 root 20 0 312m 83m 1012 S 0 4.7 0:00.01 fw
26912 root 20 0 303m 74m 1040 S 0 4.2 0:00.01 fw
2891 root 20 0 281m 52m 1012 S 0 3.0 0:00.02 fw
8406 root 20 0 132m 23m 2036 S 0 1.3 2:09.24 cpview_api_serv
4998 root 20 0 23948 21m 2228 S 0 1.2 0:00.07 newSfwsh.bin
8410 root 20 0 144m 19m 2136 S 0 1.1 0:11.16 cpviewd
3787 root 0 -20 45808 14m 3004 S 0 0.8 25:33.82 cposd
5067 root 20 0 32064 12m 2420 S 0 0.7 0:00.08 runCliCommand.l

20594 root 20 0 409m 201m 16m S 8 11.3 66:05.64 fw8331 root 20 0 343m 113m 1076 S 0 6.4 0:00.01 fw24360 root 20 0 332m 101m 1096 S 0 5.7 0:00.01 fw7077 root 20 0 331m 101m 1096 S 0 5.7 0:00.00 fw22315 root 20 0 330m 100m 1032 S 0 5.6 0:00.01 fw5150 root 20 0 329m 99m 1096 S 0 5.6 0:00.01 fw20462 root 20 0 329m 98m 1096 S 0 5.5 0:00.01 fw21845 root 20 0 312m 83m 1012 S 0 4.7 0:00.01 fw26912 root 20 0 303m 74m 1040 S 0 4.2 0:00.01 fw2891 root 20 0 281m 52m 1012 S 0 3.0 0:00.02 fw8406 root 20 0 132m 23m 2036 S 0 1.3 2:09.24 cpview_api_serv4998 root 20 0 23948 21m 2228 S 0 1.2 0:00.07 newSfwsh.bin8410 root 20 0 144m 19m 2136 S 0 1.1 0:11.16 cpviewd3787 root 0 -20 45808 14m 3004 S 0 0.8 25:33.82 cposd5067 root 20 0 32064 12m 2420 S 0 0.7 0:00.08 runCliCommand.l

Logs from "sfwd.elg". I don't understand if there is anything suspicious here

Click to Expand

fwstatagent_check_sdwan: Getting sdwan interfaces
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
ringdir_append: ringdir-debug slot=0 dis_s=1053044 is bigger than max_slot_size=1048576
ringdir_append: ringdir-debug move to slot=1
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
ringdir_append: ringdir-debug slot=1 dis_s=1048755 is bigger than max_slot_size=1048576
ringdir_append: ringdir-debug move to slot=2
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
08:57:25.305790 Conversion start: probing_hb.conv
fwobj_get_converted_set_as_string: Conversion failed with error: /usr/local/share/lua/5.1/conversion/convert.lua:0: /usr/local/share/lua/5.1/conversion/convert.lua:0: conversion.convert.handle_match: probingServer in /opt/fw1/conf/probing_hb.conv
fwobj_get_converted_set_as_string: Failed to create result_string
08:57:25.516413 Conversion end : probing_hb.conv
fwobj_get_converted_set: Failed to create fwset instance
ProbingServerStatus::getProbingServerStatus: Error reading the probing servers file.
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
ringdir_append: ringdir-debug slot=2 dis_s=1049785 is bigger than max_slot_size=1048576
ringdir_append: ringdir-debug move to slot=3
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces

fwstatagent_check_sdwan: Getting sdwan interfacesbr_names: failed popenfwstatagent_check_sdwan: Getting sdwan interfacesringdir_append: ringdir-debug slot=0 dis_s=1053044 is bigger than max_slot_size=1048576ringdir_append: ringdir-debug move to slot=1br_names: failed popenfwstatagent_check_sdwan: Getting sdwan interfacesringdir_append: ringdir-debug slot=1 dis_s=1048755 is bigger than max_slot_size=1048576ringdir_append: ringdir-debug move to slot=2br_names: failed popenfwstatagent_check_sdwan: Getting sdwan interfaces08:57:25.305790 Conversion start: probing_hb.convfwobj_get_converted_set_as_string: Conversion failed with error: /usr/local/share/lua/5.1/conversion/convert.lua:0: /usr/local/share/lua/5.1/conversion/convert.lua:0: conversion.convert.handle_match: probingServer in /opt/fw1/conf/probing_hb.convfwobj_get_converted_set_as_string: Failed to create result_string08:57:25.516413 Conversion end : probing_hb.convfwobj_get_converted_set: Failed to create fwset instanceProbingServerStatus::getProbingServerStatus: Error reading the probing servers file.br_names: failed popenfwstatagent_check_sdwan: Getting sdwan interfacesringdir_append: ringdir-debug slot=2 dis_s=1049785 is bigger than max_slot_size=1048576ringdir_append: ringdir-debug move to slot=3br_names: failed popenfwstatagent_check_sdwan: Getting sdwan interfaces

Logs from "cpwd.elg"

Click to Expand

[cpWatchDog 3175 4155615392]@RD6281[18 Jun 12:36:58] [ERROR] Process SFWD terminated abnormally : Unhandled signal 9 (SIGKILL).
[cpWatchDog 3175 4155615392]@RD6281[18 Jun 12:37:08] [SUCCESS] SFWD started successfully (pid=669)
[cpWatchDog 3175 4155615392]@RD6281[19 Jun 6:08:56] [ERROR] Process SFWD terminated abnormally : Unhandled signal 9 (SIGKILL).
[cpWatchDog 3175 4155615392]@RD6281[19 Jun 6:09:06] [SUCCESS] SFWD started successfully (pid=20594)

[cpWatchDog 3175 4155615392]@RD6281[18 Jun 12:36:58] [ERROR] Process SFWD terminated abnormally : Unhandled signal 9 (SIGKILL).[cpWatchDog 3175 4155615392]@RD6281[18 Jun 12:37:08] [SUCCESS] SFWD started successfully (pid=669)[cpWatchDog 3175 4155615392]@RD6281[19 Jun 6:08:56] [ERROR] Process SFWD terminated abnormally : Unhandled signal 9 (SIGKILL).[cpWatchDog 3175 4155615392]@RD6281[19 Jun 6:09:06] [SUCCESS] SFWD started successfully (pid=20594)

I suspect the sfwd process but I can’t figure out why it crashes. What exactly causes it to restart?

andwen · ‎2024-06-21

We have seen this behavior as well, on model 1550 and 1590 (not on 1530 as far as we can see) with firmware-version R81.10.08 and R810.10.10 (both the latest version with the CVE-fix). If you'd check with "ps aux | grep "[f]w sfwd", I am guessing you will see multiple instances of this process with only one running. The total amount of memory (shown in percentages) that these processes are using will probably be only slightly less than the total amount of memory that is in use.

We have started a TAC-case for this issue and R&D is working on it.

PhoneBoy · ‎2024-06-21

What does cpwd_admin list say?

Taki183 · ‎2024-06-21

Nothing interesting I guess

Click to Expand

APP PID STAT #START START_TIME MON COMMAND
RNGD 3196 E 1 [19:54:13] 25/5/2024 N /pfrm2.0/bin/jitterentropy_rngd -v
SSHD 3703 E 1 [19:54:15] 25/5/2024 N /pfrm2.0/bin/sshd -f /pfrm2.0/etc/sshd_config -p 22 -D
cposd 3787 E 1 [19:54:17] 25/5/2024 N cposd
RTDB 3814 E 1 [19:54:17] 25/5/2024 N rtdbd
SFWD 12320 E 1 [20:39:40] 21/6/2024 N fw sfwd
CVIEWAPIS 8406 E 1 [19:54:35] 25/5/2024 N cpview_api_service
CPVIEWD 8410 E 1 [19:54:35] 25/5/2024 N cpviewd

APP PID STAT #START START_TIME MON COMMANDRNGD 3196 E 1 [19:54:13] 25/5/2024 N /pfrm2.0/bin/jitterentropy_rngd -vSSHD 3703 E 1 [19:54:15] 25/5/2024 N /pfrm2.0/bin/sshd -f /pfrm2.0/etc/sshd_config -p 22 -Dcposd 3787 E 1 [19:54:17] 25/5/2024 N cposdRTDB 3814 E 1 [19:54:17] 25/5/2024 N rtdbdSFWD 12320 E 1 [20:39:40] 21/6/2024 N fw sfwdCVIEWAPIS 8406 E 1 [19:54:35] 25/5/2024 N cpview_api_serviceCPVIEWD 8410 E 1 [19:54:35] 25/5/2024 N cpviewd

Pedro_Espindola · ‎2024-06-21

We have the same behavior in a centrally managed 1570 cluster. This is in a small network with not a lot of traffic. The system was perfectly fine in R80.20.50, took days to reach 85% RAM and didn't go above that unless it was stressed. The problem started right after upgrading to R81.10.X, reaching 95-100% a few minutes after reboot for no reason.

Marquevis · ‎2024-06-22

Hello,

Run the command below and see which processes are consuming the most memory.

ps -eo size,pid,user,command --sort -size | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' > /home/admin/memory .txt

scordy · ‎2024-06-23

Seeing exactly the same thing here on a fleet of 1570R's running R81.10.10. Output from your command above:

You are in expert mode now.

<Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'
0.00 Mb COMMAND
346.38 Mb fw sfwd
344.23 Mb fw sfwd
342.04 Mb fw sfwd
340.37 Mb fw sfwd
337.66 Mb fw sfwd
335.89 Mb fw sfwd
335.27 Mb fw sfwd
220.74 Mb fw sfwd
218.10 Mb fw sfwd
217.37 Mb fw sfwd
217.32 Mb fw sfwd
216.98 Mb fw sfwd
216.60 Mb fw sfwd
214.54 Mb fw sfwd
210.98 Mb fw sfwd
193.66 Mb fw sfwd
77.47 Mb cpviewd

And gets boring after that. So "fw sfwd" appears to be holding all the mem. Just logged a TAC call.

Amir_Ayalon · ‎2024-06-23

Hi All

updating you that after investigation this seems to be a management issue (not related to the new Spark firmware)

we are taking it with MT owners.

Thanks

andwen · ‎2024-06-24

Hi Amir,

We did not see this high memory with the R81.10.x-versions that weren't patched. This high memory-usage was first seen after we installed build 1750 and 2945, so I am curious to how this is deemed a management issue.

Other than that, I hope this can be resolved quickly 🙂

PhoneBoy · ‎2024-06-24

The underlying issue could be caused by something pushed from the management.
Unless these are units managed only through the local WebUI or Infinity Portal (not with Smart-1 Cloud or on-prem management)?

Amir_Ayalon · ‎2024-06-24

exactly,

This is one of the Agents that is being updated Automatically.

(hopefully tomorrow an update in this Agent will also be pushed Automatically, and the issue will be resolved)

Morten_O · ‎2024-06-24

That sounds great - can you elaborate what the problem is/was, and how to check if the relevant part has been updated tomorrow, or even how to force an update?

Kdeo · ‎2024-06-25

Hi @Amir_Ayalon .

Do you have any updates? Can we check if this issue was solved?

Thanks!

scordy · ‎2024-06-26

Hi @Amir_Ayalon,

2nded! Are there any updates with this? How do we confirm this has been resolved?

Thanks.

MikeB · ‎2024-06-27

Could this problem be solved? any update on this?

GuruHexa · ‎2024-06-28

Hello, I have the same problem here with a 1570 appliance R80.20 with central management.
Is there a fix or version that solves it?

# ps aux | grep "fw sfwd"
root 5461 0.0 6.7 403608 132564 ? S 12:56 0:00 fw sfwd
root 6013 2.5 8.0 403608 158556 ? Ssl Jun27 32:30 fw sfwd
root 8355 0.0 0.0 4452 812 pts/0 S+ 13:39 0:00 grep fw sfwd
root 11467 0.0 5.7 391764 112660 ? S 06:56 0:00 fw sfwd
root 14986 0.0 5.4 395164 106612 ? S Jun27 0:00 fw sfwd
root 29252 0.0 5.6 399132 111132 ? S 00:56 0:00 fw sfwd

Regards

andwen · ‎2024-06-28

Yes, there is a fix that needs to be implemented on the management-server. After that, you can perform a "killall fw" (via cprid_util or manually per gateway) to free the memory and this should remain stable. At least in our environment , that seems to be the case.

GuruHexa · ‎2024-06-28

Thank you very much!
Do you know which fix to implement on the management server?

Thanks

Pedro_Espindola · ‎2024-06-24

We had no issues with locally managed gateways. Only centrally managed were affected, seen in R81.10.08, even before the CVE patch.

RS_Daniel · ‎2024-06-24

Hello,

Yes, same problem here, we see these multiple SFWD instances on 1500/1600 and 1800 appliancess too, all of them centrally managed. Is there any other way to check if the update/fix was applied? or only with top/ps looking for SFWD?

Regards

belgur · ‎2024-06-28

Hi

i have the same problem. TAC report to me the problem is hcp script in management server.
Next week will be release a nev version as can be found in HealthCheck Point (HCP) Release Updates

Pedro_Espindola · ‎2024-06-28

Apparently there is a new version from June 25th, but no notes about it in the list of resolved issues yet.

lrossi89 · ‎2024-06-28

The TAC has provided us this HCP-1-592320.I386.RPM version and it seems to be Okay, now we await a few days

Pedro_Espindola · ‎2024-06-28

Thank you for the information! So it is not take 72 from June 25th, which is build 592042. Let's wait for the new one.

FerPr0c03 · ‎2024-07-01

Hi @lrossi89 , Can you provide us with the .rpm file?
I have an open case with TAC for the same problem, it is already escalated to T3, but I don't get a response from TAC regarding it.
This is very critical for us and we need to fix it as soon as possible.

Are you a member of CheckMates?

High memory usage on 1570/1590