Morten_O
Contributor

High memory usage on 1570/1590

Hi,

We have recently hardware-refreshed a lot of 1400 appliances, mainly to 1570 and 1590 models.

All are now running R81.10.10, build 996002945.

A few times, we have had reports that an appliance becomes unresponsive (not even answering ping, SSH or WebUI) and has to be power-cycled to start working again.

So I checked at multiple customers, and I can see that they are all running with very high memory utilization - above 80%.

All are centrally managed, and I have seen this at multiple customers - so with very different policies etc.

One of the customers is not even running IPS, which is known for intensive utilization (at least on the 1400 appliances).

Are others seeing the same? Could it be a memory leak, or...?

I already opened an SR (where the first recommendation was to upgrade...), but I was just interested in hearing whether I'm the only one seeing this picture.

1 Solution

Accepted Solutions
belgur
Explorer

Hi

I have the same problem. TAC reported to me that the problem is the HCP script on the management server.
A new version will be released next week, as can be found in the HealthCheck Point (HCP) Release Updates.

View solution in original post

29 Replies
Lesley
Advisor

High memory load can be normal on Linux-based systems. Maybe share a top output. You should focus on swap: high swap usage can be an indication of a highly loaded system.

Tasks: 280 total, 3 running, 175 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.1%us, 7.7%sy, 0.0%ni, 79.8%id, 0.0%wa, 0.2%hi, 1.2%si, 0.0%st
Mem: 8024384k total, 7229184k used, 795200k free, 208448k buffers
Swap: 0k total, 0k used, 0k free, 1125312k cached

Here is an example. As you can see, swap is 0k; memory itself looks loaded, but the system is fine, as indicated by the low swap usage.

Second, the way to check for a memory leak is to monitor memory with a monitoring tool. I use this to graph memory over a longer period of time. If you see a small increase every day, for example, that is an indication of a memory leak.
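If you don't have a monitoring tool available, a simple logging loop from Expert mode gives you the same trend data; a minimal sketch (the log path and the 5-minute interval are arbitrary choices, not Check Point specifics):

# append a timestamped memory snapshot every 5 minutes
while true; do echo "$(date '+%Y-%m-%d %H:%M:%S') $(free | grep Mem:)" >> /tmp/mem-trend.log; sleep 300; done &

A "used" value that climbs steadily across days without dropping back is the leak pattern to look for.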

Third, maybe pay some attention to the SFWD daemon. I have seen people with problems related to this daemon, where it restarts and causes issues.

cat $FWDIR/log/sfwd.elg

cat $FWDIR/log/sfwd.elg | grep SFWD

cat $FWDIR/log/sfwd.elg | grep -i '360 MB' -A 1

cpwd_admin list (check here if SFWD is restarting)
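The #START column in that output shows how many times the watchdog has started each daemon, so a value above 1 for SFWD means it has been restarted. A small sketch to pull out just the header and the SFWD line (assuming the standard column layout):

cpwd_admin list | awk 'NR==1 || /SFWD/'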

 

Fourth, maybe try to run Doctor Spark on a problematic gateway; maybe you can find something in this health check.

https://sc1.checkpoint.com/documents/SMB_R81.10.X/AdminGuides_Locally_Managed/EN/Content/Topics/DrSp...

-------
If you like this post please give a thumbs up(kudo)! 🙂
0 Kudos
Morten_O
Contributor

Hi,

I searched through sfwd.elg yesterday and didn't find anything of interest - no '360 MB' entries and no watchdog restarts.

The one I searched is the same appliance this output is from - a 1570, captured yesterday; this morning it had to be power-cycled because it was unresponsive. As you can see, it's not doing much from a connection perspective, but it sure uses some memory, and swap is completely full...

[Expert]#free
              total        used        free      shared  buff/cache   available
Mem:        1958900     1756676       72792       54644      129432       79332
Swap:        524284      524284           0

[Expert]# fw ctl pstat

System Capacity Summary:
Memory used: 39% (527 MB out of 1339 MB) - below watermark
Concurrent Connections: 1% (2476 out of 149900) - below watermark
Aggressive Aging is disabled

Hash kernel memory (hmem) statistics:
Total memory allocated: 343760896 bytes in 83926 (4096 bytes) blocks using 3 pools
Initial memory allocated: 276824064 bytes (Hash memory extended by 66936832 bytes)
Memory allocation limit: 629145600 bytes using 512 pools
Total memory bytes used: 0 unused: 343760896 (100.00%) peak: 316881572
Total memory blocks used: 0 unused: 83926 (100%) peak: 80954
Allocations: 3937903068 alloc, 0 failed alloc, 3936443903 free

System kernel memory (smem) statistics:
Total memory bytes used: 647431900 peak: 681652464
Total memory bytes wasted: 5449467
Blocking memory bytes used: 3048884 peak: 3275656
Non-Blocking memory bytes used: 644383016 peak: 678376808
Allocations: 8919614 alloc, 1 failed alloc, 8915235 free, 0 failed free
vmalloc bytes used: 640774312 expensive: no

Kernel memory (kmem) statistics:
Total memory bytes used: 473496748 peak: 604930896
Allocations: 3946820926 alloc, 1 failed alloc
3945359055 free, 0 failed free
External Allocations:
Packets: 286680, SXL: 3917423, Reorder: 0
Zeco: 0, SHMEM: 4320, Resctrl: 0
ADPDRV: 0, PPK_CI: 1272416, PPK_CORR: 0

0 Kudos
Lesley
Advisor

Is this output from BEFORE or AFTER the reboot? If the unit is unresponsive, I imagine you are not able to get this type of output at all. It is important to capture this information before the reboot if there are memory-related issues.

-------
If you like this post please give a thumbs up(kudo)! 🙂
0 Kudos
Morten_O
Contributor

The output is from yesterday, when the memory usage was high - I am using the info in the SR. The appliance was unresponsive this morning.

0 Kudos
Taki183
Explorer

We have a similar problem: model 1590, running R81.10.
It started a week ago. Memory usage is at 90-98%.

Output from "free" (values in KB):

              total        used        free      shared  buff/cache   available
Mem:        1816588     1671316       50120       61384       95152       33180
Swap:             0           0           0

Output from "top" sorted by Memory

  PID USER      PR  NI  VIRT   RES   SHR S %CPU %MEM     TIME+ COMMAND
20594 root      20   0  409m  201m   16m S    8 11.3  66:05.64 fw
 8331 root      20   0  343m  113m  1076 S    0  6.4   0:00.01 fw
24360 root      20   0  332m  101m  1096 S    0  5.7   0:00.01 fw
 7077 root      20   0  331m  101m  1096 S    0  5.7   0:00.00 fw
22315 root      20   0  330m  100m  1032 S    0  5.6   0:00.01 fw
 5150 root      20   0  329m   99m  1096 S    0  5.6   0:00.01 fw
20462 root      20   0  329m   98m  1096 S    0  5.5   0:00.01 fw
21845 root      20   0  312m   83m  1012 S    0  4.7   0:00.01 fw
26912 root      20   0  303m   74m  1040 S    0  4.2   0:00.01 fw
 2891 root      20   0  281m   52m  1012 S    0  3.0   0:00.02 fw
 8406 root      20   0  132m   23m  2036 S    0  1.3   2:09.24 cpview_api_serv
 4998 root      20   0 23948   21m  2228 S    0  1.2   0:00.07 newSfwsh.bin
 8410 root      20   0  144m   19m  2136 S    0  1.1   0:11.16 cpviewd
 3787 root       0 -20 45808   14m  3004 S    0  0.8  25:33.82 cposd
 5067 root      20   0 32064   12m  2420 S    0  0.7   0:00.08 runCliCommand.l

Logs from "sfwd.elg" - I can't tell whether there is anything suspicious here:

fwstatagent_check_sdwan: Getting sdwan interfaces
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
ringdir_append: ringdir-debug slot=0 dis_s=1053044 is bigger than max_slot_size=1048576
ringdir_append: ringdir-debug move to slot=1
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
ringdir_append: ringdir-debug slot=1 dis_s=1048755 is bigger than max_slot_size=1048576
ringdir_append: ringdir-debug move to slot=2
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
08:57:25.305790 Conversion start: probing_hb.conv
fwobj_get_converted_set_as_string: Conversion failed with error: /usr/local/share/lua/5.1/conversion/convert.lua:0: /usr/local/share/lua/5.1/conversion/convert.lua:0: conversion.convert.handle_match: probingServer in /opt/fw1/conf/probing_hb.conv
fwobj_get_converted_set_as_string: Failed to create result_string
08:57:25.516413 Conversion end : probing_hb.conv
fwobj_get_converted_set: Failed to create fwset instance
ProbingServerStatus::getProbingServerStatus: Error reading the probing servers file.
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces
ringdir_append: ringdir-debug slot=2 dis_s=1049785 is bigger than max_slot_size=1048576
ringdir_append: ringdir-debug move to slot=3
br_names: failed popen
fwstatagent_check_sdwan: Getting sdwan interfaces

Logs from "cpwd.elg"

[cpWatchDog 3175 4155615392]@RD6281[18 Jun 12:36:58] [ERROR] Process SFWD terminated abnormally : Unhandled signal 9 (SIGKILL).
[cpWatchDog 3175 4155615392]@RD6281[18 Jun 12:37:08] [SUCCESS] SFWD started successfully (pid=669)
[cpWatchDog 3175 4155615392]@RD6281[19 Jun 6:08:56] [ERROR] Process SFWD terminated abnormally : Unhandled signal 9 (SIGKILL).
[cpWatchDog 3175 4155615392]@RD6281[19 Jun 6:09:06] [SUCCESS] SFWD started successfully (pid=20594)

I suspect the sfwd process, but I can't figure out why it crashes. What exactly causes it to restart?
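One thing I plan to check: signal 9 means something external killed the process rather than sfwd crashing by itself, and with memory this tight the kernel OOM killer is a plausible suspect (just a guess on my part):

# look for kernel OOM-killer activity
dmesg | grep -i -E "out of memory|oom|killed process"

If the OOM killer is involved, there should be "Out of memory: Kill process" entries naming the victim.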

0 Kudos
andwen
Participant

We have seen this behavior as well, on models 1550 and 1590 (not on 1530, as far as we can see), with firmware versions R81.10.08 and R81.10.10 (both the latest builds with the CVE fix). If you check with ps aux | grep "[f]w sfwd", I am guessing you will see multiple instances of this process with only one actually running. The total amount of memory (shown in percentages) that these processes are using will probably be only slightly less than the total amount of memory in use.
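To put a number on it, you can sum the resident memory of those instances; a quick sketch using the RSS column of ps aux (field 6, in KB):

ps aux | grep "[f]w sfwd" | awk '{ rss += $6; n++ } END { printf("%d sfwd instances, %.1f MB RSS total\n", n, rss/1024) }'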

We have started a TAC-case for this issue and R&D is working on it.

0 Kudos
PhoneBoy
Admin

What does cpwd_admin list say?

0 Kudos
Taki183
Explorer

Nothing interesting, I guess.

APP        PID    STAT  #START  START_TIME            MON  COMMAND
RNGD       3196   E     1       [19:54:13] 25/5/2024  N    /pfrm2.0/bin/jitterentropy_rngd -v
SSHD       3703   E     1       [19:54:15] 25/5/2024  N    /pfrm2.0/bin/sshd -f /pfrm2.0/etc/sshd_config -p 22 -D
cposd      3787   E     1       [19:54:17] 25/5/2024  N    cposd
RTDB       3814   E     1       [19:54:17] 25/5/2024  N    rtdbd
SFWD       12320  E     1       [20:39:40] 21/6/2024  N    fw sfwd
CVIEWAPIS  8406   E     1       [19:54:35] 25/5/2024  N    cpview_api_service
CPVIEWD    8410   E     1       [19:54:35] 25/5/2024  N    cpviewd
0 Kudos
Pedro_Espindola
Advisor

We see the same behavior in a centrally managed 1570 cluster. This is a small network without a lot of traffic. The system was perfectly fine on R80.20.50: it took days to reach 85% RAM and didn't go above that unless it was stressed. The problem started right after upgrading to R81.10.x, with usage reaching 95-100% a few minutes after reboot for no apparent reason.

0 Kudos
Marquevis
Participant

Hello,

Run the command below and see which processes are consuming the most memory.

ps -eo size,pid,user,command --sort -size | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }' > /home/admin/memory.txt


0 Kudos
Scott_Cordy
Explorer

Seeing exactly the same thing here on a fleet of 1570R's running R81.10.10. Output from your command above:

0.00 Mb COMMAND
346.38 Mb fw sfwd
344.23 Mb fw sfwd
342.04 Mb fw sfwd
340.37 Mb fw sfwd
337.66 Mb fw sfwd
335.89 Mb fw sfwd
335.27 Mb fw sfwd
220.74 Mb fw sfwd
218.10 Mb fw sfwd
217.37 Mb fw sfwd
217.32 Mb fw sfwd
216.98 Mb fw sfwd
216.60 Mb fw sfwd
214.54 Mb fw sfwd
210.98 Mb fw sfwd
193.66 Mb fw sfwd
77.47 Mb cpviewd

And it gets boring after that. So "fw sfwd" appears to be holding all the memory. Just logged a TAC call.

0 Kudos
Amir_Ayalon
Employee

Hi All

Updating you that, after investigation, this seems to be a management-side issue (not related to the new Spark firmware).

We are taking it up with the MT owners.

Thanks

 

andwen
Participant

Hi Amir,

We did not see this high memory usage with the R81.10.x versions that weren't patched. It was first seen after we installed builds 1750 and 2945, so I am curious as to how this is deemed a management issue.

Other than that, I hope this can be resolved quickly 🙂

0 Kudos
PhoneBoy
Admin

The underlying issue could be caused by something pushed from the management.
Unless these are units managed only through the local WebUI or Infinity Portal (not with Smart-1 Cloud or on-prem management)?

0 Kudos
Amir_Ayalon
Employee

Exactly.

This is one of the agents that is updated automatically.

(Hopefully an update to this agent will also be pushed automatically tomorrow, and the issue will be resolved.)

 

0 Kudos
Morten_O
Contributor

That sounds great - can you elaborate on what the problem is/was, how to check whether the relevant part has been updated tomorrow, or even how to force an update?

0 Kudos
Kdeo
Explorer

Hi @Amir_Ayalon .

Do you have any updates? Can we check if this issue was solved?

Thanks!

0 Kudos
Scott_Cordy
Explorer

Hi @Amir_Ayalon,

Seconded! Are there any updates on this? How do we confirm this has been resolved?

Thanks.

0 Kudos
MikeB
Advisor

Has this problem been solved? Any update on this?

0 Kudos
GuruHexa
Explorer

Hello, I have the same problem here with a 1570 appliance on R80.20 with central management.
Is there a fix or a version that solves it?

# ps aux | grep "fw sfwd"
root 5461 0.0 6.7 403608 132564 ? S 12:56 0:00 fw sfwd
root 6013 2.5 8.0 403608 158556 ? Ssl Jun27 32:30 fw sfwd
root 8355 0.0 0.0 4452 812 pts/0 S+ 13:39 0:00 grep fw sfwd
root 11467 0.0 5.7 391764 112660 ? S 06:56 0:00 fw sfwd
root 14986 0.0 5.4 395164 106612 ? S Jun27 0:00 fw sfwd
root 29252 0.0 5.6 399132 111132 ? S 00:56 0:00 fw sfwd

Regards

0 Kudos
andwen
Participant

Yes, there is a fix that needs to be implemented on the management server. After that, you can perform a "killall fw" (via cprid_util or manually per gateway) to free the memory, and it should remain stable afterwards. At least in our environment, that seems to be the case - see the sketch below.
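For the cprid_util route, the usual pattern from the management server looks something like this (the gateway IP is a placeholder, and I am quoting the rexec syntax from memory, so verify it on your version first):

# run "killall fw" on a managed gateway from the management server
$CPDIR/bin/cprid_util -server <gateway IP> -verbose rexec -rcmd killall fw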

GuruHexa
Explorer

Thank you very much!
Do you know which fix to implement on the management server?

Thanks

0 Kudos
Pedro_Espindola
Advisor

We had no issues with locally managed gateways. Only centrally managed were affected, seen in R81.10.08, even before the CVE patch.

0 Kudos
RS_Daniel
Advisor

Hello,

Yes, same problem here. We see these multiple SFWD instances on 1500/1600 and 1800 appliances too, all of them centrally managed. Is there any other way to check whether the update/fix was applied, or only with top/ps looking for SFWD?

Regards

belgur
Explorer

Hi

I have the same problem. TAC reported to me that the problem is the HCP script on the management server.
A new version will be released next week, as can be found in the HealthCheck Point (HCP) Release Updates.

Pedro_Espindola
Advisor

Apparently there is a new version from June 25th, but no notes about it in the list of resolved issues yet.

0 Kudos
lrossi89
Contributor

TAC has provided us this HCP-1-592320.I386.RPM version, and it seems to be okay. Now we'll wait a few days.

Pedro_Espindola
Advisor

Thank you for the information! So it is not take 72 from June 25th, which is build 592042. Let's wait for the new one.

0 Kudos
Morten_O
Contributor

It sounds like the root cause has been found, even though I still haven't seen any of the 15xx's running with lower memory usage - but let's hope that happens during this week.

I have a question though. HCP is a passive tool, right, which has to be run manually? The Spark appliances don't even support HCP. So how can HCP on the management stations cause this issue?

0 Kudos
