Chammi_Kumarap1
Contributor

Management server slowness in R80.10

I migrated an FWSM firewall to Check Point. The management server is on R80.10 and the gateways are running R77.30. The FWSM had a large number of rules and objects; post-migration, we have 7000 rules and 5000 objects in the dashboard.

I keep running into a problem where the java process hogs all the available CPU, and at that point I'm unable to do anything. The dashboard stops responding and closes. When trying to reconnect, I keep getting an Operation Timeout error. After some time (around 15 minutes) the java process consumption eventually goes down, and only after that am I able to log in again.

We are in the process of cleaning up the rulebase, but can't do that either because of this issue, and troubleshooting becomes a nightmare. The management server runs on a VM with 16GB RAM and 16 CPU cores. The java process consumption goes as high as 1500%. I tried to get assistance from TAC, but they took the easy way out by saying it's a problem with the number of rules and objects. Surely 16GB RAM and 16 CPU cores should be able to handle this.
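In case it helps with the TAC case, a rough way to capture evidence of what the java process is doing during a freeze, using only standard expert-mode tools (the file name and sample counts below are just examples):

# Sample overall CPU/memory and the busiest processes every 10 seconds, 30 times,
# while the SmartConsole freeze is happening.
top -b -d 10 -n 30 >> /var/log/cpm_cpu_samples.txt

# Snapshot the java processes with their full command lines (heap flags included).
ps auxww | grep jre_ | grep -v grep >> /var/log/cpm_cpu_samples.txt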

Any assistance to sort this out would be much appreciated. The installed Jumbo Hotfix (JHF) take is 35.

43 Replies
Nico_V
Participant

OK, problem solved for us, but the solution is probably useless to most of you reading, due to the origin of the issue and the definitely not-recommended deployment: we were essentially testing the management server in the GNS3 VM, which is itself virtualized via ESXi, so Gaia was running as a nested VM.

We have now created a dedicated VM on the same host and it works wonderfully (the installer even recognized the virtualization technology, VMware, right away, showing the VMware logo).

Thanks for all the pointers, I learned a lot in the process.

Sebastian_Gxxx
Contributor

Hi,

I don't know whether I interpreted the ps values correctly, or whether there may be an issue with memory allocation.

This is an MDS; memory_allocation (4096m) is set.

We experience heavy issues logging in to the MDS or domains (admins cannot log in).

We have some java processes which consume a lot of CPU.

Some CPUs are utilized up to 95% for a few seconds from time to time and afterwards drop back to 30-40%.

Could you please give me a hint about the memory utilization of the java processes?

It seems the virtual set size (VSZ) is at 5.8 GB and the resident set size (RSS) is at 1.5 GB.

As we have set memory_allocation to 4 GB (4096m), this should be OK, or have I misinterpreted something?

Is the memory_allocation shared between the java processes?

What about the 32-bit java process?

What about all the other java processes and their memory allocation?

[Expert@MDS-R80.10:0]# ps -aux --sort -c
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
admin    25868  154  1.1 5871712 1515484 ?     Ssl  08:57 694:49 /opt/CPshrd-R80/jre_64/bin/java -D_CPM=TRUE -Xaot:forceaot -Xmx4096m -Xms192m -Xgcpol
admin     5719  152  0.8 8146888 1128252 ?     Sl   08:59 681:48 /opt/CPshrd-R80/jre_64/bin/java -D_CPM_SOLR=TRUE -Xmx2048m -Xms64m -Xgcpolicy:optavgp
admin    26016 54.0  0.5 1631208 754920 ?      Ssl  08:57 243:25 /opt/CPshrd-R80/jre_64/bin/java -D_smartview=TRUE -Xdump:directory=/var/log/dump/user
admin    14364 45.6  0.3 459236 408216 ?       Ssl  09:04 202:32 /opt/CPshrd-R80/jre_32/bin/java -Xmx256m -Xms128m -Xshareclasses:none -Dfile.encoding
admin    25920 45.1  2.7 51262804 3668900 ?    SNsl 08:57 203:20 /opt/CPshrd-R80/jre_64/bin/java -D_solr=TRUE -Xdump:directory=/var/log/dump/usermode

top - 17:12:12 up 18 days,  7:57,  1 user,  load average: 13.07, 12.56, 13.46
Tasks: 610 total,   3 running, 605 sleeping,   0 stopped,   2 zombie
Cpu(s): 19.8%us,  1.3%sy,  1.5%ni, 76.7%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  131868088k total, 83197084k used, 48671004k free,  2050804k buffers
Swap: 67103496k total,    28332k used, 67075164k free, 44201004k cached
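For reference, a rough way to compare each java process's configured heap ceiling (-Xmx) with what it actually holds resident (RSS), using plain ps/awk and nothing Check Point specific:

# List PID, resident memory in GB and the -Xmx flag of every java process,
# to see how close each one is to its configured heap limit.
ps -eo pid,rss,args | grep java | grep -v grep | \
  awk '{ printf "PID %s  RSS %.1f GB  ", $1, $2/1048576;
         for (i = 3; i <= NF; i++) if ($i ~ /^-Xmx/) printf "%s", $i;
         print "" }'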

Timothy_Hall
Champion

High memory utilization by java processes is not necessarily indicative of a problem.  Based on your top output you have 128GB of RAM, 83GB of which is being used for code execution and 45GB is used for disk buffering/caching.  Swap utilization is negligible which means that all code is fully executing in RAM and not being slowed down by paging/swapping. 

I'm not seeing a problem memory-wise; how many cores does this box have? During the login problems you could be running short of available CPU slices (less likely), or hitting some kind of heavy disk I/O contention (more likely). The latter would show up as high wa values in top while the issue is occurring.
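If it helps, one simple way to catch those wa spikes while a login is actually hanging is to leave vmstat (or mpstat, if sysstat is installed) logging to a file, for example:

# One sample per second for 10 minutes; the "wa" column is the percentage of CPU time
# spent waiting on disk I/O - sustained double-digit values point to I/O contention.
vmstat 1 600 > /var/log/vmstat_during_login_issue.txt

# Per-CPU breakdown every 5 seconds for 10 minutes (requires the sysstat package):
mpstat -P ALL 5 120 > /var/log/mpstat_during_login_issue.txt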

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Sebastian_Gxxx
Contributor

Hello Timothy,

We have 24 CPUs, and from time to time some of them show wait states for just a second. I had to wait some 10-30 seconds to get the following output; the next top refresh already shows 0.x% wait states for these CPUs.

top - 19:47:49 up 19 days, 10:33,  1 user,  load average: 16.41, 18.07, 18.15
Tasks: 726 total,  10 running, 714 sleeping,   0 stopped,   2 zombie
Cpu0  : 51.1%us,  8.5%sy,  0.0%ni, 23.1%id, 11.1%wa,  0.0%hi,  6.2%si,  0.0%st
Cpu1  : 31.2%us,  5.5%sy,  1.0%ni, 62.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 29.2%us,  6.2%sy,  1.3%ni, 63.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 28.2%us,  5.5%sy,  5.8%ni, 60.2%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 51.1%us, 10.4%sy,  0.0%ni, 31.9%id,  6.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 81.6%us,  8.4%sy,  0.0%ni, 10.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 51.8%us,  4.9%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 29.1%us,  3.9%sy,  4.6%ni, 62.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 44.8%us,  3.6%sy,  0.3%ni, 50.6%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu9  : 28.7%us,  3.6%sy,  0.7%ni, 66.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu10 : 32.0%us,  5.5%sy,  1.9%ni, 60.2%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu11 : 62.0%us,  6.8%sy,  2.3%ni, 28.6%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu12 : 42.7%us,  5.8%sy,  0.6%ni, 35.0%id, 15.5%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu13 : 43.4%us,  9.7%sy,  0.0%ni, 46.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 46.4%us,  5.5%sy,  0.6%ni, 45.8%id,  1.6%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 31.6%us,  5.2%sy,  4.9%ni, 58.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu16 : 36.9%us,  4.5%sy,  0.3%ni, 57.3%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 : 51.5%us, 27.2%sy,  0.3%ni, 20.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu18 : 43.3%us,  5.9%sy,  1.6%ni, 48.9%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 : 59.4%us,  5.2%sy,  1.3%ni, 33.1%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 : 71.8%us,  6.5%sy,  0.6%ni, 19.7%id,  1.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu21 : 32.8%us,  8.1%sy,  0.6%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 : 30.4%us,  3.6%sy,  0.3%ni, 38.2%id, 27.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 : 45.8%us, 13.3%sy,  0.3%ni, 40.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  131868088k total, 130985980k used,   882108k free,  3141268k buffers
Swap: 67103496k total,    28332k used, 67075164k free, 64537056k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25868 admin     18   0 5732m 1.4g  12m S  199  1.1   2407:30 java
 5719 admin     21   0 7896m 1.3g 450m S  104  1.0   2058:32 java
 4379 admin     25   0 1040m 707m  48m R   97  0.5 226:22.75 fwm
24215 admin     25   0  865m 533m  49m R   94  0.4 197:02.74 fwm
13797 cp_postg  18   0 1667m 1.6g 1.5g R   88  1.3 104:00.13 postgres
13776 cp_postg  18   0 1606m 1.5g 1.5g R   85  1.2 618:29.11 postgres
24246 admin     18   0  680m 357m  46m R   62  0.3 120:20.54 fwm
26016 admin     18   0 1585m 734m  13m S   35  0.6   1152:20 java
 4595 admin     16   0  890m 566m  48m R   34  0.4 259:16.34 fwm
 4717 admin     15   0  709m 395m  47m S   33  0.3 141:22.46 fwm
24299 admin     15   0  911m 595m  48m S   29  0.5 195:31.70 fwm

Here are some I/O statistics (taken every second):

(The first sample listed here is not the first output of the command, so the values can be trusted 🙂 )

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.86    0.71    2.29    0.29    0.00   66.85

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   779.21  0.00 732.67     0.00 12095.05    16.51     0.28    0.38   0.07   4.95
sdb               0.00     0.00  0.00 891.09     0.00 23706.93    26.60     0.92    1.04   0.06   5.25
dm-0              0.00     0.00  0.00  0.00     0.00 12380.20     0.00     0.46    0.00   0.00   5.15
dm-1              0.00     0.00  0.00 2927.72     0.00 23421.78     8.00     5.34    1.83   0.02   5.54

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.48    1.46    3.75    0.25    0.00   72.06

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  1.00  0.00    16.00  7368.00  7384.00     0.28  282.00  39.00   3.90
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          30.13    0.67    5.49    0.17    0.00   63.55

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00  0.00 807.92     0.00 13306.93    16.47     0.68    0.85   0.05   3.76
dm-0              0.00     0.00  0.00  0.00     0.00 13306.93     0.00     1.16    0.00   0.00   3.76
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.67    2.46    3.87    0.33    0.00   63.67

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  1455.45  0.00 636.63     0.00 16736.63    26.29     2.19    3.43   0.04   2.67
sdb               0.00     0.00  0.00 629.70     0.00 14217.82    22.58     0.95    1.50   0.08   4.95
dm-0              0.00     0.00  0.00  0.00     0.00 10336.63     0.00     0.60    0.00   0.00   4.75
dm-1              0.00     0.00  0.00 2577.23     0.00 20617.82     8.00     9.47    3.67   0.01   2.67
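Side note: since the waits come in short bursts, it may be worth timestamping each iostat sample so it can be lined up with the login failures afterwards. A simple way, assuming the bundled iostat supports the -t timestamp flag:

# -x: extended per-device statistics, -t: print a timestamp with every report,
# one sample per second, written to a file for later correlation with login hangs.
iostat -x -t 1 > /var/log/iostat_timestamped.txt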

Ivan_Moore
Contributor

Now that we have been running R80.10 since April, I would say I'd go back to R77 in a heartbeat if I could. R80.10 is the biggest hunk of junk Check Point might have ever put out. Performance has gotten worse and worse on our system despite working with R&D for months. Throughout the day folks can't even log into SmartConsole most of the time. Policy pushes, global assignments, etc. fail without error... they just fail.

I spent nearly 60 hours upgrading our Provider-1 environment to R80.10... twice... because they said it would take 10 hours to upgrade our primary the first time, and at 24 hours in I had to make the decision to back out once it finished, due to not having enough time in our maintenance window. I won't get that time back, and the results make me hate the time I spent.

And we have to get everything upgraded to R80.10 by May??? I can't trust R80.x after this experience and will recommend to anyone to avoid it like the plague, as it's not ready.

Kaspars_Zibarts
Employee

Hi Ivan! Sorry to hear about your experience with R80.10; that does not sound good at all. I got curious, as for some of us it's been relatively "painless", with only minor bumps. Would you mind sharing rough numbers from your environment, so that others facing the upgrade-or-not question have better judgement?

I can happily provide ours:

Parameter                      Number
Domain count                   <30
Gateway count                  <50
Total rule count               <20000
CPU cores (R80.10/R77.30)      16/8
RAM GB                         124/16
Number of admins               <30
Separate log server            yes
Managing VSX                   yes

I'm just guessing that yours is considerably bigger. Maarten Sjouw would have some comments here, I'm sure 🙂

Sebastian_Gxxx
Contributor

Hello Ivan,

regarding your login issues:

please make sure you have the latest SmartConsole R80.10 version installed.

This may help...

Kaspars_Zibarts
Employee

I would disagree; the last two releases (56 and 73) made it worse, for us and a few others that reported it here.

Ivan_Moore
Contributor

Parameter                      Number
Domain count                   42
Gateway count                  284
Total rule count               >40000
CPU cores (R80.10/R77.30)      16
RAM GB                         197G
Number of admins               >30
Separate log server            yes
Managing VSX                   yes

We also have our management servers in different regions: primary MDM in the US, secondary MDM in Germany. Each has local MLMs (2 per region). We have a lot of admins configured, but at any point in time there are probably no more than 10 connected.

We use Tufin for policy management, so there are a ton of API calls, pretty much constant during the day.
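(For anyone else chasing this: a quick, rough way to see whether the API side is the slow part is to time a trivial call from the MDS itself with mgmt_cli; the domain name below is just a placeholder. Numbers in the tens of seconds during the slow periods would point at the management/API side rather than the network.)

# Time a trivial management API call locally on the MDS.
# "-r true" logs in as root, so no username/password is needed;
# on an MDS you normally also have to pick a domain with -d.
time mgmt_cli -r true -d "MyDomain" show hosts limit 1 --format json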

But kick off a couple of policy installs and things crawl. We tried to create a new Domain and had issues with the CMA being created... then it got stuck and didn't clean up after itself properly. We didn't know this until we restarted the MDM a few days later and it wouldn't start. I only found the issue after running cpm doctor and seeing the errors that way. Months later, I'm still waiting on how to fix some of the issues found in that report.

Ivan

Kaspars_Zibarts
Employee

I think you might have hit the nail on the head! We had a lot of slowness connected to Tufin activity (it took a while to work that out). So make sure that you are running the latest take 142 and have the 64-bit JRE with extended memory instead of the default 256MB/32-bit setup (https://community.checkpoint.com/thread/9495-api-dying-on-mds-take-142-every-few-days ). Or should I say: hopefully CP has already worked that out for you 🙂
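(A quick way to check which setup you currently have: the API java process shows up in ps with its JRE path and heap flag, as in the ps output earlier in this thread, so something like the following gives a first indication.)

# If the API java process still runs from jre_32 with -Xmx256m, it is the default
# 32-bit/256MB setup; after the change it should appear under jre_64 with a larger -Xmx.
ps auxww | grep "jre_32" | grep -v grep
ps auxww | grep "jre_64" | grep -v grep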

Ivan_Moore
Contributor

We are on 121 right now. Been looking at 142; we will probably jump to that once our other fixes are ported.

Kaspars_Zibarts
Employee

And one more thing that you hopefully knew already: client-to-MDS latency makes a huge difference these days. We had admins in Brazil who basically were not able to use the primary MDS in Europe until we provided a virtual terminal server for them here in Europe. Latency was around 180ms; impossible to work with. Payback for having multiple concurrent admins in read-write mode 🙁
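(For reference, a plain round-trip measurement from the admin workstation to the MDS gives a first indication of whether latency is in that problem range; the address below is a placeholder.)

# From a Windows admin workstation, 20 samples to the MDS address:
ping -n 20 203.0.113.10

# From a Linux/Gaia box near the admins:
ping -c 20 203.0.113.10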

Ivan_Moore
Contributor

I also know there are MDS HA performance issues which I think may also be something we are running into.  

Maarten_Sjouw
Champion

For R80.10:

Parameter                      Number
Domain count                   23
Gateway count                  85
Total rule count               <10000
CPU cores R80.10               8
RAM GB                         96G
Number of admins               >60
Separate log server            no, max retention 28 days (scripted)
Managing VSX                   yes

For R77.30:

Parameter                      Number
Domain count                   130
Gateway count                  505
Total rule count               <20000
Number of MDS                  3
CPU cores per MDS              8
RAM GB                         96G
Number of admins               >80
Separate log server            no, max retention 28 days (scripted)
Managing VSX                   yes

We are not using Tufin. We do use admin terminal servers for our internal users, which are close to the MDS in a network sense.

We have not heard of any slowness problems from our customers with read-only access.

The R80.10 MDS was built from scratch. As soon as R80.20 comes out we will wait a couple of weeks, then upgrade it to R80.20 and start planning the migration of the R77.30 setup.

Regards, Maarten
