Management server slowness in R80.10
I migrated an FWSM firewall to Check Point. The management server is on R80.10 and the gateways are running R77.30. The FWSM had a large number of rules and objects; post-migration we have 7000 rules and 5000 objects in the dashboard.
I keep running into a problem where the Java process hogs all the available CPU, and at that point I'm unable to do anything. The dashboard stops responding and closes, and when I try to reconnect I keep getting an Operation Timeout error. After some time (around 15 minutes) the Java process consumption eventually drops, and only then am I able to log back in.
We are in the process of cleaning up the rulebase, but we can't do that either because of this issue, and troubleshooting becomes a nightmare. The management server runs on a VM with 16 GB RAM and 16 CPU cores, yet the Java process consumption goes as high as 1500%. I tried to get assistance from TAC, but they took the easy way out by saying it's a problem with the number of rules and objects. Surely 16 GB RAM and 16 CPU cores should be able to handle this.
Any assistance to sort this out would be much appreciated. The installed Jumbo Hotfix take is 35.
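In case it helps with troubleshooting, a simple logging loop along these lines (plain Linux tooling, nothing Check Point specific; the log path is just an example) can be left running to capture what the Java process is doing at the moment the console freezes:

```
# Record the top CPU consumers once a minute so the freeze window gets captured
while true; do
    echo "==== $(date) ====" >> /var/tmp/cpu_watch.log
    top -b -n 1 | head -20 >> /var/tmp/cpu_watch.log
    sleep 60
done &
```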
Ok, problem solved for us, but the solution is probably useless to most of you reading, due to the origin of the issue and the definitely not-recommended deployment:
We were essentially testing the management server inside a GNS3 VM, which is itself virtualized via ESXi, so Gaia was running as a nested VM.
We have now created a dedicated VM on the same host and it works wonderfully (the installer even recognized the virtualization technology - VMware - right away, showing the VMware logo).
Thanks for all the pointers, I learned a lot in the process.
Hi,
I'm not sure whether I'm interpreting the ps values correctly, and whether there might be an issue with memory allocation.
This is an MDS; memory_allocation (4096m) is set.
We are experiencing heavy issues logging in (admins cannot log in) to the MDS or domains.
We have some Java processes which consume a lot of CPU.
Some CPUs are utilized up to 95% for a few seconds from time to time and afterwards drop back to 30-40%.
Could you please give me a hint about the memory utilization of the Java processes?
It seems the Virtual Set Size is at 5.8 GB and the Resident Set Size is at 1.5 GB.
As we have set memory_allocation to 4 GB (4096m), this should be OK, or have I misinterpreted something?
Is the memory_allocation shared between the Java processes?
What about the 32-bit Java process?
And what about all the other Java processes and their memory allocation?
[Expert@MDS-R80.10:0]# ps -aux --sort -c
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
admin 25868 154 1.1 5871712 1515484 ? Ssl 08:57 694:49 /opt/CPshrd-R80/jre_64/bin/java -D_CPM=TRUE -Xaot:forceaot -Xmx4096m -Xms192m -Xgcpol
admin 5719 152 0.8 8146888 1128252 ? Sl 08:59 681:48 /opt/CPshrd-R80/jre_64/bin/java -D_CPM_SOLR=TRUE -Xmx2048m -Xms64m -Xgcpolicy:optavgp
admin 26016 54.0 0.5 1631208 754920 ? Ssl 08:57 243:25 /opt/CPshrd-R80/jre_64/bin/java -D_smartview=TRUE -Xdump:directory=/var/log/dump/user
admin 14364 45.6 0.3 459236 408216 ? Ssl 09:04 202:32 /opt/CPshrd-R80/jre_32/bin/java -Xmx256m -Xms128m -Xshareclasses:none -Dfile.encoding
admin 25920 45.1 2.7 51262804 3668900 ? SNsl 08:57 203:20 /opt/CPshrd-R80/jre_64/bin/java -D_solr=TRUE -Xdump:directory=/var/log/dump/usermode
top - 17:12:12 up 18 days, 7:57, 1 user, load average: 13.07, 12.56, 13.46
Tasks: 610 total, 3 running, 605 sleeping, 0 stopped, 2 zombie
Cpu(s): 19.8%us, 1.3%sy, 1.5%ni, 76.7%id, 0.3%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131868088k total, 83197084k used, 48671004k free, 2050804k buffers
Swap: 67103496k total, 28332k used, 67075164k free, 44201004k cached
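As a side note on the warning at the top of that ps output: it comes from combining the BSD-style "aux" options with a leading dash. A form that avoids the warning, sorts by CPU, and also gives a rough total of the resident memory held by all Java processes combined would be something like this (standard procps syntax, nothing Check Point specific):

```
# BSD-style options without the leading dash avoid the "bogus '-'" warning;
# sort descending by CPU and show the top consumers
ps aux --sort=-%cpu | head -15

# Rough total resident memory (in MB) held by all Java processes combined
ps -C java -o rss= | awk '{sum += $1} END {printf "%.0f MB\n", sum / 1024}'
```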
High memory utilization by Java processes is not necessarily indicative of a problem. Based on your top output you have 128GB of RAM; 83GB shows as used, but roughly 45GB of that is disk buffering/caching, so only around 38GB is actually held by executing code. Swap utilization is negligible, which means all code is fully executing in RAM and not being slowed down by paging/swapping.
I'm not seeing a problem memory-wise. How many cores does this box have? During the login problems you could be running short of available CPU slices (less likely), or hitting some kind of heavy disk I/O contention (more likely). The latter would show up as high wa values in top while the issue is occurring.
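For example, something along these lines run while the issue is occurring would show whether a particular disk is the bottleneck (iostat comes with the sysstat package present on Gaia; adjust the interval and count to taste):

```
# Extended per-device statistics, one sample per second for 30 seconds;
# sustained high %util and await on the device holding the management database/logs points to I/O contention
iostat -x 1 30
```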
--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com
CET (Europe) Timezone Course Scheduled for July 1-2
Hello Timothy,
We have 24 CPUs, and from time to time some of them show wait states for just a second. I had to wait some 10-30 seconds to catch the following output; the next top refresh already shows 0.x% wait states for those CPUs.
top - 19:47:49 up 19 days, 10:33, 1 user, load average: 16.41, 18.07, 18.15
Tasks: 726 total, 10 running, 714 sleeping, 0 stopped, 2 zombie
Cpu0 : 51.1%us, 8.5%sy, 0.0%ni, 23.1%id, 11.1%wa, 0.0%hi, 6.2%si, 0.0%st
Cpu1 : 31.2%us, 5.5%sy, 1.0%ni, 62.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 29.2%us, 6.2%sy, 1.3%ni, 63.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 28.2%us, 5.5%sy, 5.8%ni, 60.2%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu4 : 51.1%us, 10.4%sy, 0.0%ni, 31.9%id, 6.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 81.6%us, 8.4%sy, 0.0%ni, 10.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 51.8%us, 4.9%sy, 0.3%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 29.1%us, 3.9%sy, 4.6%ni, 62.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 44.8%us, 3.6%sy, 0.3%ni, 50.6%id, 0.3%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu9 : 28.7%us, 3.6%sy, 0.7%ni, 66.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu10 : 32.0%us, 5.5%sy, 1.9%ni, 60.2%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu11 : 62.0%us, 6.8%sy, 2.3%ni, 28.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu12 : 42.7%us, 5.8%sy, 0.6%ni, 35.0%id, 15.5%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu13 : 43.4%us, 9.7%sy, 0.0%ni, 46.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 46.4%us, 5.5%sy, 0.6%ni, 45.8%id, 1.6%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 31.6%us, 5.2%sy, 4.9%ni, 58.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu16 : 36.9%us, 4.5%sy, 0.3%ni, 57.3%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 51.5%us, 27.2%sy, 0.3%ni, 20.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu18 : 43.3%us, 5.9%sy, 1.6%ni, 48.9%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 59.4%us, 5.2%sy, 1.3%ni, 33.1%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 71.8%us, 6.5%sy, 0.6%ni, 19.7%id, 1.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu21 : 32.8%us, 8.1%sy, 0.6%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 30.4%us, 3.6%sy, 0.3%ni, 38.2%id, 27.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 45.8%us, 13.3%sy, 0.3%ni, 40.3%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 131868088k total, 130985980k used, 882108k free, 3141268k buffers
Swap: 67103496k total, 28332k used, 67075164k free, 64537056k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25868 admin 18 0 5732m 1.4g 12m S 199 1.1 2407:30 java
5719 admin 21 0 7896m 1.3g 450m S 104 1.0 2058:32 java
4379 admin 25 0 1040m 707m 48m R 97 0.5 226:22.75 fwm
24215 admin 25 0 865m 533m 49m R 94 0.4 197:02.74 fwm
13797 cp_postg 18 0 1667m 1.6g 1.5g R 88 1.3 104:00.13 postgres
13776 cp_postg 18 0 1606m 1.5g 1.5g R 85 1.2 618:29.11 postgres
24246 admin 18 0 680m 357m 46m R 62 0.3 120:20.54 fwm
26016 admin 18 0 1585m 734m 13m S 35 0.6 1152:20 java
4595 admin 16 0 890m 566m 48m R 34 0.4 259:16.34 fwm
4717 admin 15 0 709m 395m 47m S 33 0.3 141:22.46 fwm
24299 admin 15 0 911m 595m 48m S 29 0.5 195:31.70 fwm
Here are some I/O statistics (taken every second):
(The first sample listed here is not the first output of the command, so you can trust the values 🙂)
avg-cpu: %user %nice %system %iowait %steal %idle
29.86 0.71 2.29 0.29 0.00 66.85
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 779.21 0.00 732.67 0.00 12095.05 16.51 0.28 0.38 0.07 4.95
sdb 0.00 0.00 0.00 891.09 0.00 23706.93 26.60 0.92 1.04 0.06 5.25
dm-0 0.00 0.00 0.00 0.00 0.00 12380.20 0.00 0.46 0.00 0.00 5.15
dm-1 0.00 0.00 0.00 2927.72 0.00 23421.78 8.00 5.34 1.83 0.02 5.54
avg-cpu: %user %nice %system %iowait %steal %idle
22.48 1.46 3.75 0.25 0.00 72.06
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 1.00 0.00 16.00 7368.00 7384.00 0.28 282.00 39.00 3.90
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
30.13 0.67 5.49 0.17 0.00 63.55
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 807.92 0.00 13306.93 16.47 0.68 0.85 0.05 3.76
dm-0 0.00 0.00 0.00 0.00 0.00 13306.93 0.00 1.16 0.00 0.00 3.76
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
29.67 2.46 3.87 0.33 0.00 63.67
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1455.45 0.00 636.63 0.00 16736.63 26.29 2.19 3.43 0.04 2.67
sdb 0.00 0.00 0.00 629.70 0.00 14217.82 22.58 0.95 1.50 0.08 4.95
dm-0 0.00 0.00 0.00 0.00 0.00 10336.63 0.00 0.60 0.00 0.00 4.75
dm-1 0.00 0.00 0.00 2577.23 0.00 20617.82 8.00 9.47 3.67 0.01 2.67
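Rather than trying to catch those one-second wait spikes on a live top refresh, it may be easier to log per-CPU statistics in batch mode and review them afterwards; a sketch using mpstat (also part of sysstat), with a hypothetical output path:

```
# One sample per second for five minutes, all CPUs, written to a timestamped file
mpstat -P ALL 1 300 > /var/tmp/mpstat_$(date +%Y%m%d_%H%M).log
```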
Now that we have been running R80.10 since April, I would say I would go back to R77 in a heartbeat if I could. R80.10 is the biggest hunk of junk Check Point might have ever put out. Performance has gotten worse and worse on our system despite working with R&D for months. Throughout the day folks can't even log into SmartConsole most of the time. Policy pushes, global assignments, etc. fail without error... they just fail.
I spent nearly 60 hours upgrading our Provider-1 environment to R80.10... twice... because they said it would take 10 hours to upgrade our primary the first time, and at 24 hours in I had to make the decision to back out once it finished, since there was not enough time left in our maintenance window. I won't get that time back, and the results make me hate the time I spent.
And we have to get everything upgraded to R80.10 by May??? I can't trust 80.x after this experience and will recommend to anyone to avoid it like the plague, as it's not ready.
Hi Ivan! Sorry to hear about your experience with R80.10; that does not sound good at all. I got curious, as for some of us it has been relatively "painless" with only minor bumps. Would you mind sharing rough numbers from your environment, so that others facing the upgrade-or-not question have better judgement?
I can happily provide ours:
| Parameter | Number |
| --- | --- |
| Domain count | <30 |
| Gateway count | <50 |
| Total rule count | <20000 |
| CPU cores (R80.10/R77.30) | 16/8 |
| RAM (GB, R80.10/R77.30) | 124/16 |
| Number of admins | <30 |
| Separate log server | yes |
| Managing VSX | yes |
I'm just guessing that yours is considerably bigger. Maarten Sjouw would have some comments here, I'm sure.
Hello Ivan,
regarding your login issues:
please make sure you have the latest SmartConsole R80.10 version installed.
This may help...
I would disagree; the last two releases (56 and 73) made it worse, for us and a few others that reported it here.
| Parameter | Number |
| --- | --- |
| Domain count | 42 |
| Gateway count | 284 |
| Total rule count | >40000 |
| CPU cores (R80.10/R77.30) | 16 |
| RAM (GB) | 197 |
| Number of admins | >30 |
| Separate log server | yes |
| Managing VSX | yes |
We also have our management servers in different regions: primary MDM in the US, secondary MDM in Germany, each with local MLMs (2 per region). We have a lot of admins configured, but at any point in time there are probably no more than 10 connected.
We use Tufin for policy management, so there are a ton of API calls... and they are pretty constant during the day.
But kick off a couple of policy installs and things crawl. We tried to create a new Domain and had issues with the CMA being created... then it got stuck and didn't clean up after itself properly. We didn't know this until we restarted the MDM a few days later and it wouldn't start. I only found the issue after running cpm doctor and seeing the errors that way. Months later, we are still waiting to hear how to fix some of the issues found in that report.
Ivan
I think you might have hit the nail on the head! We had a lot of slowness connected to Tufin activity (it took a while to work that out). So make sure you are running the latest take 142 and have the API server on a 64-bit JRE with extended memory instead of the default 256MB/32-bit setup (https://community.checkpoint.com/thread/9495-api-dying-on-mds-take-142-every-few-days). Or should I say, hopefully CP has already worked that out for you.
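If you want to verify on your own box which JRE each Java process is actually running on and how much heap it was started with, a generic check like this works (plain ps/awk, not an official Check Point procedure):

```
# List each Java process PID with its JRE path (jre_32 vs jre_64) and -Xmx heap flag
ps -e -o pid,args | grep '[j]ava' | \
    awk '{printf "%s: ", $1; for (i = 2; i <= NF; i++) if ($i ~ /jre_|Xmx/) printf "%s ", $i; print ""}'
```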
We are on take 121 right now and have been looking at 142. We will probably jump to that once our other fixes are ported.
And one more thing that you hopefully knew already: client-to-MDS latency makes a huge difference these days. We had admins in Brazil who were basically unable to use the primary MDS in Europe until we provided a virtual terminal server for them here in Europe. Latency was around 180ms; impossible to work with. That's the payback for having multiple concurrent admins in RW mode.
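For reference, a plain round-trip check from the admin workstation (or jump host) to the MDS gives a quick baseline before deciding whether a terminal server is needed (ordinary ping; the placeholder host must be replaced):

```
# 20 ICMP probes to the MDS; the avg/max round-trip times in the summary are what matters
ping -c 20 <mds-hostname-or-ip>
```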
I also know there are MDS HA performance issues, which I think may be something we are running into as well.
For R80.10:
| Parameter | Number |
| --- | --- |
| Domain count | 23 |
| Gateway count | 85 |
| Total rule count | <10000 |
| CPU cores R80.10 | 8 |
| RAM (GB) | 96 |
| Number of admins | >60 |
| Separate log server | no, max retention 28 days (scripted) |
| Managing VSX | yes |
For R77.30:
| Parameter | Number |
| --- | --- |
| Domain count | 130 |
| Gateway count | 505 |
| Total rule count | <20000 |
| Number of MDS | 3 |
| CPU cores per MDS | 8 |
| RAM (GB) | 96 |
| Number of admins | >80 |
| Separate log server | no, max retention 28 days (scripted) |
| Managing VSX | yes |
We are not using Tufin. We do use admin terminal servers for our internal users, which are close to the MDS in a network sense.
We have not heard of slowness problems from our customers with Read Only access.
The R80.10 MDS was built from scratch. Once R80.20 comes out we will wait a couple of weeks, then upgrade to R80.20 and start planning the migration of the R77.30 setup.
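For what it's worth, the scripted 28-day retention mentioned in the tables above needs nothing fancy; a minimal, hypothetical sketch of that kind of job (the log path is an assumption and must be adjusted to the real per-domain location; this is not the actual script in use):

```
#!/bin/bash
# Hypothetical example: purge rotated firewall logs and their pointer files older than 28 days.
# LOG_DIR is a placeholder; point it at the real log directory before use.
LOG_DIR=/var/log/opt/CPsuite-R80/fw1/log
find "$LOG_DIR" -maxdepth 1 -type f -mtime +28 \
    \( -name '*.log' -o -name '*.log*ptr' -o -name '*.adtlog*' \) -delete
```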
