Management server slowness in R80.10
I migrated an FWSM firewall to Check Point. The management server is on R80.10 and the gateways are running R77.30. The FWSM had a large number of rules and objects; post-migration we have 7000 rules and 5000 objects in the dashboard.
I keep running into a problem where the Java process hogs all the available CPU, and at that point I'm unable to do anything. The dashboard stops responding and closes, and when I try to reconnect I keep getting an Operation Timeout error. After some time (around 15 minutes) the Java process consumption eventually drops, and only then am I able to log back in.
We are in the process of cleaning up the rulebase, but we can't do that either because of this issue, and troubleshooting becomes a nightmare. The management server runs on a VM with 16 GB RAM and 16 CPU cores, yet the Java process consumption goes as high as 1500%. I tried to get assistance from TAC, but they took the easy way out by saying it's a problem with the number of rules and objects. Surely 16 GB RAM and 16 CPU cores should be able to handle this.
Any assistance to sort this out would be much appreciated. The installed Jumbo Hotfix take is 35.
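In case it helps with troubleshooting, a simple logging loop along these lines (plain Linux tooling, nothing Check Point specific; the log path is just an example) can be left running to capture what the Java process is doing at the moment the console freezes:

```
# Record the top CPU consumers once a minute so the freeze window gets captured
while true; do
    echo "==== $(date) ====" >> /var/tmp/cpu_watch.log
    top -b -n 1 | head -20 >> /var/tmp/cpu_watch.log
    sleep 60
done &
```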
Ok, problem solved for us, but the solution is probably useless to most of you reading, due to the origin of the issue and the definitely not-recommended deployment:
We were essentially testing the management server inside a GNS3 VM, which is itself virtualized via ESXi, so Gaia was running as a nested VM.
We have now created a dedicated VM on the same host and it works wonderfully (the installer even recognized the virtualization technology - VMware - right away, showing the VMware logo).
Thanks for all the pointers, I learned a lot in the process.
Hi,
I'm not sure whether I'm interpreting the ps values correctly, and whether there might be an issue with memory allocation.
This is an MDS; memory_allocation (4096m) is set.
We are experiencing heavy issues logging in (admins cannot log in) to the MDS or domains.
We have some Java processes which consume a lot of CPU.
Some CPUs are utilized up to 95% for a few seconds from time to time and afterwards drop back to 30-40%.
Could you please give me a hint about the memory utilization of the Java processes?
It seems the Virtual Set Size is at 5.8 GB and the Resident Set Size is at 1.5 GB.
As we have set memory_allocation to 4 GB (4096m), this should be OK, or have I misinterpreted something?
Is the memory_allocation shared between the Java processes?
What about the 32-bit Java process?
And what about all the other Java processes and their memory allocation?
[Expert@MDS-R80.10:0]# ps -aux --sort -c
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
admin 25868 154 1.1 5871712 1515484 ? Ssl 08:57 694:49 /opt/CPshrd-R80/jre_64/bin/java -D_CPM=TRUE -Xaot:forceaot -Xmx4096m -Xms192m -Xgcpol
admin 5719 152 0.8 8146888 1128252 ? Sl 08:59 681:48 /opt/CPshrd-R80/jre_64/bin/java -D_CPM_SOLR=TRUE -Xmx2048m -Xms64m -Xgcpolicy:optavgp
admin 26016 54.0 0.5 1631208 754920 ? Ssl 08:57 243:25 /opt/CPshrd-R80/jre_64/bin/java -D_smartview=TRUE -Xdump:directory=/var/log/dump/user
admin 14364 45.6 0.3 459236 408216 ? Ssl 09:04 202:32 /opt/CPshrd-R80/jre_32/bin/java -Xmx256m -Xms128m -Xshareclasses:none -Dfile.encoding
admin 25920 45.1 2.7 51262804 3668900 ? SNsl 08:57 203:20 /opt/CPshrd-R80/jre_64/bin/java -D_solr=TRUE -Xdump:directory=/var/log/dump/usermode
top - 17:12:12 up 18 days, 7:57, 1 user, load average: 13.07, 12.56, 13.46
Tasks: 610 total, 3 running, 605 sleeping, 0 stopped, 2 zombie
Cpu(s): 19.8%us, 1.3%sy, 1.5%ni, 76.7%id, 0.3%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131868088k total, 83197084k used, 48671004k free, 2050804k buffers
Swap: 67103496k total, 28332k used, 67075164k free, 44201004k cached
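As a side note on the warning at the top of that ps output: it comes from combining the BSD-style "aux" options with a leading dash. A form that avoids the warning, sorts by CPU, and also gives a rough total of the resident memory held by all Java processes combined would be something like this (standard procps syntax, nothing Check Point specific):

```
# BSD-style options without the leading dash avoid the "bogus '-'" warning;
# sort descending by CPU and show the top consumers
ps aux --sort=-%cpu | head -15

# Rough total resident memory (in MB) held by all Java processes combined
ps -C java -o rss= | awk '{sum += $1} END {printf "%.0f MB\n", sum / 1024}'
```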
High memory utilization by Java processes is not necessarily indicative of a problem. Based on your top output you have 128GB of RAM; 83GB shows as used, but roughly 45GB of that is disk buffering/caching, so only around 38GB is actually held by executing code. Swap utilization is negligible, which means all code is fully executing in RAM and not being slowed down by paging/swapping.
I'm not seeing a problem memory-wise. How many cores does this box have? During the login problems you could be running short of available CPU slices (less likely), or hitting some kind of heavy disk I/O contention (more likely). The latter would show up as high wa values in top while the issue is occurring.
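For example, something along these lines run while the issue is occurring would show whether a particular disk is the bottleneck (iostat comes with the sysstat package present on Gaia; adjust the interval and count to taste):

```
# Extended per-device statistics, one sample per second for 30 seconds;
# sustained high %util and await on the device holding the management database/logs points to I/O contention
iostat -x 1 30
```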
--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com
CET (Europe) Timezone Course Scheduled for July 1-2
Hello Timothy,
We have 24 CPUs, and from time to time some of them show wait states for just a second. I had to wait some 10-30 seconds to catch the following output; the next top refresh already shows 0.x% wait states for those CPUs.
top - 19:47:49 up 19 days, 10:33, 1 user, load average: 16.41, 18.07, 18.15
Tasks: 726 total, 10 running, 714 sleeping, 0 stopped, 2 zombie
Cpu0 : 51.1%us, 8.5%sy, 0.0%ni, 23.1%id, 11.1%wa, 0.0%hi, 6.2%si, 0.0%st
Cpu1 : 31.2%us, 5.5%sy, 1.0%ni, 62.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 29.2%us, 6.2%sy, 1.3%ni, 63.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 28.2%us, 5.5%sy, 5.8%ni, 60.2%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu4 : 51.1%us, 10.4%sy, 0.0%ni, 31.9%id, 6.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 81.6%us, 8.4%sy, 0.0%ni, 10.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 51.8%us, 4.9%sy, 0.3%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 29.1%us, 3.9%sy, 4.6%ni, 62.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 44.8%us, 3.6%sy, 0.3%ni, 50.6%id, 0.3%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu9 : 28.7%us, 3.6%sy, 0.7%ni, 66.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu10 : 32.0%us, 5.5%sy, 1.9%ni, 60.2%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu11 : 62.0%us, 6.8%sy, 2.3%ni, 28.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu12 : 42.7%us, 5.8%sy, 0.6%ni, 35.0%id, 15.5%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu13 : 43.4%us, 9.7%sy, 0.0%ni, 46.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 46.4%us, 5.5%sy, 0.6%ni, 45.8%id, 1.6%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 31.6%us, 5.2%sy, 4.9%ni, 58.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu16 : 36.9%us, 4.5%sy, 0.3%ni, 57.3%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 51.5%us, 27.2%sy, 0.3%ni, 20.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu18 : 43.3%us, 5.9%sy, 1.6%ni, 48.9%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 59.4%us, 5.2%sy, 1.3%ni, 33.1%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 71.8%us, 6.5%sy, 0.6%ni, 19.7%id, 1.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu21 : 32.8%us, 8.1%sy, 0.6%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 30.4%us, 3.6%sy, 0.3%ni, 38.2%id, 27.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 45.8%us, 13.3%sy, 0.3%ni, 40.3%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 131868088k total, 130985980k used, 882108k free, 3141268k buffers
Swap: 67103496k total, 28332k used, 67075164k free, 64537056k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25868 admin 18 0 5732m 1.4g 12m S 199 1.1 2407:30 java
5719 admin 21 0 7896m 1.3g 450m S 104 1.0 2058:32 java
4379 admin 25 0 1040m 707m 48m R 97 0.5 226:22.75 fwm
24215 admin 25 0 865m 533m 49m R 94 0.4 197:02.74 fwm
13797 cp_postg 18 0 1667m 1.6g 1.5g R 88 1.3 104:00.13 postgres
13776 cp_postg 18 0 1606m 1.5g 1.5g R 85 1.2 618:29.11 postgres
24246 admin 18 0 680m 357m 46m R 62 0.3 120:20.54 fwm
26016 admin 18 0 1585m 734m 13m S 35 0.6 1152:20 java
4595 admin 16 0 890m 566m 48m R 34 0.4 259:16.34 fwm
4717 admin 15 0 709m 395m 47m S 33 0.3 141:22.46 fwm
24299 admin 15 0 911m 595m 48m S 29 0.5 195:31.70 fwm
Here are some I/O statistics (taken every second):
(The first sample listed here is not the first output of the command, so you can trust the values 🙂)
avg-cpu: %user %nice %system %iowait %steal %idle
29.86 0.71 2.29 0.29 0.00 66.85
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 779.21 0.00 732.67 0.00 12095.05 16.51 0.28 0.38 0.07 4.95
sdb 0.00 0.00 0.00 891.09 0.00 23706.93 26.60 0.92 1.04 0.06 5.25
dm-0 0.00 0.00 0.00 0.00 0.00 12380.20 0.00 0.46 0.00 0.00 5.15
dm-1 0.00 0.00 0.00 2927.72 0.00 23421.78 8.00 5.34 1.83 0.02 5.54
avg-cpu: %user %nice %system %iowait %steal %idle
22.48 1.46 3.75 0.25 0.00 72.06
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 1.00 0.00 16.00 7368.00 7384.00 0.28 282.00 39.00 3.90
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
30.13 0.67 5.49 0.17 0.00 63.55
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 807.92 0.00 13306.93 16.47 0.68 0.85 0.05 3.76
dm-0 0.00 0.00 0.00 0.00 0.00 13306.93 0.00 1.16 0.00 0.00 3.76
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
29.67 2.46 3.87 0.33 0.00 63.67
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1455.45 0.00 636.63 0.00 16736.63 26.29 2.19 3.43 0.04 2.67
sdb 0.00 0.00 0.00 629.70 0.00 14217.82 22.58 0.95 1.50 0.08 4.95
dm-0 0.00 0.00 0.00 0.00 0.00 10336.63 0.00 0.60 0.00 0.00 4.75
dm-1 0.00 0.00 0.00 2577.23 0.00 20617.82 8.00 9.47 3.67 0.01 2.67
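Rather than trying to catch those one-second wait spikes on a live top refresh, it may be easier to log per-CPU statistics in batch mode and review them afterwards; a sketch using mpstat (also part of sysstat), with a hypothetical output path:

```
# One sample per second for five minutes, all CPUs, written to a timestamped file
mpstat -P ALL 1 300 > /var/tmp/mpstat_$(date +%Y%m%d_%H%M).log
```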
Now that we have been running R80.10 since April, I would say I would go back to R77 in a heartbeat if I could. R80.10 is the biggest hunk of junk Check Point might have ever put out. Performance has gotten worse and worse on our system despite working with R&D for months. Throughout the day folks can't even log into SmartConsole most of the time. Policy pushes, global assignments, etc. fail without error... they just fail.
I spent nearly 60 hours upgrading our Provider-1 environment to R80.10... twice... because they said it would take 10 hours to upgrade our primary the first time, and at 24 hours in I had to make the decision to back out once it finished, since there was not enough time left in our maintenance window. I won't get that time back, and the results make me hate the time I spent.
And we have to get everything upgraded to R80.10 by May??? I can't trust 80.x after this experience and will recommend to anyone to avoid it like the plague, as it's not ready.
Hi Ivan! Sorry to hear about your experience with R80.10; that does not sound good at all. I got curious, as for some of us it has been relatively "painless" with only minor bumps. Would you mind sharing rough numbers from your environment, so that others facing the upgrade-or-not question have better judgement?
I can happily provide ours:
| Parameter | Number |
| --- | --- |
| Domain count | <30 |
| Gateway count | <50 |
| Total rule count | <20000 |
| CPU cores (R80.10/R77.30) | 16/8 |
| RAM (GB, R80.10/R77.30) | 124/16 |
| Number of admins | <30 |
| Separate log server | yes |
| Managing VSX | yes |
I'm just guessing that yours is considerably bigger. Maarten Sjouw would have some comments here, I'm sure.
Hello Ivan,
regarding your login issues:
please make sure you have the latest SmartConsole R80.10 version installed.
This may help...
I would disagree; the last two releases (56 and 73) made it worse, for us and a few others that reported it here.
| Parameter | Number |
| --- | --- |
| Domain count | 42 |
| Gateway count | 284 |
| Total rule count | >40000 |
| CPU cores (R80.10/R77.30) | 16 |
| RAM (GB) | 197 |
| Number of admins | >30 |
| Separate log server | yes |
| Managing VSX | yes |
We also have our management servers in different regions: primary MDM in the US, secondary MDM in Germany, each with local MLMs (2 per region). We have a lot of admins configured, but at any point in time there are probably no more than 10 connected.
We use Tufin for policy management, so there are a ton of API calls... and they are pretty constant during the day.
But kick off a couple of policy installs and things crawl. We tried to create a new Domain and had issues with the CMA being created... then it got stuck and didn't clean up after itself properly. We didn't know this until we restarted the MDM a few days later and it wouldn't start. I only found the issue after running cpm doctor and seeing the errors that way. Months later, we are still waiting to hear how to fix some of the issues found in that report.
Ivan
I think you might have hit the nail on the head! We had a lot of slowness connected to Tufin activity (it took a while to work that out). So make sure you are running the latest take 142 and have the API server on a 64-bit JRE with extended memory instead of the default 256MB/32-bit setup (https://community.checkpoint.com/thread/9495-api-dying-on-mds-take-142-every-few-days). Or should I say, hopefully CP has already worked that out for you.
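If you want to verify on your own box which JRE each Java process is actually running on and how much heap it was started with, a generic check like this works (plain ps/awk, not an official Check Point procedure):

```
# List each Java process PID with its JRE path (jre_32 vs jre_64) and -Xmx heap flag
ps -e -o pid,args | grep '[j]ava' | \
    awk '{printf "%s: ", $1; for (i = 2; i <= NF; i++) if ($i ~ /jre_|Xmx/) printf "%s ", $i; print ""}'
```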
We are on take 121 right now and have been looking at 142. We will probably jump to that once our other fixes are ported.
And one more thing that you hopefully knew already: client-to-MDS latency makes a huge difference these days. We had admins in Brazil who were basically unable to use the primary MDS in Europe until we provided a virtual terminal server for them here in Europe. Latency was around 180ms; impossible to work with. That's the payback for having multiple concurrent admins in RW mode.
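For reference, a plain round-trip check from the admin workstation (or jump host) to the MDS gives a quick baseline before deciding whether a terminal server is needed (ordinary ping; the placeholder host must be replaced):

```
# 20 ICMP probes to the MDS; the avg/max round-trip times in the summary are what matters
ping -c 20 <mds-hostname-or-ip>
```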
I also know there are MDS HA performance issues, which I think may be something we are running into as well.
For R80.10:
| Parameter | Number |
| --- | --- |
| Domain count | 23 |
| Gateway count | 85 |
| Total rule count | <10000 |
| CPU cores R80.10 | 8 |
| RAM (GB) | 96 |
| Number of admins | >60 |
| Separate log server | no, max retention 28 days (scripted) |
| Managing VSX | yes |
For R77.30:
| Parameter | Number |
| --- | --- |
| Domain count | 130 |
| Gateway count | 505 |
| Total rule count | <20000 |
| Number of MDS | 3 |
| CPU cores per MDS | 8 |
| RAM (GB) | 96 |
| Number of admins | >80 |
| Separate log server | no, max retention 28 days (scripted) |
| Managing VSX | yes |
We are not using Tufin. We do use admin terminal servers for our internal users, which are close to the MDS in a network sense.
We have not heard of slowness problems from our customers with Read Only access.
The R80.10 MDS was built from scratch. Once R80.20 comes out we will wait a couple of weeks, then upgrade to R80.20 and start planning the migration of the R77.30 setup.
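For what it's worth, the scripted 28-day retention mentioned in the tables above needs nothing fancy; a minimal, hypothetical sketch of that kind of job (the log path is an assumption and must be adjusted to the real per-domain location; this is not the actual script in use):

```
#!/bin/bash
# Hypothetical example: purge rotated firewall logs and their pointer files older than 28 days.
# LOG_DIR is a placeholder; point it at the real log directory before use.
LOG_DIR=/var/log/opt/CPsuite-R80/fw1/log
find "$LOG_DIR" -maxdepth 1 -type f -mtime +28 \
    \( -name '*.log' -o -name '*.log*ptr' -o -name '*.adtlog*' \) -delete
```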
