We are having performance issues with our primary MDS running R80.20. Like most people who upgraded from R77.30, we knew that R80.20 was more resource intensive. Since we run our MDS in VMware, we purchased very large servers for this: multiple Cisco UCS servers, each with 96 cores, 1.5 TB RAM, and a lot of SSD drives. We then installed VMware and built only one guest for now, the MDS, giving each MDS VM 48 cores, 768 GB RAM, and 8 TB of storage. All three of our virtual MDS servers are built to these specs on different UCS servers in different data centers, synced. And we just added MLM servers to offload logging.
We expected this to be super fast; these VM servers double the specs of Check Point's largest platform, the 5150. We only have 61 domains on the primary MDS and about 130 firewalls pointed to it. Every time we turned on Firemon it would bring the MDS down and all the consoles would crash. Check Point support said we had too much logging and this was using up too much CPU, so over the past three weeks we installed MLMs and offloaded the logging. Logging now comes up much quicker in SmartConsole, and we noticed the load value dropped from around 20 to about 8. Even with the drop in load it is not significantly faster: consoles still take a while to load and view policy, reassigning the global policy can take more than an hour while every console gets extremely slow, and it gets worse if we turn on our tools.
Whenever we turn on our Firemon collector it drives the load value from 8 to around 30. At that point all of our consoles start dropping out, and we then have to shut off Firemon and restart the MDS because the solr process is locked. So we can no longer use Firemon.
We verified the tuning parameters from Check Point's VM tuning guide.
The profile that the server is choosing is:
CHOSEN_CPSETUP_PROFILE="131072 or larger without SME"
The Security Management – Performance Tuning guide mentions these values. Ours are set pretty low, and we have far more resources we could allocate to these settings.
Does anyone know what these are set to in a 5150 with 256 GB RAM? I was thinking about doubling or quadrupling these memory values. Maybe our server resources are not being detected and allocated properly.
I am hoping that someone else out there may have some insight or has gone through this. Any help is greatly appreciated.
First off, can we assume this was a fresh-loaded R80.20 SMS with the configuration imported? I'm asking because if it was an in-place upgrade you'll still be using the older ext3 file system. I assume running mount shows that the filesystem type in use for all partitions is XFS.
It looks like your MDS has properly selected the highest/largest possible performance profile, which sets a variety of settings such as Java heap sizes. Increasing the Java heap sizes manually should not be attempted without consulting TAC. However, you can get an idea of whether upping the Java heap sizes will help by running the top command and hitting SHIFT-H to view threads. Start your Firemon collectors (or whatever is causing the slowdowns) and watch the top output; if the GC (Garbage Collection) Slave thread is consuming a ridiculous amount of CPU, upping the Java heap sizes will definitely help. You can read more about this technique here: sk123417: High CPU utilization for "Java" process in R80.x Management server. Based on your report of the SmartConsole disconnections, the main cpm process is probably the one having problems.
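For reference, a non-interactive way to capture that same per-thread view is a sketch like this (the exact GC thread name varies by JVM version, so the grep pattern is deliberately loose):

```shell
# Batch-mode snapshot of per-thread CPU usage (same data as top + SHIFT-H)
top -H -b -n 1 | head -40

# Loosely match garbage-collection threads; "GC Slave" naming is
# version-dependent, so print a message if nothing matches
top -H -b -n 1 | grep -iE 'gc (slave|thread|task)' || echo "no GC threads in this sample"
```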
Next step is to see if the bottleneck is your disk path: are you seeing a high wio percentage reported by top during a slow period? If so, processes are getting constantly blocked waiting for the disk path. While you are looking at top, any chance the reported steal (st) percentage is nonzero? If it is, that means the MDS wants to access a CPU but the hypervisor is denying the request, which will obviously hinder performance.
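To put numbers on both of those during a slow period, generic Linux commands are enough (nothing Check Point specific here):

```shell
# Sample CPU state three times, five seconds apart; the "wa" column is
# I/O wait and "st" is steal time withheld by the hypervisor
vmstat 5 3

# One-shot view of the same counters from top's summary header
top -b -n 1 | grep -i 'cpu(s)'
```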
Log indexing is running with minimum disk and CPU priority (nice) but can still cause some contention on a heavily saturated disk path. Might be interesting to temporarily stop all MLMs and/or block the reception (and therefore indexing) of incoming firewall logs for a brief test period on the MDS, and see if MDS management performance/stability shoots up even when Firemon is active.
Normally the third element to look at is memory, but it sounds like you have more than enough of that. 🙂
One other thing to check is frame loss at the NIC level (netstat -ni), such as RX-DRPs, which can easily happen in a VM environment due to cycle stealing. If substantial frame loss is present it will cause lots of retransmissions and increase the load on the MDS that much further. Obviously a 10 Gbit interface, or bonding several 1 Gbps interfaces, would help in this case.
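A quick way to eyeball those counters (ip -s link reports the same drop/error statistics on any modern Linux if net-tools is not available):

```shell
# Per-interface counters; RX-DRP and RX-ERR should stay at or near zero
netstat -ni

# Same information from iproute2
ip -s link
```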
Thank you for the quick reply.
Yes, this was a clean install of R80.20 with an import. We had Check Point onsite and on the phone with us as we did the migration from R77.30 to R80.20.
As for the output from top: we have always seen 0.0 wa and 0.0 st. We have 16 SSD drives and have never seen the I/O wait states go up; they have always been zero.
We are running 2x 10G interfaces and the "netstat -ni" is clean.
I will have to turn Firemon back on and monitor the Java heap.
Thank you for the info
No HA at this time. We removed all the HA domains before the migration so that it would simplify the migration to R80.20. At some point we do intend to put them back.
MDS#1 has 61 domains, MDS#2 has 9 domains, and MDS#3 has 16 domains. All of them are synced and using global policies. The reason we have three is that we are in the process of moving to a new IP space: we built MDS#3 and are migrating all the domains from MDS#2 to it. Once that is done we will be back to just two MDS servers, and then we will start making the domains HA again.
Right now only MDS#1, with the 61 domains, is the slow one. As soon as we enable the Firemon collector in this environment, it basically DDoSes MDS#1: it starts collecting data from all 61 domains, we see the load increase, and then the MDS becomes unusable. The other Firemon collector gathers data from MDS#2 and MDS#3 and is fine. We think this is because that load is split between two MDS servers, whereas MDS#1 has 61 domains on its own.
The file system is XFS.
OK thanks for the updates, sounds like the Gaia system/network/virtualization environment has plenty of resources and is operating well; always good to establish that before digging deeper.
When you start Firemon on MDS #1 and it causes the SmartConsole instances to timeout and disconnect, do SmartConsole instances connected to the MDS/Global domains, AND SmartConsole instances connected directly to a Domain Management Server (CMA) all fail at the same time? In other words, it doesn't matter where the SmartConsole is connected to that MDS they all fail at once. This is an important clue.
Assuming the answer is yes, you need to be looking at the cpm process and its link to the postgres database. There is only one cpm process (which is highly multithreaded) on an MDS, and it handles connections for all the SmartConsoles (and a LOT of other things). One thing that Firemon will probably do is create a crapload of database calls through cpm. I'm wondering if the size of the configuration combined with Firemon's call rate is overloading the connections between cpm and postgres: think request queues/buffers overflowing, not enough free file descriptors, that kind of thing. cpm maintains connections to multiple postgres processes (3 concurrent ones on an SMS; use ps -efw | grep "postgres cpm" to check). I'm not sure if more of these connections can be added, kind of like adding more workers on an Apache web server.
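A quick sketch of that check (the [p] in the pattern keeps grep from matching its own process entry):

```shell
# Count the postgres workers serving cpm; the discussion above suggests
# an SMS runs about 3 of these
ps -efw | grep "[p]ostgres cpm" | wc -l

# List them with CPU usage and age to spot a hot worker; matching on the
# process title this way is an assumption about how ps renders it
ps -eo pid,pcpu,etime,args | grep "[p]ostgres cpm" || echo "no postgres cpm processes found"
```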
First off, take a look at /var/log/dump/usermode. Any core dumps in there, especially for cpm or postgres?
Next take a look at the following log files, any interesting errors during the slow periods involving resource shortages?
$FWDIR/log/cpm.elg (Java heap usage stats are also reported in the file)
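Something like this covers both checks in one pass ($FWDIR is set by the Check Point shell environment; the pattern for "interesting" errors is my guess, so adjust it as needed):

```shell
# Recent core dumps, if any, newest first
ls -lt /var/log/dump/usermode 2>/dev/null | head

# Recent errors and heap statistics from the cpm log
grep -iE 'error|exception|heap|out of memory' "$FWDIR/log/cpm.elg" 2>/dev/null | tail -20
```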
Finally if you run top during the slowdown periods, are cpm and the "postgres cpm" processes using high amounts of CPU around 100% (or beyond)? Or do they look somewhat idle? The latter will tend to occur when a resource limit is being hit.
You hit the nail on the head with the first question. MDS#1 is in our primary data center and is the one that every console connects to, including for the global domain. MDS#2 and MDS#3 are in a different data center.
When Firemon starts we lose all consoles connected to the CMAs as well as anyone connected to the global domain. No new console can connect, and if we close our consoles we can't reconnect. Everyone gets disconnected.
When we run top while Firemon is running, we see several "postgres cpm" processes and they go to 100%. They are constantly popping in and out at 100%.
Hmm, interesting. It could still be that the heap sizes just need to be increased for various processes, including cpm (which would be indicated by very busy GC Slave threads in top), and that would probably be your first course of action in conjunction with TAC.
If you look in the $CPDIR/conf/CpSetupInfo_resourceProfiles.conf file under the "131072 or larger without SME" profile you are using, the following postgres-related variables caught my attention:
NGM_CPM_POSTGRES_SHARED_BUFFERS = 500m-1536m
NGM_CPM_POSTGRES_MAINTENANCE_WORK_MEM = 256m
NGM_CPM_POSTGRES_WORK_MEM = 128m
NGM_CPM_POSTGRES_TEMP_BUFFERS = 128m
NGM_CPM_POSTGRES_EFFECTIVE_CACHE_SIZE = 10% of total RAM, minimum 1m
NGM_CPM_POSTGRES_MAX_CONNECTIONS = 200
NGM_CPM_POSTGRES_CHECKPOINT_SEGMENTS (16) [Looks like the number of postgres cpm "workers"]
Notice that only the variable NGM_CPM_POSTGRES_EFFECTIVE_CACHE_SIZE truly scales as a percentage of the total RAM on the system (which is quite high in your case), while all the rest are fixed values/ranges for that auto-selected resource profile. Please note that you should never manually modify these variables or heap sizes without working with TAC: some of these values need to be increased in tandem, or as specific multiples of each other, and failing to do so properly can make things worse (maybe even MUCH worse).
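If you want to see everything the active profile sets, not just the postgres entries, you can pull it straight out of that file. A sketch (it assumes the profile's settings sit within the next ~60 lines after the profile name):

```shell
# Show the postgres-related settings of the auto-selected profile
grep -A 60 '131072 or larger without SME' \
    "$CPDIR/conf/CpSetupInfo_resourceProfiles.conf" | grep NGM_CPM_POSTGRES
```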
These variables may be good discussion-starters with TAC for your specific case. Also see my post in this thread laying out how resource profiles are selected on an SMS/MDS: https://community.checkpoint.com/t5/Multi-Domain-Management/Management-server-slowness-in-R80-10/m-p...
We do have a TAC case open and it is with their escalations department. We requested that these values be reviewed and are now waiting on a response to see which values we can change. I am hoping that the escalation team and R&D had a chance to review this morning. We have enough resources that we could multiply all of these by a factor of 6 and keep the ratios, or at least double each of them as a test and see what happens. I won't do anything until TAC responds.
As soon as we hear back from TAC, I will post an update.
Thank you for the insight and the information. I am going to read the document you linked as soon as I hit submit on this post.
I have been working with the TAC/CFG team and it looks like the problem was with "NGM_QUERY_LIVENESS_OPTIMIZER". This is what updates the SmartConsole screens and does the realtime refresh. Basically, Firemon was opening many sessions, each session would spawn the "liveness optimizer", and this would cause the CPU to max out. Once we turned this feature off in "/opt/CPshrd-R80.20/tmp/.CPprofile.sh" and reloaded the environment variables, that did the trick.
Add this line to the end of the file:
NGM_QUERY_LIVENESS_OPTIMIZER=1; export NGM_QUERY_LIVENESS_OPTIMIZER;
Reload the environment variables:
$MDS_FWDIR/scripts/reload_env_vars.sh -e "NGM_QUERY_LIVENESS_OPTIMIZER=1"
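To sanity-check that the variable actually landed, a couple of generic shell checks (the file path is the one from the post above):

```shell
# Confirm the line made it into the profile script
grep NGM_QUERY_LIVENESS_OPTIMIZER /opt/CPshrd-R80.20/tmp/.CPprofile.sh

# Confirm it is visible in the environment after re-sourcing / re-login
env | grep NGM_QUERY_LIVENESS_OPTIMIZER || echo "not set in this shell"
```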
Also, we increased these values since our server has a lot of resources.
NGM_CPM_MAX_HEAP = 65536m (currently 32768m)
NGM_CPM_SOLR_XMX = 32768m (currently 8192m)
RFL_RFL_MAX_HEAP = 2048m (currently 1024m)
SMARTVIEW_MAX_HEAP = 4096m (currently 2048m)
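One way to verify that the running Java processes actually picked up the new heap limits is to pull the -Xmx flags out of their command lines (a generic Linux check; process names and flag placement vary by version):

```shell
# Extract the -Xmx setting from every running Java process
ps -efw | grep "[j]ava" | grep -o '\-Xmx[0-9]*[mg]' || echo "no Java processes found"
```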
At this point the MDS is much better: it is responsive and the consoles are no longer disconnecting.
Thank you for the help.
Great, thanks for the update. So it sounds like Firemon is actually simulating a SmartConsole connection on port 19009 and not going in through the web-based API interface which surprises me a little bit.
You said that you disabled the "liveness optimizer" but you set the variable in the script to 1 (usually means enabled) instead of 0 (usually means disabled)? Seems a little backwards, do you mean you enabled the liveness optimizer?
It does seem backward, but this setting had to be set to "1" to disable it. After this was done, you could see each query come in, and the optimizer then displayed as disabled.
I need to correct the previous statement. This did not actually disable the feature.
"There is an internal live query mechanism happening on the management server which we have set it query at an optimized rate by setting NGM_QUERY_LIVENESS_OPTIMIZER to 1"
Also...apparently this is a new feature that was added in a previous HFA.
It has been a week and FireMon is working now. However, we have had some annoying issues since the "liveness optimizer" has been enabled. Now the SmartConsole no longer updates status for events. We were aware that this was going to happen when we turned it off, but the issue is with the Monitor in the console: whenever you select the cluster object and go to Monitor, the status will not update unless you exit the screen and come back. This makes it more difficult when you are performing an upgrade, restarting a service, or just doing something to the gateway and need to know if everything is good. So we would like this "liveness" feature back to the way it was. I don't understand why Check Point can't detect the difference between a console and a third party or plain API call; it would seem that Check Point should keep the "liveness" for consoles only. So this brings me to my question...
Has anyone else run into this issue, and if so, have you found another way to resolve it?
I think this stems from the fact that FireMon is simulating a SmartConsole on port 19009 instead of going in through the web API which as I mentioned earlier surprised me a bit. Ideally there would be some kind of flag that the SmartConsole/FireMon could set to indicate on port 19009 if it is a "live" connection or not, as opposed to API calls which would always be treated as "not live".
In the FireMon product is there any way to cap or limit the number of calls per some period of time? I don't think the SMS/MDS itself has this capability.
How many individual sessions is the FireMon product racking up under Manage & Settings...Sessions? It could be constantly performing publish operations, which have to be propagated to all live SmartConsole clients; if that is happening constantly it could really slow everyone down. It might be interesting to give the FireMon product SmartConsole credentials that are read-only (auditor) instead of full read/write (thus preventing any publish operations) and turn the liveness optimizer back off. The FireMon client would probably complain, but it would be an interesting experiment.
We confirmed that FireMon is using port 443 for the API calls, not 19009. So it appears that CP is spawning the liveness mechanism each time it sees a login session, without checking whether it is an actual console or not.
OK that makes a lot more sense now, thanks for the update...
Hi all ,
I'm Ran and I'm a manager in the R&D of Check Point, responsible for I/S in the Management Server.
I would like to clarify the live query issue:
We have an internal “live query” mechanism that makes sure that some views and objects will be updated without the need to refresh them in case a relevant change was done in another session.
For example, if admin1 views the ‘sessions view’ and another admin logs out from his session, we update the ‘sessions view’ for admin1 automatically.
Recently, we have noticed that our mechanism is oversensitive in some cases and creates high load on the server; for example, when many admins are connected simultaneously to SmartConsole and many session operations (login/logout/publish/discard) are performed constantly.
I don't know how many admins work in this environment simultaneously, but based on the above I understand that many session operations (login/logout) were performed by Firemon, which may be the reason for the increase in load when Firemon is enabled (this is an assumption; I can't say for sure without live investigation).
The NGM_QUERY_LIVENESS_OPTIMIZER is a temporary solution which turns off the live mechanism for the ‘sessions view’ and ‘gateways view’.
We are actively working on a full and solid solution to make our live mechanism work as expected, without overloading the server. Once this fix is ready and delivered to Jumbo, I will update this thread.
Also, we have engaged with the Firemon team and we’re working together to make sure they use the APIs of the Management efficiently.
Regarding the profiles:
@M_Ruszkowski , I understand that you changed the default values together with applying the NGM_QUERY_LIVENESS_OPTIMIZER solution.