Solved: Re: R80.20 Management Performance

Daniel_Collins · ‎2019-09-06

Hello Check Mates!

I hope you can help shed some light on an issue we're seeing with one of our customers. The customer is commercially sensitive due to some long-standing issues they've had with a 61k appliance, and a recent code upgrade of the system (management only at the moment) to R80.20 has degraded performance from the customer's perspective.

What we're seeing is this:

- Slowness in stacking and unstacking the subject headings in the rulebase
- There are around 700 rules with 200 subject headings in the policy
- When you press the button to drop the subject headings, wire frames appear for the rules first, and the rule content pops into the console a few seconds later
- When adding objects to rules (clicking the *), there is a delay of a second or a few seconds before the search box appears

The management server is on R80.20 with the latest T91 of the JHF installed. Very well specced, 16 cores / 18GB RAM / SSD based flash storage in VMware. The console is being run on a machine with 32 cores and 64GB of RAM, similar storage scenario. We observed the server via SSH while testing these issues and saw no noticeable load on the system, no use of swap and no %WA on I/O.

From our perspective as a partner, the behaviour we see other than the rule stacking is as we'd expect from an R80.x install of management. I do not have a point of comparison for the rule stacking issue, as all of the customers I have worked with lately (in R80.x days) have significantly smaller rulebases or far fewer subject headings.

The customer was on R77.30 before and has noticed that the server performs significantly worse in R80.20 than it did previously. We can replicate these issues through a database export into a lab server as well as by exporting the policy via the python script into a fresh management server, so the problem follows the policy.

There is an element of expectation here, but this customer is commercially sensitive as we will be trying to ensure they continue to replace the 61k's with another Check Point appliance (something that's not SP based) so we're looking to see what we can do in terms of tuning up performance of the management server.

We're not in a position to re-jig the policy (in terms of in-line layers, due to the 61k being on R76SP.50 and the consultancy time needed to do so prior to a replacement solution) but the policy is very tidy. Perhaps some duplication, but nothing severe.

I've been through the VMware tuning guide in sk104848 and have not seen any noticeable difference.

Any thoughts?

HeikoAnkenbrand · ‎2019-09-06

Hi @Daniel_Collins

More to read here:

R80.10 Management Performance Guide

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

Martin_Valenta · ‎2019-09-06

Only 16 CPUs and 18 GB RAM? I would definitely add more RAM. Also, as long as it's a VM, it might not get enough I/O even with SSD if disk resources are shared with other VMs. On the machine where you run SmartConsole, you don't really need 32 CPUs and 64 GB RAM; that will not speed anything up. Maybe it would with R77.x, which was client-based management, but with R80.x everything is handled by the management server.

Timothy_Hall · ‎2019-09-06

What is the network latency between the SmartConsole client system and the SMS? Due to some of the changes mentioned by Dameon in regards to most of the processing happening on the SMS, SmartConsole performance can now be dramatically affected by high network latency in R80+. Make sure that your .NET libraries are up to date on the SmartConsole system (especially if it is an older OS such as Windows 7) and run dxdiag to ensure all hardware-based graphics acceleration is working correctly. Also make sure you are running the latest version of the R80.20 SmartConsole software available here: sk137593: R80.20 SmartConsole Releases
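One rough way to put a number on client-to-server latency is to time TCP handshakes from the SmartConsole machine to the SMS. The sketch below is a minimal, unofficial helper, not a Check Point tool. The host and port are parameters you supply yourself; use whichever address and TCP port your SmartConsole actually connects to.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the time in milliseconds taken to complete one TCP handshake."""
    start = time.perf_counter()
    # create_connection returns only after the three-way handshake completes.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def sample_latency(host, port, samples=10, pause=0.2):
    """Collect several handshake timings and summarize them."""
    timings = []
    for _ in range(samples):
        timings.append(tcp_connect_ms(host, port))
        time.sleep(pause)
    return {
        "min_ms": min(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }
```

Calling `sample_latency("<your SMS address>", <port>)` from the client returns the minimum, median and maximum handshake time. A handshake costs roughly one round trip, so the median is a reasonable stand-in for the latency figure discussed here.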

Any chance that the SmartConsole is being run from inside an RDP session? If so make sure Font Smoothing is enabled in the RDP client, it makes a huge difference.

On the SMS side, even though the OS doesn't seem to be short of memory, your 18GB of total RAM is used to set the maximum Java heap sizes available to SMS processes such as cpm, which is what the SmartConsole GUI is interacting with. Java heap sizes continue to scale upwards until about 35.6GB of RAM. If a process like cpm doesn't have enough heap when working with a large configuration, it can wind up expending more CPU time performing heap garbage collection than actually getting useful work done. Given the size of your configuration, you may want to try increasing RAM to 32GB to scale the Java heap sizes higher, which can make a big difference to processes such as cpm. Core-based resource scaling tops out at 12+ cores, so your 16-core allocation is perfect.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Chris_Atkinson · ‎2019-09-06

Not much really to add to the above, some great insight shared by all.

If the machine running the console is a VM ensure it has enough video memory allocated. I presume also that the production SMS was upgraded using a method providing the XFS file system?

CCSM R77/R80/ELITE

Tomer_Noy · ‎2019-09-07

A lot of good feedback and tips were given.

I can emphasize two points to check, based on what you wrote:

1) Since the Management is running on a VM, make sure that it has dedicated resources. Sometimes you can allocate memory and CPU to a VM, but if it's shared, then other VMs can take it.

2) Latency can be a significant factor in R80.x compared to R77.x. Since most of the processing is done on the server side, the client may need to make many requests to the server to perform operations. If your latency is towards 200ms (or above), then that could have a big impact.

We have recently done a lot of work to move some of the client-server requests to background threads, thus allowing us to process in parallel and avoid blocking the UI. Most of the work concentrated on the rulebase, which is a complex component with many calculations. If you indeed have high latency, then you can open a ticket and request to get these fixes as a private HF. We are working to get them into the next version and hopefully to future jumbo versions as well.

JozkoMrkvicka · ‎2019-09-08

Is there a recommended setup for a VM that is hosting SmartConsole?

Besides the already mentioned points:
1. Dedicated resources
2. The latency between SmartConsole and Management below 200 ms
3. Enabled Font Smoothing

According to the R80.20 and R80.30 Release Notes these are minimum requirements:

Kind regards,
Jozko Mrkvicka

Timothy_Hall · ‎2019-09-08

In regard to dedicated CPU resources, an easy way to see if performance is being impacted by using virtual (non-dedicated) CPU's in a virtualized environment is to look at the CPU "steal" percentage ("st" in top and "%steal" in sar -u) on the SMS/MDS. A nonzero steal indicates the percentage of time execution on a virtual CPU was blocked by the hypervisor waiting for availability of a "real" CPU. Obviously this is not a desirable condition from a performance perspective, and generally if steal is consistently >20% you should probably look at allocating dedicated CPUs. Steal percentage will always be zero on bare metal (non-virtualized) SMS/MDS hardware.
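The same steal figure that top and sar report can be derived from two readings of `/proc/stat` on a Linux-based system. The following is a minimal sketch, assuming the standard Linux field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names are illustrative only.

```python
import time


def read_cpu_ticks(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU ticks spent in 'steal' between two samples."""
    # Only the first 8 counters are summed: guest and guest_nice are already
    # accounted for inside user and nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0 or len(deltas) < 8:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and return steal %."""
    with open(path) as f:
        first = read_cpu_ticks(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = read_cpu_ticks(f.read())
    return steal_percent(first, second)
```

Running `sample_steal()` repeatedly while reproducing the slow operation shows whether steal stays near zero or sits consistently above the 20% rule of thumb mentioned above.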

Daniel_Collins · ‎2019-09-09

Thanks everyone for your feedback, it's really useful.

You mention high latency. What are you referring to: back-end server latency or C2S latency?

It's worth mentioning that all of the customer's issues are easy for me to replicate on my home lab server. Although it is not as well equipped, my C2S latency is around 1ms (as it's only one hop away) and I get all the same GUI and rule loading issues.

As one has mentioned, I genuinely don't think throwing more resources at it is the answer. I have a bit of a VMware background, and allocating more resources than the underlying system has just causes more contention issues. Besides, looking at the system performance, it's not wanting for anything.

From our perspective, for a management server just being used for policy management (no logging, nothing else) by a single administrator with 4 policies and 4 gateways, 8 CPUs and 16GB of RAM should really be sufficient. We are however happy to add more resources; I am just concerned about exacerbating the customer's perception of the product's "degrading" performance.

Also, the performance issues happen when it's the only VM on my lab box with nothing else running, so no contention issues there (SSD storage too).

The SmartConsole is running on physical hardware, not virtualized, in both my environment and the customer's. The customer does RDP to the machine, though; I do not.

Martin_Valenta · ‎2019-09-09

Everybody screamed at first when moving from R77.x to R80.x, but the change of architecture is, in the end, bringing more benefits to all customers.

Daniel_Collins · ‎2019-10-09

Thanks everyone for all your feedback, it's been quite helpful.

We think we've made a good start with TAC. They provided us with a "fixed" SmartConsole and some changes to the Java heap size for CPM, and that has made a significant improvement to the system performance, mostly from the new console, which is a vast improvement.

TAC have confirmed that this *should* be integrated into the main train of the console soon. Although I am not privy to the changes made, I can only guess it's some caching/optimizing of the content pulled from the server.

Martin_Valenta · ‎2019-10-09

I would be interested to know what tuning they did on SmartConsole.

Daniel_Collins · ‎2019-10-09

Me too! But they wouldn't disclose what's been changed ☹️

Timothy_Hall · ‎2019-10-09

There are some hints to what was probably changed in this thread:

https://community.checkpoint.com/t5/Policy-Management/Searching-Network-Objects-in-R80-xx-is-cripple...

Tomer_Noy · ‎2019-10-10

The optimizations were mainly in three areas:

1) Specific optimizations in groups with thousands of objects

2) Identifying UI requests to the server that were blocking the main processing thread, and moving them to background threads. Especially in cases of high latency, or slow responses from the server, this yielded significant improvements. The positive impact is due to the fact that the user doesn't "feel" that his UI is blocked and waiting, plus we are able to make many requests in parallel (instead of sequentially).

3) Additional general optimizations that were found in the investigation (lower impact)
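The effect described in point 2 can be illustrated with a small simulation. This is only a sketch of the general technique (sequential blocking calls versus calls issued from background threads), not Check Point's implementation. `fake_request` is a hypothetical stand-in that simply sleeps for one round-trip time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(latency_s):
    """Stand-in for one client-to-server call; it only sleeps for the round trip."""
    time.sleep(latency_s)
    return latency_s


def run_sequential(n_requests, latency_s):
    """Issue requests one after another, as a blocking UI thread would."""
    start = time.perf_counter()
    for _ in range(n_requests):
        fake_request(latency_s)
    return time.perf_counter() - start


def run_parallel(n_requests, latency_s, workers=8):
    """Issue the same requests from a pool of background threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all results to be collected before the timer stops.
        list(pool.map(fake_request, [latency_s] * n_requests))
    return time.perf_counter() - start
```

With 20 simulated requests at 0.2 s each, the sequential path waits roughly 20 × 0.2 = 4 s, while 8 workers finish in about three rounds of 0.2 s. The gain shrinks as latency approaches zero, which is why the improvement is most visible on high-latency links or with slow server responses.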

Increasing caching was evaluated, but it didn't provide a significant improvement. Also, there are downsides to "over caching" since we need to make sure that objects are up-to-date and this involves extra notifications and monitoring for updates.

Kudos to Amir Jaron, @Nurit_Gr and their developers that implemented this.

The improvements are in R80.40, so anyone who wants to get them via the EA is more than welcome to join.
We also plan to integrate them into later JHFs.

Daniel_Collins · ‎2019-10-10

Thanks a lot for your feedback, it's much appreciated.

So for clarity this will be factored into future releases of major versions of Check Point rather than new builds of the console for older versions? Just concerned the customer might upgrade their console version (because of a new JHF) and these performance improvements aren't there...

Tomer_Noy · ‎2019-10-13

We plan to integrate the improvements into future JHFs as well. I can’t give a specific date on that...

If you are running with private fixes, it’s always recommended to look at the JHF / SmartConsole build SK to verify that your fixes are included before updating.

Server side JHFs have a mechanism to warn you if you’re about to lose a fix, but the SmartConsole is a full replacement, so there are no checks.

Chris_Atkinson · ‎2020-02-09

@Tomer_Noy Are you able to share any further info on the applicable JHF takes & SmartConsole builds now?

Tomer_Noy · ‎2020-02-13

R80.20 JHF take 100 (that was just released) includes these fixes.

They are listed under: PRJ-7609

Chris_Atkinson · ‎2020-02-13

Thanks Tomer!

(Now available per sk137593)
