SmartCenter - SmartConsole Freeze, API unavailable, no login possible when doing Publish

Alexander_Wilke · ‎2024-10-12

Hello,

CheckPoint Professional Services did a migration from an MDS with 10 Domains to a SmartCenter with 12 Policy packages for us.
We are running R81.10 and JumboHFA Take 150. The migration worked fine.

However, we still have major performance issues with the environment. We had these with the MDS too, and hoped that a fresh environment, with all rules newly added by API calls, would solve issues that may have existed in our old MDS database. It did not. 😞

When we edit gateways with several hundred interfaces (e.g. adding a log server or something else within the cluster/gateway object), clicking "OK" in the cluster object dialog takes 10 minutes to complete. The subsequent publish takes the same time again. This is very bad for gateways with 300 or 500 interfaces. Clusters with more members are worse than single gateways (probably because of the member IPs and virtual IPs).

The next bad situation: if we have 200 changes in SmartConsole (rulebase, policy) and publish them, the publish takes several minutes to finish. During this publish, SmartConsole freezes.

Professional Services migrated the policy using a script which runs against the API. Adding 12,500 rules took around 7-8 hours, which is very, very long.
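For scale, those figures work out to roughly two seconds per rule. A quick check of the arithmetic, using only the numbers quoted in this post (12,500 rules, 7 to 8 hours):

```python
# Per-rule time implied by the migration figures quoted in the post.
RULES = 12_500


def seconds_per_rule(hours: float, rules: int = RULES) -> float:
    """Average wall-clock seconds spent per rule for a run lasting `hours`."""
    return hours * 3600 / rules


low = seconds_per_rule(7)   # 7-hour run
high = seconds_per_rule(8)  # 8-hour run
print(f"{low:.2f} s to {high:.2f} s per rule")
```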

Unfortunately, whenever such a "bigger" change is made via SmartConsole or the API, other users on different machines cannot log in to SmartConsole. The API is not available/responding, and existing API tasks that are running may fail.

If we do the changes via SmartConsole, we do not see any load on the Windows client running the SmartConsole in Task Manager.
If we do the changes via API or SmartConsole, on the SmartCenter we can see that 1-2 processes are running with higher load of about 80% (postgres cpm) and sometimes short peaks of less than 1 second of java (CPM) with up to 300% (probably uses more than one CPU core).

We do not see load or latency on the disks, and no network traffic spikes.

From our perspective, it looks like there is no CPU or memory bottleneck. We assume there are issues with the API or the API calls against the database.

In parallel, we will open a ticket with our Diamond Engineer because Professional Services has run out of ideas for now.
However, I would like to hear about your experience with the API and SmartConsole and, if you had similar issues, how you solved them. If there are any parameters we can tweak, please let us know. We have 32 CPUs and 256 GB RAM.

Regards

the_rock · ‎2024-10-12

Ok, just a shot in the dark as they say, but is it possible there are maybe lots of revisions?

Andy

Best,
Andy

JozkoMrkvicka · ‎2024-10-12

TAC should be able to provide you with the cpm_doctor script, which might help you find problems related to slowness.

Kind regards,
Jozko Mrkvicka

Tal_Paz-Fridman · ‎2024-10-13

The performance problem is related to the high number of interfaces, because of the topology calculations. There is a dedicated hotfix that might help. I suggest asking your Diamond Engineer or Professional Services consultant to look at TM-31531.

JozkoMrkvicka · ‎2024-10-13

Are we talking about "In Security Gateway or Cluster with more than 200 interfaces, SmartConsole freezes or crashes" and "The "Get Interfaces" operation in SmartConsole may take a long time, freeze, or fail, when the Secur..." ?

Kind regards,
Jozko Mrkvicka

the_rock · ‎2024-10-13

Gotta be it, I found the same.

Andy

Best,
Andy

Alexander_Wilke · ‎2024-10-13

Hello,

thank you for taking time and trying to help.

1. The issue does not seem to be related to the number of revisions. It is correct that we had around 800 revisions because of the migration phase. I deleted these and created a scheduled API task to purge revisions with "number-of-sessions-to-keep 50", which runs every 12 hours. This did not improve the situation as I would have expected.

2. The situation with editing the interfaces of clusters is well known to us. However, the situation I am talking about is not about editing the interfaces of these clusters; it is about editing the cluster itself, like adding/changing the log servers or other parameters. This results in long publish times. As I said earlier, it looks like it is related to the number of interfaces of the cluster, but not to directly editing these interfaces. I will ask the Diamond Engineer about TM-31531.
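For readers who want to reproduce the purge from point 1, here is a rough sketch of what such a request could look like. Only the parameter "number-of-sessions-to-keep" comes from this post; the command name purge-published-sessions, the host name and the session ID are assumptions or placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_purge_request(server: str, sid: str, keep: int = 50) -> urllib.request.Request:
    """Build (but do not send) a Management API request to purge old revisions.

    `server` and `sid` are placeholders. The command name in the URL is an
    assumption; only `number-of-sessions-to-keep` is taken from the post.
    """
    body = json.dumps({"number-of-sessions-to-keep": keep}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{server}/web_api/purge-published-sessions",
        data=body,
        headers={"Content-Type": "application/json", "X-chkp-sid": sid},
        method="POST",
    )


req = build_purge_request("mgmt.example.com", "SESSION-ID-FROM-LOGIN")
print(req.full_url)
print(req.data.decode())
```

Sending it would need a prior login call to obtain the session ID, and a scheduler (cron or similar) to run it every 12 hours.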

the_rock · ‎2024-10-13

Sounds good, let us know what they say.

Andy

Best,
Andy

Hugo_vd_Kooij · ‎2024-10-14

I can confirm from practical experience that a large number of interfaces on a firewall has a significant impact on any action taken on that object.

The customer with the largest number of active subnets needs almost 5 minutes just to make a single change on the firewall cluster object. The impact seems to be non-linear: adding 1 interface to a list of 200 adds more processing time than adding an interface to a list of 20 interfaces.
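To illustrate what non-linear means here, suppose (purely as an assumption, not a measured model of the topology calculation) that the work grows with the square of the interface count. The extra cost of one more interface then depends on how many are already there:

```python
def cost(n: int) -> int:
    """Hypothetical cost model: work proportional to n squared (an assumption)."""
    return n * n


def marginal_cost(n: int) -> int:
    """Extra work caused by going from n to n + 1 interfaces."""
    return cost(n + 1) - cost(n)


for n in (20, 200):
    print(n, marginal_cost(n))
```

Under that assumed model, the 201st interface costs about ten times as much as the 21st. The real growth rate may well be different.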

I guess you can even test this in DEMO mode.

<< We make miracles happen while you wait. The impossible jobs take just a wee bit longer. >>

the_rock · ‎2024-10-14

Totally logical point!

Best,
Andy

JozkoMrkvicka · ‎2024-10-16

One more practical experience, from a cluster with more than 800 interfaces: even opening the cluster and closing it WITHOUT ANY ACTION (open cluster -> Cancel) takes too much time.

Kind regards,
Jozko Mrkvicka

PhoneBoy · ‎2024-10-14

There are a couple of different issues here:

- Publishing a large number of changes at once: for performance reasons, we generally recommend publishing after roughly 100 changes. You may also try this: https://support.checkpoint.com/results/sk/sk119553
- A large number of cluster interfaces: is this a regular cluster or VSX? Either way, I suspect a lot of CPMI (legacy) code is involved here.

Note that VSnext (replacement for VSX) and ElasticXL (replacement for ClusterXL) will be available in R82.
These mechanisms are more API friendly than existing VSX/ClusterXL.
Also of note, interfaces for gateway objects can be created via API in R82.
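The "publish after roughly 100 changes" recommendation can be sketched for an API script like this. The `call` function is a hypothetical stand-in for whatever client the script uses to send a command and payload. A real client would also have to wait for each publish task to finish before continuing.

```python
from typing import Any, Callable, Iterable

# Hypothetical client hook: (command, payload) -> response.
ApiCall = Callable[[str, dict], Any]


def apply_in_batches(call: ApiCall,
                     changes: Iterable[tuple[str, dict]],
                     batch_size: int = 100) -> int:
    """Send changes one by one and publish after every `batch_size` of them.

    Returns the number of publish calls issued.
    """
    publishes = 0
    pending = 0
    for command, payload in changes:
        call(command, payload)
        pending += 1
        if pending >= batch_size:
            call("publish", {})
            publishes += 1
            pending = 0
    if pending:  # publish whatever is left over
        call("publish", {})
        publishes += 1
    return publishes


# Dry run with a recorder instead of a real API client.
log: list[str] = []
changes = [("add-access-rule", {"layer": "Network", "position": "bottom",
                                "name": f"rule-{i}"}) for i in range(250)]
print(apply_in_batches(lambda cmd, payload: log.append(cmd), changes))
```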

Alexander_Wilke · ‎2024-10-14

Hello,

we do not have any VSX environments.

I checked the SK you mentioned and it tells me:
"This sk is not relevant for R81."

As a first summary from my point of view:
- The long time needed to edit a cluster object or a single gateway object is the result of the topology/interface checks. These take a very long time. Looking at the data sheets of the firewalls, which allow 1024 interfaces, a change will probably take 30 minutes or more. This is totally impractical for an enterprise piece of software. However, we will wait for "TM-31531".

The other problem is with the API or publishing changes. Even if we do not publish more than 100 changes at once, we will get timeouts, because many administrators and many different applications use the API, which results in overlapping actions and locks/freezes.

At the moment I do not see what solution exists for these issues. It looks like there is one process or task which is the bottleneck for all other tasks. Which task/process is it? How can we improve its "speed"?

One of your suggestions was to use the API for "get interfaces" or something like that. However, this will result in the same deadlock as all other tasks against the API. If we are to use the API for more tasks, then the API needs to be responsive and able to handle these many calls in parallel.
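This does not address the server-side bottleneck, but scripts that fail while the API is unresponsive can at least retry with increasing delays instead of aborting. A minimal client-side sketch follows. It assumes the client surfaces an unresponsive API as a TimeoutError; the attempt count and delays are arbitrary illustrative values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      attempts: int = 5,
                      base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run fn(); on TimeoutError wait base_delay * 2**i seconds and retry.

    Gives up and re-raises after `attempts` failed tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for i in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if i == attempts - 1:
                raise  # out of retries, let the caller see the timeout
            sleep(base_delay * 2 ** i)
    raise AssertionError("unreachable")


# Dry run: a call that times out twice, then succeeds.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("API not responding")
    return "ok"


delays: list[float] = []
print(call_with_backoff(flaky, sleep=delays.append), delays)
```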

PS:
I ran "run_cpmdoc.sh" and it told me the audit log file is very big (more than 1 GB). However, I do not know how to rotate or purge this file. We send audit logs via log_exporter to our local SIEM solution, so I think I can delete audit logs older than a few days, but I do not know whether this is possible or how to do it.

the_rock · ‎2024-10-14

Can you run -> find / -name "*audit*" and see what it gives you?

Andy

Best,
Andy

Alexander_Wilke · ‎2024-10-14

Do you mean these?

-rw-rw---- 1 admin root 4.9M Oct 8 15:46 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-08_154618.adtlog
-rw-rw---- 1 admin root 3.7M Oct 9 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-09_000000.adtlog
-rw-rw---- 1 admin root 9.1M Oct 10 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-10_000000.adtlog
-rw-rw---- 1 admin root 5.2M Oct 11 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-11_000000.adtlog
-rw-rw---- 1 admin root 11M Oct 12 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-12_000000.adtlog
-rw-rw---- 1 admin root 16M Oct 13 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-13_000000.adtlog
-rw-rw---- 1 admin root 17M Oct 14 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-14_000000.adtlog
-rw-rw---- 1 admin root 3.5M Oct 15 00:00 /var/log/opt/CPsuite-R81.10/fw1/log/2024-10-15_000000.adtlog
-rw-rw---- 1 admin root 154K Oct 15 01:04 /var/log/opt/CPsuite-R81.10/fw1/log/fw.adtlog

[Expert@xxxxxxx:0]# du -h /var/log/opt/CPsuite-R81.10/fw1/log/
0 /var/log/opt/CPsuite-R81.10/fw1/log/c-icap
0 /var/log/opt/CPsuite-R81.10/fw1/log/dtls_spool
0 /var/log/opt/CPsuite-R81.10/fw1/log/dtls_temp
194M /var/log/opt/CPsuite-R81.10/fw1/log/cpm_doctor
20M /var/log/opt/CPsuite-R81.10/fw1/log/saved_logs/Discard_Worksession
12M /var/log/opt/CPsuite-R81.10/fw1/log/saved_logs/Publish_Worksession
32M /var/log/opt/CPsuite-R81.10/fw1/log/saved_logs
4.0K /var/log/opt/CPsuite-R81.10/fw1/log/blob
40K /var/log/opt/CPsuite-R81.10/fw1/log/amonStatusFiles/0
4.0K /var/log/opt/CPsuite-R81.10/fw1/log/amonStatusFiles/1
44K /var/log/opt/CPsuite-R81.10/fw1/log/amonStatusFiles
0 /var/log/opt/CPsuite-R81.10/fw1/log/cl_del
22M /var/log/opt/CPsuite-R81.10/fw1/log/failed_tasks/IPS_Update
17M /var/log/opt/CPsuite-R81.10/fw1/log/failed_tasks/Policy_Installation
1.4M /var/log/opt/CPsuite-R81.10/fw1/log/failed_tasks/Other
40M /var/log/opt/CPsuite-R81.10/fw1/log/failed_tasks
0 /var/log/opt/CPsuite-R81.10/fw1/log/imported_logs
2.7G /var/log/opt/CPsuite-R81.10/fw1/log/
[Expert@xxxxxxx:0]#
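Since matching on file names may miss the file that cpm_doctor is complaining about, another way to look is by size. The sketch below walks a directory and lists every regular file at or above a threshold (1 GB by default). The path is the log directory from the output above; whether the large file actually lives there is not known from this thread.

```python
import os


def largest_files(root: str, min_bytes: int = 1024 ** 3) -> list[tuple[int, str]]:
    """Return (size, path) for files under `root` of at least `min_bytes`, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size >= min_bytes:
                found.append((size, path))
    return sorted(found, reverse=True)


for size, path in largest_files("/var/log/opt/CPsuite-R81.10/fw1/log"):
    print(f"{size / 1024 ** 3:.1f} GiB  {path}")
```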

the_rock · ‎2024-10-14

Not really, as those are not even close to 1 GB.

Best,
Andy

PhoneBoy · ‎2024-10-14

This is probably a better SK for R81+ and changing the heap memory size: https://support.checkpoint.com/results/sk/sk179819
It seems like running the PUV (without doing an actual upgrade) might suggest a good starting point for this.
Possibly TAC might also be able to make a recommendation based on what they've seen.

As for the API server not being performant enough in larger environments, I don't know what the actual issue(s) is.
However, I can say that a lot of operations involving gateway objects are still leveraging CPMI under the hood.
I suspect some of those operations are still single-threaded, which may account for some of the issues you're experiencing.

Flagging @Tomer_Noy

Alexander_Wilke · ‎2024-10-14

Hello,

The CPM heap size was 12 GB by default. Professional Services said we should try increasing it to 32 GB, so we did. However, it did not solve or change the performance/freeze issue. Maybe it gives more stability, and memory is not a bottleneck for us.

Profile:
-------------------
Machine profile: 131072 or larger without SME
CPM heap size: 32768m

PS:
If CPMI is involved, which process would be relevant for that? I would expect this one process to run at 100 percent if that's the reason, right?

Are there any tweaks for the postgres service? Here are the default values from the profiles script.
I am no database expert at all, but 128 MB and about 8 GB look pretty conservative?

:memory (
    :max_value (
        :name (NGM_CPM_POSTGRES_SHARED_BUFFERS)
        :percentage (8)
        :min_memory_allocation (8692m)
        :max_memory_allocation (9728m)
    )
)
:memory (
    :max_value (
        :name (NGM_CPM_POSTGRES_MAINTENANCE_WORK_MEM)
        :memory_allocation (256m)
    )
)
:memory (
    :max_value (
        :name (NGM_CPM_POSTGRES_WORK_MEM)
        :memory_allocation (128m)
    )
)
:memory (
    :max_value (
        :name (NGM_CPM_POSTGRES_TEMP_BUFFERS)
        :memory_allocation (128m)
    )
)
:memory (
    :max_value (
        :name (NGM_CPM_POSTGRES_EFFECTIVE_CACHE_SIZE)
        :percentage (60)
        :min_memory_allocation (12298m)
        :max_memory_allocation (24576m)
    )
)
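If these entries are read as "take this percentage of total RAM, then clamp the result between the min and max allocation" (an assumption about the file's semantics, not something confirmed in this thread), then on this machine both percentage-based values end up pinned at their maximum. A small sketch of that calculation, using the 251G total reported by free -h in this post:

```python
def effective_mb(total_mb: float, percentage: float, min_mb: int, max_mb: int) -> int:
    """Percentage of total RAM, clamped to [min_mb, max_mb] (assumed semantics)."""
    return int(min(max(total_mb * percentage / 100, min_mb), max_mb))


TOTAL_MB = 251 * 1024  # "251G" total memory as reported by free -h

shared_buffers = effective_mb(TOTAL_MB, 8, 8692, 9728)
effective_cache = effective_mb(TOTAL_MB, 60, 12298, 24576)
print(shared_buffers, effective_cache)
```

Under that reading, the caps rather than the installed RAM would determine what postgres gets. Whether raising them is supported is a question for TAC.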

I have 233 GB of memory unused. Not used as cache, not used for anything!?

[Expert@xxxxxx:0]# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G         14G        210G        2.1G         26G        233G
Swap:           63G          0B         63G

PhoneBoy · ‎2024-10-15

Many (but not all) of the legacy operations flow through fwm.
