David_Evans
Advisor

Policy Push Time. Accelerated Policy Install. Global Objects

What are others seeing for policy install times of large policies, let's say 5,000 lines, to a single firewall?

Our policy installs are again stretching out beyond 20 minutes. Over the last 6+ years I have worked with TAC on a ticket about once a year to work on policy push speed. Generally, after several months of work, we can get it down to 10 to 15 minutes. This usually involves several custom hotfixes that are eventually rolled into a Jumbo. Then over time, with newer Jumbos, policy growth and changes, major version upgrades, and feature additions, it works its way back toward the 20-minute mark. At that point the cycle starts over again, and I spend months working to speed things back up.

There are no CPU constraints on the local MDS, nor on the firewall itself. Pushing policy across a slow link to the other side of the planet is slower, but not significantly slower than to a firewall in the same datacenter on 10 Gb links with sub-1 ms ping times. I can push to 50+ firewalls in 25 to 30 minutes, but 1 IP added to 1 group on 1 rule on 1 firewall is back up to 20 minutes since our recent upgrade of the MDS to R82.

Accelerated policy push is amazing, the 1% of the time we are able to use it. We share groups extensively across our various domains, so the majority of our changes are to network groups in the Global domain. Any assignment of the Global domain to the domain you are working in causes a full policy install to every firewall in the domain. So it's very rare that it is available on any given policy push. I do have an open RFE (I think the 2nd or 3rd attempt) to fix the fact that global assignment breaks accelerated push.

TAC and our sales contacts have been a bit vague on how others are dealing with long policy push times and whether we are the only ones complaining about this ongoing issue. So I'm asking to see what others are seeing pushing "large" policies out to firewalls.


24 Replies
TurgutKaplanogl

Hello,

Are the 50+ firewalls you're pushing the policy to SMB models, meaning their OS is Embedded Gaia? Additionally, are you using a single policy package for all 50+ firewalls? Also, are you managing the 50+ firewalls with a single CMA? I think providing more detailed and comprehensive information would be helpful for this type of question.

Thank you

David_Evans
Advisor

I have 4 primary CMAs. Each CMA has between 15 and 400 firewalls, with a range of policy sizes, from 100 lines to 5,000+ with 100+ inline layers. The 50 firewalls I was referring to above was the same policy pushed out to the 50 firewalls that run that specific policy, vs pushing to only one firewall that runs that policy.

The vast majority of the policy push time is on the MDS, compiling policy, before it ever attempts to push the policy out to the firewalls. So I was not specific about the firewalls the policy gets pushed to, as it seems to make little difference. We have a broad mix of firewall hardware: VSX, Maestro, and CloudGuard spread across on-prem, AWS, and Azure, plus standard active/passive clusters from 3800s to 16000s. The firewall hardware size and location on the planet make some logical sense, in that pushing a large policy to a small or busy piece of hardware will take a minute or two longer, but when dealing with an average time of 20 minutes, that has little overall impact.

TurgutKaplanogl

Hello,

In my opinion, the number of rules you mentioned is not very high if proper sizing has been done. In this case, the slowness issue can be addressed by checking the sizing information and limitations (including considerations to keep in mind during configuration and deployment) provided in the two SKs I shared below for your MDS, CMA, and MLM setups.

https://support.checkpoint.com/results/sk/sk183689

https://support.checkpoint.com/results/sk/sk178325

Additionally, some optimizations will need to be applied for a management architecture of this scale. I also recommend enabling MDPS on some gateways for testing purposes. I have shared details about MDPS (Management Data Plane Separation) in the SK below.

https://support.checkpoint.com/results/sk/sk138672

David_Evans
Advisor

I am running 6000XLs, and other than the total number of firewalls in one CMA, I'm not over any limits; I think with R82 and the current Jumbo I'm now back under that limit as well. I can be the only admin on the 6000XL in the middle of the night, when no other policies are pushing, and we are still currently out in the 20-minute range.
I'd be willing to do some testing with MDPS but, as I stated, pushing the same policy to wildly varying hardware makes only a 5 or 10% change in the overall time.

What I'm asking is: is this constant fight that I've had over the last 6+ years to keep my policy push times around the 10-minute range what other customers are seeing?
  
10-ish minutes to load a policy in the middle of an outage is generally acceptable to everyone involved. But as we get out to 20 minutes and beyond, it is very frustrating and really doesn't look good for Check Point.

Are other customers just accepting this? Working around it in other ways? Given up after similar experiences with TAC cases that drag on for months and years attempting to address this issue?

Here is a common scenario:
Sitting on an outage bridge with a broad set of infrastructure and application support teams represented, after a few minutes of troubleshooting it is found that, for whatever reason, the fix we have decided to implement is to add a single IP to a single rule. I say OK, I've done that, now it will be 20 minutes to push that out to the one firewall that needs it. 20 minutes is a very long time to wait on a call with production down and no progress being made. Even more so when, 2 minutes after you start the policy push, the application team asks you to add a second IP. Now we are at 40 minutes to fix the issue. Sometimes, if I can turn off all our automation that might trigger a global policy reassignment, and I break some of our other procedures, I can add the second IP in such a way that I MIGHT get an accelerated push for it.

It seems like I can't be the only Check Point customer who feels this is an excessively long time to update policy on a firewall when there are no CPU or bandwidth constraints anywhere in the system.



TurgutKaplanogl

Hi,

The 20-minute duration and the example you mentioned are of course not an acceptable timeframe. I believe there can be multiple reasons for this. In environments with the rule sets you specified, such durations are not observed. When such situations occur, we address them by applying different solutions to bring the timing to a more acceptable level. This can sometimes involve optimization of the rule set, performing an action on the gateway where the policy is being installed, adjusting buffer limits of services such as CPM, FWM, etc. on the management server, or making adjustments to the management architecture. However, a 20-minute policy installation on a single gateway is not considered a normal scenario.

If you wish, you can work with Check Point PS or TAC team.

Note: Please check your total GW object count in the MDS against the limitations:

https://www.checkpoint.com/downloads/products/smart-1-6000-security-management-datasheet.pdf

(Screenshot: Smart-1 6000 datasheet limits table)

Thank you

the_rock
MVP Diamond

I also believe engaging the PS team would definitely be a great idea.

Best,
Andy
"Have a great day and if its not, change it"
David_Evans
Advisor

This is not exactly related, but the spec sheet for the 6000 series is still the version from 2022. R82 and R82.10 have much higher listed maximums for many of the limits. In 2022, when the 6000 spec sheet was published, we were on R80.30? So how do you figure out the real supported maximums for this hardware today with the current OS versions?
Until the 7000 series came out, I always assumed that a 6000XL Plus would run the max listed in the release notes for the OS version. When I've pressed Sales, Diamond, or TAC for what the new maximums are, I get some very vague answers.
It seems like a hardware-to-OS-version grid is really what is needed for the recommended limits, if each new OS really is increasing these numbers.
The other item I've asked about is:
OK, I'm slightly over the recommended count on this one limit, but WAY under on these other 3. So am I OK? Other than the one single core that is maxed during policy compile, I have significant amounts of free CPU, RAM, and disk IO by every measure I can find. This would seem to say that the capacity of the MDS hardware is not the issue.

PhoneBoy
Admin

Datasheet numbers are generated with whatever version is the most recent at the time.
They generally don't get updated with new software versions. 
A lot of things can impact the maximum number of connections a given appliance can actually track, including actual traffic patterns, usage of NAT, and software blades enabled.

As far as the length of time it's taking to install a large policy on your MDS, it's likely the single-threaded nature (i.e. single core) and the 32-bit nature of the fwm process (which limits it to 4GB of addressable memory), both of which leave it unable to utilize the full resources of the MDS.
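If you want to verify the 32-bit point on your own server, the ELF class byte of a binary tells you directly. A minimal sketch; the $FWDIR/bin/fwm location is an assumption about the Gaia layout, and the script falls back to a local binary so the technique itself can be tried on any Linux box:

```shell
#!/bin/bash
# Byte 4 of an ELF header (EI_CLASS) is 1 for 32-bit, 2 for 64-bit.
# $FWDIR/bin/fwm is an assumed Gaia path; fall back to a local binary
# so the check can be demonstrated anywhere.
BIN="${FWDIR:-/none}/bin/fwm"
[ -r "$BIN" ] || BIN="$(command -v sh)"
CLASS=$(od -An -tu1 -j4 -N1 "$BIN" | tr -d ' ')
case "$CLASS" in
  1) RESULT="$BIN: 32-bit ELF" ;;
  2) RESULT="$BIN: 64-bit ELF" ;;
  *) RESULT="$BIN: not an ELF binary" ;;
esac
echo "$RESULT"
```

The same answer comes from `file "$BIN"` where the `file` utility is installed; reading the header byte just avoids that dependency.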

Timothy_Hall
MVP Gold

Great point Dameon, forgot that fwm is still 32-bit even though we've been using a 64-bit OS pretty much since Gaia was first introduced.  The newer cpm process is definitely 64-bit.

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization
Timothy_Hall
MVP Gold

The delays in installing policy can be summed up by one daemon: fwm. 

In the modern management architecture, one of the few remaining responsibilities of the fwm daemon is INSPECT code generation and compilation (cpm has taken most of the rest). Unfortunately, because the legacy fwm daemon knows nothing about the postgres database, all relevant objects and data must first be dumped out of postgres back into the legacy files like objects_5_0.C and rulebases_5_0.fws so fwm can work on them. For a full policy installation this takes a while, and is known as the "Legacy Dump". One of the reasons accelerated policy installs are so fast is that they use the "Modern Dump", where many of fwm's compilation duties are performed ahead of time in a multi-threaded daemon like cpm. This information about the different dumps is in the current CCSE R82 courseware but doesn't appear anywhere else in SecureKnowledge or the documentation; can't remember if it was ever there or got removed.

fwm is single-threaded, so all the cores in the world will not matter; in an MDS environment, there are multiple instances of fwm to service the Domains, but each one is still single-threaded. I don't believe Smart-1s have Intel's Turbo Boost enabled, which would allow a single core to overclock and help these types of single-threaded operations complete faster. Gateways don't generally support Turbo Boost either, though the 9300/9400 models are starting to use it. Hopefully SMT/Hyperthreading is not enabled on the Smart-1s in question here (cat /sys/devices/system/cpu/smt/active), as single-threaded processes like fwm do not benefit from SMT and will incur at least an 11% penalty. Remember reading about it in an SK article and can't find that now either. The consensus at CheckMates seems to be that the lower Smart-1 models (600/700) have SMT on, while the higher models (6000/7000) don't.
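The SMT check above can be wrapped so it degrades gracefully on kernels without the sysfs node. A minimal sketch; the path is the standard Linux one mentioned above, assumed (but worth verifying) to behave the same on a Gaia build:

```shell
#!/bin/bash
# Report SMT/Hyperthreading state from the standard Linux sysfs node.
# Generic Linux path; assumed to exist on Gaia, verify on your appliance.
SMT_FILE=/sys/devices/system/cpu/smt/active
if [ -r "$SMT_FILE" ] && [ "$(cat "$SMT_FILE")" = "1" ]; then
    SMT_STATUS="active"      # single-threaded fwm pays a penalty here
elif [ -r "$SMT_FILE" ]; then
    SMT_STATUS="inactive"    # preferable for single-threaded fwm
else
    SMT_STATUS="unknown"     # kernel built without the smt sysfs node
fi
echo "SMT: $SMT_STATUS"
```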

Ever notice how after changing something on a gateway or cluster object and hitting OK it takes a long time, to the point that the SmartConsole GUI client blurs out and maybe even stops responding?  That's because fwm has to handle that specific operation involving gateway objects.  Why is the installation of Threat Prevention policies usually so much faster than Access Control?  Because fwm is generally not involved with TP policies.

So the bottom line is that until Check Point can finally replace the legacy fwm daemon's functions with something that is not single-threaded, these kinds of issues will continue. There is very little Check Point administrators can do about it. At some point along the way in R8X, the Access Control policy verification process was taken away from fwm and given to cpm, which resulted in a big performance improvement for those types of operations. The other legacy daemon, fwd, was "scaled out" in R82 with multiple threads possible (sk182215: "You have reached the maximum capacity this worker's configuration can handle" message in ...); hopefully something like this is in the works for fwm. But it would seem to me that policy compilation is, by its very nature, a linear process that may not lend itself well to parallelization.

David_Evans
Advisor

Your explanation matches what I see. A single fwm process that sits at 100% for 15+ minutes on a single core, then the MDS starts talking to the actual firewall and pushing policy out for the next 5+ minutes. Policy rule count and object count have far more effect on the policy push time than how busy the MDS is with other tasks, anything about the destination firewall, or the network between them.
Small simple policies push fast, big complex policies push slow. Everything else is a rounding error.
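For anyone wanting to confirm the same pattern, a quick way to see which per-Domain fwm instance is pinning a core during an install is a plain ps snapshot (generic procps, nothing Check Point specific):

```shell
#!/bin/bash
# One-shot snapshot of CPU use by every fwm instance (one per Domain on
# an MDS). The [f]wm pattern keeps the filter from matching this
# pipeline's own command line.
PS_OUT=$(ps -eo pid,pcpu,etime,args | awk 'NR==1 || /[f]wm/')
echo "$PS_OUT"
# During a slow install, repeat it to watch which instance pins a core:
#   watch -n 5 "ps -eo pid,pcpu,etime,args | awk 'NR==1 || /[f]wm/'"
```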

I'm more frustrated this time as, after the upgrade from R81.20 to R82, I'm starting the cycle over again. That change to my MDS took me from a borderline-annoying ~15 minutes (running 2 custom hotfixes to address policy push time) to 20+ minutes after upgrading to R82 and adding back the same custom hotfixes.

One of the reasons I waited this long was that I was hoping some other customers had burned their time doing the tuning on R82 to make it at least as good as R81.20 when pushing out large policies. I guess I need to wait longer before I move to R82.10.

I'm really looking for other customers that are having the same experience, or to see if I am still an early adopter, upgrading the MDS 6 months after it became the recommended version.

Timothy_Hall
MVP Gold

I just poked around in R82.10, and it doesn't look like anything has changed with the fwm process, at least as far as I can see.  fwm is also why some GUI functions are still stuck in the legacy SmartDashboard and the old SmartEvent client GUIs.  I'm also not seeing any mention of improvements for fwm in the R82.20 EA release notes, although to be fair it is still EA, and I don't physically have the R82.20 code yet to look at.

One optimization technique would be to remove any unused objects displayed by the Objects Explorer, which would reduce the size of the legacy dump and give fwm less data to process during compilation.  Can't really think of much else to do.

(Screenshot: the unused objects view in the Objects Explorer)

 

David_Evans
Advisor

My previous RFEs for fixing accelerated policy push after global assignment to the domain have ended in a hard NO. Or at least a roadmap item so far in the future that it doesn't actually have an OS version number assigned yet, which in my experience is the same thing as a hard NO.
I've just had my weekly Diamond meeting. Today I was told it was coming in a future Jumbo. It was hinted that it was already in progress as a feature add.
Now when I pressed further for details, like:

Which Jumbo?

Which year / quarter?

Will it be just the MDS at that version, or will it require a Jumbo / minimum version on the firewall target?

Then things got a little cloudier, but I'm hopeful that is a result of just not having the details yet, and not the result of no work actually having started on the development of the fix. The previous explanations of why a global assignment breaks accelerated policy push were that it was a significant issue with "global" and the way the accelerated dump is processed, and that interaction wasn't going away until basically all the above issues with fwm went away.

So I'm hopeful that this is really coming "soon"™️.   

PhoneBoy
Admin

Above and beyond the issue you raise here, there is one product we are pushing that is driving a reduced need for fwm: Web SmartConsole.
Web SmartConsole needs the REST API to do what it does.
While we've come a long way since R80 initially added REST API support 10 years ago, there are still several functions that don't have such support...all of which live in fwm.

As some of these functions were implemented before REST APIs were standard, they weren't always implemented in a REST API-friendly way.
This means these functions must be reimplemented to be integrated into something outside of fwm.
All of that takes time, of course.
Note that some features (Legacy VSX) may never have REST API support, though the replacement (VSnext) does.

Bottom line: this is definitely in the roadmap.

Tomer_Noy
MVP Gold CHKP

I have some "inside information" on this 😉

It will only require an MDS update, no need to update the gateways for this enhancement.

If you'd like to get a private HF for your lab, even before it reaches JHF let me know and we can try to promote it. That can give you an early chance to see that it meets your needs.

It's a bit hard to say exactly which JHF, but we are working on it, so hopefully not very far in the future. Definitely this year.

BTW, we didn't get rid of fwm to do this, but it was a non-trivial development.

David_Evans
Advisor

I received the same basic information from my Diamond representative on Friday. This makes me very hopeful that it really is on its way and will address a large portion of our policy install slowness.

We may look at a private HF to start some testing.

Now putting back on my disgruntled customer hat, 20 minutes is still a very long time for the remaining full policy installations. Especially as we also sometimes see the message: "There are too many unpublished changes since the last installation for accelerated push." Depending on how many is too many, that may be my next ask. It would be very disappointing to remove the Global issue and replace it with a "too many changes" issue. Also, "too many changes" isn't listed in sk169096 as a limitation.

David_Evans
Advisor

@David_Evans wrote:

.......
   Especially as we also sometimes see the message: "There are too many unpublished changes since the last installation for accelerated push." Depending on how many is too many, that may be my next ask. It would be very disappointing to remove the Global issue and replace it with a "too many changes" issue. Also, "too many changes" isn't listed in sk169096 as a limitation.


So we got our custom hotfix.... and so far, this is exactly what has happened. All of the errors about "Accelerated push is not compatible with the changes made since the last install" (global changes) have changed to "Not accelerated because there are too many unpublished changes".

Now hopefully this is some minor issue as there were zero locks / unpublished changes in the domain.    So this error message must have some other meaning besides the common usage of "unpublished changes". 

David_Evans
Advisor

@David_Evans wrote:

@David_Evans wrote:

.......
   Especially as we also sometimes see the message: "There are too many unpublished changes since the last installation for accelerated push." Depending on how many is too many, that may be my next ask. It would be very disappointing to remove the Global issue and replace it with a "too many changes" issue. Also, "too many changes" isn't listed in sk169096 as a limitation.


So we got our custom hotfix.... and so far, this is exactly what has happened. All of the errors about "Accelerated push is not compatible with the changes made since the last install" (global changes) have changed to "Not accelerated because there are too many unpublished changes".

Now hopefully this is some minor issue as there were zero locks / unpublished changes in the domain.    So this error message must have some other meaning besides the common usage of "unpublished changes". 


Status Update:

So, after a couple days and a few hours on the phone with our diamond rep enabling debugs that should show an exact reason why a specific policy push was not accelerated, I'm still underwhelmed.  

The custom hotfix for global policy push did allow us to see an accelerated push ~10% of the time, up from <1%. The debugs did give the specific UID of the work session that caused the push to not be accelerated. However, when you would look at that work session, the changes made were all things that should NOT have broken accelerated push.

So we are still troubleshooting that piece of accelerated push: why are so many pushes giving the reason "... not compatible with changes made ..." when the changes made do appear to be compatible?
The issue with too many changes is documented to be a limit of 100 publishes. Not unique changes: the total number of times any administrator pushes the button to publish any number of changes must be fewer than 100. However, with the number of times I've seen that message, I think there is something else going on there as well. We make a lot of changes, but there are some days where there are fewer than 100 publishes across all the domains and we still saw the message (before the hotfix). We have not started to troubleshoot this error yet.
So accelerated policy push is not a fix for the long policy push times yet. There is no predictable way to know when you are going to get an accelerated push, so it is not really useful since you cannot rely on it.

My custom hotfixes that were making improvements to policy push times on the R81.20 MDS did not work as intended in R82. In R81.20 they improved the push time to R81.20 firewalls, which still make up the majority of our firewalls.
When they were rewritten for R82, they assumed R82 firewalls. That is why our policy install times doubled when the MDS switched to R82: we did not update many hundreds of firewalls to R82 on the same day we updated management to R82. So they are now looking to see if the fixes can be backported for R82 management pushing to R81.20 firewalls. Hopefully they can, and we will be at least back down somewhere near 10 minutes for average policy push times.

 I'll keep the thread updated.

PhoneBoy
Admin
Admin

There are different binaries on the management server (in so-called "Backward Compatibility" packages) for compiling and installing policy for different versions of the Firewall.
If the fix was only included in R82 and NOT backported to the Backward Compatibility binaries, you would see the issue on R81.20 (and earlier versions).
That would imply another fix is needed both for R81.20 Management installations and for R82 to update the backward compatibility packages.

David_Evans
Advisor

So another couple weeks have gone by with significantly more time spent troubleshooting and gathering logs, and no progress.

Still no hotfix to get the normal policy push time back down to where it was before the MDS upgrade to R82.

Accelerated push is still around 10% of our total policy pushes since fixing the global assignment issue.  The primary reason for policy pushes not being accelerated is:
"status=TOO_MANY_AUDIT_LOGS, detailedReason=[There are too many audit logs between worksessions, can't go over them all and decide if eligible]"

This is not the 100 times the "publish changes" button was pushed, but some other audit log. We are still working to figure out what audit log it is referring to and what is generating "too many" of them.

So I am at 2 months since my R81.20 to R82 upgrade. Jumbo 73 plus 3 custom-written hotfixes, and still not back to the performance I had on R81.20. I've also picked up a new performance issue with SmartConsole speed: it's regularly 10 to 20 seconds to move from policy tab 1 to policy tab 2. We are working on that as a separate ticket with no progress.

This experience is what I wanted to see whether others were having after MDS upgrades when I started this thread. After any significant version upgrade of the MDS, we have months of troubleshooting issues, every time. This time seems a bit worse than usual, but about what I expected.

Timothy_Hall
MVP Gold

Any chance this is related? sk184585: Concurrent policy installation causes high CPU usage in VSX environments

The audit log being referred to is almost certainly for configuration changes made by users in the SmartConsole GUI or management API connections.   I would assume that when deciding whether to attempt an accelerated policy install, the decision process looks back at changes since the last policy install and checks whether any are present that would disqualify an accelerated policy install.  Sounds like it won't look back more than 100 changes in the audit log and just punts to a full policy installation at that point.

Any chance you have something constantly accessing the management API and running up the number of "changes" by using read-write sessions instead of read-only sessions, when the only intent of the API session is to just read data without making changes?  Such as a third-party monitoring or reporting system?  Look up the "last published session" API capability if this is the case.
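On the read-only session point: the Management API login call documents a read-only flag, which should keep monitoring scripts from opening sessions that count as change sessions. A sketch that only builds and prints the request body (the server address and monitor account are placeholders; verify the flag against your API version before relying on it):

```shell
#!/bin/bash
# Build (but do not send) a read-only login request for the management
# web API. The /web_api/login endpoint and read-only flag follow the
# published Management API; address and credentials are placeholders.
MGMT="203.0.113.10"
PAYLOAD='{"user":"monitor","password":"<password>","read-only":true}'
echo "POST https://${MGMT}/web_api/login"
echo "$PAYLOAD"
# Against a live server this would be:
#   curl -sk "https://${MGMT}/web_api/login" \
#        -H 'Content-Type: application/json' -d "$PAYLOAD"
# with the returned sid passed in the X-chkp-sid header of later show-* calls.
```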

I'd suggest checking the audit log from the Logs & Events tab to see if something like this is constantly racking up "changes" that keep you over 100 all the time.  Might be some way to muzzle that functionality and keep you under 100 more often.

Presumably, there is also a way to increase the number of lookback changes above 100, but after poking around for a bit, I don't see an easy way to do so.

David_Evans
Advisor

They did give me a way to change a setting specifically for audit logs from 1,000 to 3,000 (not the 100, which seems to be a different limit for actual changes). This 1,000 limit appears to be raw audit log lines, regardless of what the log is documenting. Even things like failed logins from Qualys scanners that get logged go against this count. Also, as you mentioned, API calls, even ones that are explicitly read-only, also get counted. Even Skyline custom scripts that elevate to expert mode may count. We are still working on this and I don't have all the answers yet. But with the read-only API calls and Skyline custom scripts, we hit 1,000 audit lines in a little over an hour without a single actual read-write login being made. A Qualys scan of even a few firewalls can take us over 1,000 lines of failed logins in minutes.
We just went to 3,000 a few days ago, so I have not got a good feel for how much of a help that is going to be. Also, I am still waiting on validation that any audit log line, for anything, counts against the 1,000/3,000 limit.

David_Evans
Advisor

@Timothy_Hall wrote:

I just poked around in R82.10, and it doesn't look like anything has changed with the fwm process, at least as far as I can see.  fwm is also why some GUI functions are still stuck in the legacy SmartDashboard and the old SmartEvent client GUIs.  I'm also not seeing any mention of improvements for fwm in the R82.20 EA release notes, although to be fair it is still EA, and I don't physically have the R82.20 code yet to look at.

One optimization technique would be to remove any unused objects displayed by the Objects Explorer, which would reduce the size of the legacy dump and give fwm less data to process during compilation.  Can't really think of much else to do.

(Screenshot: the unused objects view in the Objects Explorer)

 



We do have some object cleanup we can do. We have some of this already scripted, but the API calls we were using in R81.20 stopped seeing some global network groups as "in use" if they were only used in NAT rules or Encryption Domains, so we paused some of the automation around this. We are starting to test again in R82 to see if the same issue exists. But we are looking at maybe a 5% reduction in total objects, so I'm not thinking that is going to be a huge help.

I thought the issue with dumping all the objects was a thing of the distant past. Starting in R77.x or R80.x, Check Point no longer sent the full database of objects down to all the firewalls; at that point it only sent the objects actually used in the specific policy as part of the policy. Are you saying that this legacy dump dumps the full database every time, and not just the objects specifically used in the policy it is currently compiling? That would seem not to track with how much faster the smaller policies compile.

the_rock
MVP Diamond

Hey David,

Let's see what others have to say, but to me, logically, if someone has 5,000 rules, a 10-15 minute policy push time sounds sort of normal/expected.

 

Best,
Andy
