Gateway fw_worker crash and reboot on upgrading from R80.10 to R81.20 applying first policy

Adam276 · ‎2025-03-31

I am going from R80.10 to R81.20 on a gateway cluster. Management is already on R81.20 (upgraded from R80.40 a while ago). The R81.20 HFA is Take 90. The old R80.10 hardware is on the latest HFA available. I verified that an upgrade from R80.10 to R81.20 is allowed for gateways as long as R80.10 is 64-bit (which it is). I am not upgrading in place; I am installing to new hardware from scratch.

I tried an MVC upgrade (which, I found out, is the default for a new install of R81.20). I did a fresh install on the new hardware for the secondary, with the same settings (although the default gateway route was somehow missed). When I pushed policy to the secondary (after establishing SIC and changing the object to the R81.20 version, of course), the secondary lost all connectivity to remote and local subnets, even the local management network. The gateway also rebooted all by itself. I had to do an unloadlocal to get connectivity back because the policy was stuck in defaultfilter, which is normally used only during bootup. After the unloadlocal, pushing policy again had the same result, except this time the firewall actually said on the screen that it was going down for reboot or something like that (someone else was watching the screen, so I didn't see the message).

Since then, I have looked at the logs and I do see crash dumps of fw_worker at the same time that both policy push attempts were made to the secondary (both the initial policy and the second time I tried).
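For anyone wanting to make the "crash at the same time as the push" observation more precise, here is a minimal Python sketch that pairs each policy push with any crash dump written shortly after it. The function name, the five-minute window, and all timestamps are hypothetical placeholders for illustration; the real times would come from the crash dump files and the policy installation logs.

```python
from datetime import datetime, timedelta


def crashes_near_pushes(crash_times, push_times, window=timedelta(minutes=5)):
    """Return (push, crash) pairs where the crash occurred within
    `window` after the push started."""
    pairs = []
    for push in push_times:
        for crash in crash_times:
            if timedelta(0) <= crash - push <= window:
                pairs.append((push, crash))
    return pairs


# Hypothetical timestamps, NOT taken from the actual gateway logs.
pushes = [datetime(2025, 1, 1, 22, 10), datetime(2025, 1, 1, 22, 45)]
crashes = [datetime(2025, 1, 1, 22, 11, 30), datetime(2025, 1, 1, 22, 46, 5)]

matched = crashes_near_pushes(crashes, pushes)
for push, crash in matched:
    print(f"push at {push} followed by crash at {crash}")
```

A match for every push only shows correlation; it does not say which step of the policy installation triggered the crash.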

When I tried to load the locally stored policy manually on the gateway, it gave an error. It was corrupt somehow (maybe because of the loss of connectivity to management during the push) and wouldn't load. I had to recreate the InitialPolicy per an SK to get the firewall out of the defaultfilter policy and back to InitialPolicy. I haven't done anything with it since then.

There was no downtime for this issue as the old primary was still active the entire time. I put the old R80.10 secondary back in place after the problems.

Looking at SKs and going over the documentation again, I found 2 things that piqued my interest.

1. R81.20 Critical Information page: Security Gateway may crash when route lookups encounter an unresolved next hop. (PRJ-49644). This may not be it, because the page mentions this was in Take 96 and fixed in 98, and I used Take 90. Because I didn't have a default route, I am thinking it is a possibility. I set up a VM lab and tested this scenario, with the default gateway missing on the new secondary, and didn't get a crash. I used a simple policy (no VPN, etc.) for testing though, so it wasn't testing the actual policy.

2. I found that the R81.20 documentation states that MVC is not supported when going to new hardware (and MVC is now the default for new installs unless you disable it). The new Dell hardware is fully supported on the Check Point hardware compatibility list. It is a newer Dell model than the old hardware, but it has the same interface names, same number of cores, same CoreXL setup, similar BIOS settings, etc. I am not sure whether that would cause a crash or just a state sync problem. I do see in the logs that the fw_worker crash happened after the full sync, but lots of things happen at the same time during a policy push, so that is not necessarily the cause.

Does anyone know what PRJ-49644 really means as far as route lookups go, and whether it could apply to Take 90 even though the article mentions Take 96?
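The Take comparison behind this question can be written out as a small range check. This is only a sketch: the function name is made up, and it assumes the Critical Information page's "in Take 96, fixed in Take 98" range is complete and accurate, which is exactly what is being asked here.

```python
def take_in_affected_range(installed_take, introduced_in, fixed_in):
    """True if the installed Take falls in [introduced_in, fixed_in)."""
    return introduced_in <= installed_take < fixed_in


# Takes as stated in the thread: installed Take 90; issue listed for Take 96, fixed in Take 98.
print(take_in_affected_range(90, 96, 98))  # False
```

By the documented range alone, Take 90 is outside the affected window, so PRJ-49644 would apply only if the issue existed before the Take in which it was first reported.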

Does anyone know why MVC isn't supported for hardware upgrades (maybe as a safety measure for those who upgrade to a different CoreXL setup, number of CPU cores, etc.)? And could it cause a fw_worker crash?
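One way to make the "same interface names, same cores, same CoreXL setup" comparison explicit is to diff the two members' attributes side by side. The sketch below is illustrative only: the attribute names and values are hypothetical and would have to be filled in by hand from each member. The one difference shown mirrors the missed default route mentioned earlier.

```python
def differing_attributes(old_member, new_member):
    """Return {attribute: (old_value, new_value)} for every attribute
    whose value differs between the two members."""
    keys = sorted(set(old_member) | set(new_member))
    return {
        key: (old_member.get(key), new_member.get(key))
        for key in keys
        if old_member.get(key) != new_member.get(key)
    }


# Hypothetical inventories; the values are placeholders, not real gateway data.
old_secondary = {
    "interfaces": ("eth0", "eth1", "eth2"),
    "cpu_cores": 8,
    "corexl_instances": 6,
    "default_route": True,
}
new_secondary = {
    "interfaces": ("eth0", "eth1", "eth2"),
    "cpu_cores": 8,
    "corexl_instances": 6,
    "default_route": False,
}

print(differing_attributes(old_secondary, new_secondary))
# {'default_route': (True, False)}
```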

PhoneBoy · ‎2025-03-31

So you had one node on R80.10 and another on R81.20 with MVC?
Fairly certain this isn't supported.

Adam276 · ‎2025-03-31

That is correct. R80.10 primary and the upgrade is starting with the new R81.20 secondary hardware install from scratch.

According to the R81.20 Installation and Upgrade Guide, MVC on an R81.20 cluster member is supported with peers running R77.30 or R80.10 and higher. This is a 2-member cluster.

"The Multi-Version Cluster (MVC) in an R81.20 Cluster Member supports synchronization with
peer Cluster Members that run one of these versions:
R80.10 (or higher)*
R77.30
In a Multi-Version Cluster, the Cluster Members can run only these versions:
R81.20 and R80.10 (or higher)*
R81.20 and R77.30"

The * just says to check the Release Notes for upgrade paths. The Release Notes don't mention MVC but do state that going from R80.10 to R81.20 is OK as long as the R80.10 is 64-bit. I checked, and the R80.10 is 64-bit.

The only mention that MVC might not work is in the Upgrade Guide (Multi-Version Cluster Limitations) where it specifies that MVC is not supported when installing to new hardware. No specifics in that though.

"The Multi-Version Cluster (MVC) upgrade does not support the replacement of the
hardware (replacing the entire cluster member).
The MVC upgrade supports only multi-version software."
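Read together, the two quoted passages reduce to a simple check. Here is a minimal Python sketch of that reading; the function names are made up, it ignores the Release Notes caveat behind the asterisk, and it reflects only the quoted text, not any official support statement.

```python
def parse_version(version):
    """Turn a string such as 'R80.10' into a comparable tuple (80, 10)."""
    major, minor = version.lstrip("R").split(".")
    return int(major), int(minor)


def mvc_pairing_ok(peer_version, hardware_replaced):
    """Apply the two quoted rules for an R81.20 member running MVC:
    the peer must run R77.30 or R80.10 (or higher), and the upgrade
    must not replace the entire cluster member's hardware."""
    if hardware_replaced:
        return False
    peer = parse_version(peer_version)
    return peer == (77, 30) or peer >= (80, 10)


print(mvc_pairing_ok("R80.10", hardware_replaced=False))  # True
print(mvc_pairing_ok("R80.10", hardware_replaced=True))   # False
```

Under this reading, the version pairing in my case passes and only the hardware replacement fails the check.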

PhoneBoy · ‎2025-03-31

The regular upgrade path for R80.10 to R81.20 involves going through R80.40.
I assume this is the case on the gateway side as well, even with MVC.

Regardless, you're in TAC territory here.

Adam276 · ‎2025-03-31

Yeah, the upgrade guide definitely states that for gateways the path from R80.10 to R81.20 is supported. Only for a management upgrade from R80.10 to R81.20 do you have to go through R80.40 first. MVC is stated to work for gateways between those versions, but not to new hardware. The most logical explanation is that MVC to new hardware is somehow causing the crash. I assume PRJ-49644 is not related to the crash, since it was stated to be a bug only in a newer HFA.

PhoneBoy · ‎2025-03-31

Clusters with different hardware types are not supported.
That includes with MVC.

Lesley · ‎2025-03-31

Take 99:

PRJ-56910, PRHF-35918 (Security Gateway): The FWK process may unexpectedly exit after policy installation failure.

PRJ-54402, PRHF-33615 (Security Gateway): In rare scenarios, after an upgrade, the FWK process may unexpectedly exit because of memory corruption.

PRJ-58860, PMTR-110741, PRJ-56437, PRHF-35363 (Security Gateway): In rare scenarios, the FWK process may unexpectedly exit.

And more FWK-related bugs. These are just the highlights.

