I am upgrading a gateway cluster from R80.10 to R81.20. Management is already on R81.20 (upgraded from R80.40 a while ago). The R81.20 HFA is Take 90. The old R80.10 hardware is on the latest HFA available. I verified that an upgrade from R80.10 to R81.20 is allowed for gateways as long as R80.10 is 64-bit (which it is). I am not upgrading in place; I am installing on new hardware from scratch.
I tried an MVC upgrade (which I found out is the default for a new install of R81.20). I did a fresh install on the new hardware for the secondary, with the same settings as the old member (although the default gateway route was somehow missed). When I pushed policy to the secondary (after establishing SIC and changing the object version to R81.20, of course), the secondary lost all connectivity to remote and local subnets, even the local management network. The gateway also rebooted all by itself. I had to run unloadlocal to get connectivity back, because the gateway was stuck in the defaultfilter policy that is normally only used during bootup. After the unloadlocal, pushing policy again had the same result, except this time the firewall actually said on the screen that it was going down for reboot or something like that (someone else was watching the screen, so I didn't see the message).
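A missed route like the default gateway above can be caught before the policy push by diffing the routing tables of the old and new members. The following is only a minimal sketch: it assumes the routes have been saved as text in Linux `ip route` style (destination in the first field), and the function names and sample addresses are made up for illustration, not taken from the actual gateways.

```python
def parse_routes(ip_route_output):
    """Collect route destinations ('default', '10.10.0.0/16', ...) from
    `ip route`-style text, where the destination is the first field."""
    destinations = set()
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields:
            destinations.add(fields[0])
    return destinations


def missing_routes(old_output, new_output):
    """Return destinations present on the old member but absent on the new one."""
    return parse_routes(old_output) - parse_routes(new_output)


# Hypothetical sample data (documentation addresses, not the real network).
old_member = """\
default via 192.0.2.1 dev eth0
192.0.2.0/24 dev eth0 proto kernel scope link src 192.0.2.11
10.10.0.0/16 via 192.0.2.254 dev eth0
"""
new_member = """\
192.0.2.0/24 dev eth0 proto kernel scope link src 192.0.2.12
10.10.0.0/16 via 192.0.2.254 dev eth0
"""

missing = missing_routes(old_member, new_member)
print(sorted(missing))
```

This compares destinations only. A route that exists on both members but points at a different next hop would not be flagged.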
Since then, I have looked at the logs and I do see crash dumps of fw_worker at the same time that both policy push attempts were made to the secondary (both the initial policy and the second time I tried).
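To make the "same time" correlation less of an eyeball judgment, the push times and crash dump times can be compared against a fixed window. This is a rough sketch only: the timestamps below are invented placeholders (not the real log entries), and the five-minute window is an arbitrary assumption.

```python
from datetime import datetime, timedelta


def crashes_near_pushes(push_times, crash_times, window=timedelta(minutes=5)):
    """Map each push time to the crash times within `window` of it (before or after)."""
    return {
        push: [crash for crash in crash_times if abs(crash - push) <= window]
        for push in push_times
    }


FMT = "%Y-%m-%d %H:%M:%S"

# Placeholder timestamps for illustration only.
pushes = [datetime.strptime(t, FMT) for t in (
    "2024-01-10 09:00:00",
    "2024-01-10 09:40:00",
)]
crashes = [datetime.strptime(t, FMT) for t in (
    "2024-01-10 09:01:30",
    "2024-01-10 09:41:10",
    "2024-01-10 13:00:00",
)]

matches = crashes_near_pushes(pushes, crashes)
for push, hits in matches.items():
    print(push, "->", len(hits), "crash dump(s) within window")
```

A crash dump that matches no push (the third placeholder entry here) would point to a trigger other than policy installation.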
When I tried to load the locally stored policy manually on the gateway itself, it gave an error. The policy was corrupt somehow (maybe because of the loss of connectivity to management during the push) and wouldn't load. I had to recreate the InitialPolicy per an SK to get the firewall out of the defaultfilter policy and back to InitialPolicy. I haven't done anything with it since then.
There was no downtime for this issue as the old primary was still active the entire time. I put the old R80.10 secondary back in place after the problems.
Looking at SKs and going over the documentation again, I found 2 things that piqued my interest.
1. R81.20 Critical Information page: Security Gateway may crash when route lookups encounter an unresolved next hop (PRJ-49644). This may not be it, because the page says the issue was introduced in Take 96 and fixed in Take 98, and I used Take 90. Still, because I didn't have a default route, I am thinking it is a possibility. I set up a VM lab and tested the scenario where the default gateway on the new secondary was missing, and I didn't get a crash. I used a simple policy (no VPN, etc.) for that test though, so it wasn't testing the actual policy.
2. The R81.20 documentation states that MVC is not supported when going to new hardware (and MVC is now the default for new installs unless you disable it). The new Dell hardware is fully supported on the Check Point hardware compatibility list. It is a newer Dell model than the old hardware, but it has the same interface names, same number of cores, same CoreXL setup, similar BIOS settings, etc. I am not sure whether that would cause a crash or just a state sync problem. I do see in the logs that the fw_worker crash happened after the full sync, but lots of things happen at the same time during a policy push, so the full sync is not necessarily the cause.
Does anyone know what PRJ-49644 really means as far as route lookups go, and whether it could apply to Take 90 even though the article mentions Take 96?
Does anyone know why MVC isn't supported for hardware upgrades (maybe as a safety measure for those who upgrade to a different CoreXL setup, number of CPU cores, etc.)? And could it cause a fw_worker crash?