Solved: Re: Zero Downtime Upgrade - R80.10 - R80.40

LostBoY · ‎2021-04-28

I will be following the sk below to upgrade my VSX Cluster from R80.10 to R80.40.

https://sc1.checkpoint.com/documents/R80.40/WebAdminGuides/EN/CP_R80.40_Installation_and_Upgrade_Gui...

My confusion is about the following part:

In the Install Policy window:

In the Policy field, select the default policy for this VSX Cluster object.
This policy is called:
<Name of VSX Cluster object>_VSX

Now, I also have a VS which carries the traffic of my environment, so in this step do I need to install only the cluster policy, or the cluster policy plus the VS policy as well?

Also, in the following part:

Stop all Check Point services:

cpstop

Notes:

This forces a controlled cluster failover from the old VSX Cluster Member M1 to one of the upgraded VSX Cluster Members.
At this moment, all connections that were initiated through the old VSX Cluster Member M1 are dropped (because VSX Cluster Members with different software versions cannot synchronize).

Is this like a normal failover, where switching the members causes a few timeouts while traffic is shifted to the new member? So ideally traffic should be back to normal after a few timeouts?

the_rock · ‎2021-04-28

Good point there... I don't think the same process applies to VSX as to a regular firewall cluster. You might wish to check with TAC.

Best,
Andy
"Have a great day and if its not, change it"

Bob_Zimmerman · ‎2021-04-28

The "In the Install Policy window" part occurs multiple times in the process. Which one are you concerned about? It's specifically talking about the VS0 policy, which normally governs management access to the cluster members themselves. This policy is installed as a part of vsx_util reconfigure, but pushing after that is a good idea to get the policy rebuilt for the new version.

The failover when you stop services on member 1 would be a stateless failover. All ongoing connections will be lost, and new connections should work immediately. There is no time at which a new connection cannot be formed, thus zero downtime. If you want your upgrade to be more like a normal failover (to preserve long-running connections), you should look at the Multi-Version Cluster Upgrade.
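To make the stateless failover concrete, here is a toy Python model. It is only a sketch: the `Member` class and `failover` function are made up for illustration and say nothing about how Check Point actually implements its connections table.

```python
# Toy model only: these names are hypothetical and do not reflect
# how Check Point actually implements its connections table.

class Member:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.connections = set()  # established connections this member knows about

    def open_connection(self, conn_id):
        # A brand-new connection can be accepted by whichever member is active.
        self.connections.add(conn_id)

    def has_state_for(self, conn_id):
        return conn_id in self.connections


def failover(old, new, state_sync):
    """Move traffic from old to new; return the connections that survive."""
    if state_sync:
        # Synchronized failover: the new member already holds a copy of the table.
        new.connections |= old.connections
    # Stateless failover: nothing was copied, so ongoing connections find no state.
    survivors = {c for c in old.connections if new.has_state_for(c)}
    old.connections = set()
    return survivors


m1 = Member("M1", "R80.10")
m2 = Member("M2", "R80.40")
m1.open_connection("backup-job")
m1.open_connection("big-download")

survived = failover(m1, m2, state_sync=False)
print(sorted(survived))                  # []  (ongoing connections are lost)
m2.open_connection("new-session")
print(m2.has_state_for("new-session"))   # True (new connections work at once)
```

In this model the only difference between the two kinds of failover is whether the table is copied before traffic moves, which is why a stateless failover loses ongoing connections but never blocks new ones.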

Normal failovers shouldn't involve a few timeouts. In the last several upgrades I have done, nobody outside my team even noticed the change.

Kaspars_Zibarts · ‎2021-04-28

I would definitely go with the MVC upgrade. Plus, if you are running the cluster in VSLS mode as opposed to HA, you can fail over one VS at a time, which gives you better control over the upgrade.

LostBoY · ‎2021-04-29

I looked at the MVC upgrade, but the connections I have are static NAT based, and that is a limitation in MVC, hence I am going with Zero Downtime. I wouldn't mind a few dropped connections as long as they are restored in a minute or two.

Bob_Zimmerman · ‎2021-04-29

I think you misunderstood the limitation. It only applies to failovers from R80.40 back to an earlier version, which should only happen if the upgrade breaks things anyway. It also only applies if you are using VMAC mode.

LostBoY · ‎2021-04-29

OK, got it.

One more thing here is putting me off.

https://sc1.checkpoint.com/documents/R80.40/WebAdminGuides/EN/CP_R80.40_Installation_and_Upgrade_Gui...

At the bottom of this link a note states:

When Cluster Members of different versions are on the same network, Cluster Members of the new (upgraded) version remain in the state Ready, and Cluster Members of the previous version remain in state Active Attention.

Cluster Members in the state Ready do not process traffic and do not synchronize with other Cluster Members.

This is the condition before switching on MVC, and it will change once MVC is switched on. Is this correct?

Wouldn't this condition auto-correct once MVC is enabled? Isn't it always the case during an MVC upgrade that an upgraded member is in the "Ready" state at first? But then why might the next steps be required, like removing physical interfaces, shutting down interfaces, etc.?

LostBoY · ‎2021-04-29

Yes, that's correct. It specifically says to install the cluster object policy. My confusion is this: after upgrading the secondary member I need to force a failover. In that case the upgraded member should have the VS policy as well, so that it can handle the running VS traffic. But the sk says to install the cluster object policy, hence my confusion about whether only the cluster policy is to be installed, or both the cluster and VS policies.

Also, normally during failover testing users didn't even notice that anything had changed. I just wanted to confirm that it is going to be the same in this case. Do you mean that during the upgrade, when the member is switched, it takes more time to build connections compared to a normal failover scenario?

Kaspars_Zibarts · ‎2021-04-29

You are right - the document only refers to the VS0 policy. If I'm honest, I always install policy on all VSes just to be sure. It takes extra time but I think it's worth it. 🙂

As for the failover, for damage control you can allow out-of-state connections before the upgrade and revert to normal after the upgrade. This way, if a TCP connection isn't synchronised but is still ongoing, it will be accepted and there will be no need to restart it (e.g. long-running jobs like backups).

LostBoY · ‎2021-04-29

Thanks... this looks helpful

Bob_Zimmerman · ‎2021-04-29

For the policy question, it depends. 'vsx_util upgrade' changes the version of the VSX cluster object, all the physical member objects, all of the hidden VS member objects, and all of the VS cluster objects. You should install policy with the new version before failing traffic to a member (physical or VS) running the new version. If you're doing the VSLS trick, you only need to install the VS0 policy to get it updated, then you can install the individual VS policies as you are ready to fail them over.
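The ordering described above can be sketched as a small plan generator. This is a rough Python illustration, not a Check Point tool: the function name, the cluster name and the VS names are hypothetical, and the output is plain descriptive text rather than real commands. Only the `<Name of VSX Cluster object>_VSX` naming pattern comes from the guide quoted earlier in the thread.

```python
# Hypothetical helper: it only prints the order of operations as text.
# It does not talk to a management server or run any Check Point command.

def policy_install_order(cluster_name, virtual_systems):
    """Return the install/failover order for a per-VS (VSLS-style) upgrade."""
    # VS0 first: its default policy is named <cluster object>_VSX.
    steps = [f"install policy {cluster_name}_VSX (VS0)"]
    for vs in virtual_systems:
        # Each VS gets its policy built for the new version
        # before its traffic is moved to the upgraded member.
        steps.append(f"install policy for {vs}")
        steps.append(f"fail over {vs} to the upgraded member")
    return steps


plan = policy_install_order("MyCluster", ["VS1", "VS2"])
for number, step in enumerate(plan, start=1):
    print(number, step)
```

The point of the sketch is the invariant: for every VS, the policy install always comes before the failover of that VS.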

As for the second part, a Zero Downtime Upgrade is not a normal failover. R80.10 can't sync the connection table with R80.40. Think of it as rebooting the firewall, but it comes back up instantly rather than needing to wait for POST, wait for OS startup, wait for service startup, and so on. If somebody is downloading a 100 GB file, and you do the Zero Downtime Upgrade when they have 99 GB, that connection will not survive the failover. They will have to start the download over again (fortunately, most applications have ways to recover from interrupted connections now, but some still don't).

LostBoY · ‎2021-04-29

Thank you.. this clears my confusion

Kaspars_Zibarts · ‎2021-04-29

That's what I meant by allowing out-of-state connections - then the 99 GB download will continue.

Bob_Zimmerman · ‎2021-04-29

Yeah, but that has some other concerns. Most notably, it's a global property, so it applies to all firewalls in the environment. Very few people run just one VSX cluster by itself in a management, so this setting might get pushed to other firewalls completely unrelated to the upgrade.

Also, I don't think it adds ongoing connections to the connections table, it just doesn't drop them. This would deal with some long-running connections like the download or backup which eventually end, but some systems like ATMs often keep the same connection open for over a year with very little data. When you eventually switch the setting off, I think any connections like that will be dropped when you push policy.
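The behaviour suspected in the post above can be pictured with a simplified packet check. This Python sketch encodes that post's assumption (an out-of-state packet is let through but never added to the connections table); it is not confirmed firewall behaviour, and `handle_packet` and its parameters are made-up names.

```python
# Simplified model of the assumption discussed above, not actual
# firewall logic. All names here are hypothetical.

def handle_packet(conn_id, is_new, conn_table, allow_out_of_state):
    """Return 'accept' or 'drop' for one TCP packet."""
    if is_new:
        conn_table.add(conn_id)      # new connection: state is created
        return "accept"
    if conn_id in conn_table:
        return "accept"              # mid-stream packet with known state
    # Mid-stream packet with no state, e.g. after an unsynchronized failover.
    # Assumption: it is passed without creating an entry in the table.
    return "accept" if allow_out_of_state else "drop"


table = set()  # the upgraded member starts with an empty table

# Long-lived session that was established before the failover:
print(handle_packet("atm-session", False, table, allow_out_of_state=True))   # accept
print("atm-session" in table)                                                # False
# Same session once the setting is switched off again:
print(handle_packet("atm-session", False, table, allow_out_of_state=False))  # drop
```

Under this assumption a connection that outlives the maintenance window is still cut when the setting is reverted, which is the concern raised for very long-lived sessions.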

Kaspars_Zibarts · ‎2021-04-30

Yes, that's totally correct, so you always need to evaluate it against your specific environment.

Magnus-Holmberg · ‎2021-04-28

During upgrades from R80.10 to R80.30 I have experienced a few active/active scenarios that were less fun.
So I try to be extra careful with those and actually turn off the production NICs to the VSX node (on the switch) just to make sure that everything works correctly, and keep only sync and VS0 open.

I am not 100% sure what the reason was anymore, but we experienced it on 3 cluster upgrades and after that we just said "F*** it, let's make it bulletproof".

https://www.youtube.com/c/MagnusHolmberg-NetSec

Kaspars_Zibarts · ‎2021-04-28

It's never 100% guaranteed. I saw a weird state even with only Mgmt and Sync connected during one of the latest rollbacks. 🙂

Plus we are talking about R80.40, and it's a totally different beast to R80.30, hehe.

Magnus-Holmberg · ‎2021-04-28

Hehe, I have no production VSX on R80.40 yet 🙂
But I suspect it's a similar upgrade, as I am running R80.30 with the 3.10 kernel.


LostBoY · ‎2021-04-29

@Magnus-Holmberg @Kaspars_Zibarts your conversation is making me nervous.. 😄

LostBoY · ‎2021-04-29

Hey Magnus, big fan of your YouTube content. Good to hear from you. 😀

Can the active/active scenario occur once the members are switched?

Magnus-Holmberg · ‎2021-04-29

I don't fully remember the scenario, but I believe it was after we had made the failover with VSLS, when one member was on R80.10 (possibly with 32-bit VSes) and the second member came up with R80.30 (without any HFA) and 64-bit for the VSes.
The members simply didn't see each other anymore, so both went active instead of being Active/Ready.
We didn't spend much time troubleshooting as it was the middle of the night, so instead, when doing those jumps, we shut down all the interfaces except VS0 and sync, so even if a member went active it would not take any traffic.

Having said that, we recently did some testing upgrading R80.10 VSLS to R80.30 with CDT and it worked perfectly.


motts786 · ‎2021-06-01

"At this moment, all connections that were initiated through the old VSX Cluster Member M1 are dropped (because VSX Cluster Members with different software versions cannot synchronize)"

Does this also apply to Jumbo Hotfixes?

_Val_ · ‎2021-06-02

No. When you install a JHF, the members remain on the same main version, and they continue to sync without any additional effort.
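The rule of thumb from this thread can be written as a one-line check. This is only a toy Python sketch: it is not an official compatibility matrix, it ignores Multi-Version Cluster entirely, and the JHF take numbers are made up.

```python
# Toy rule only, based on the answers in this thread. The take numbers
# below are hypothetical and MVC is deliberately left out of the model.

def can_sync(member_a, member_b):
    """Each member is a (main_version, jhf_take) tuple."""
    main_a, _take_a = member_a
    main_b, _take_b = member_b
    # Only the main version matters in this model; the JHF take is ignored.
    return main_a == main_b


print(can_sync(("R80.40", 100), ("R80.40", 120)))  # True: same main version
print(can_sync(("R80.10", 100), ("R80.40", 120)))  # False: different main versions
```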

