Solved: No Downtime (Zero Downtime) hardware refresh

melcu · ‎2024-12-12

Hi Mates,

I need some advice from you, experts!
One of my customers is going to upgrade some 5000-something gateways in a classic HA A-P cluster to brand new 9400 gateways.

How in the world can I do this with zero downtime without messing with the SND cores (as there's no way to revert back to 20/24 cores, or however many the 9400 has, without, again, downtime)? I mean, I could just change the number of SND cores on one 9400, join it to the cluster (5x00 + 9400 with lowered cores) and it would be just fine, but errr ... my brain is in a boot loop and I can't figure it out!

The only way I see is to remove the standby member of the current 5000 cluster, add the new 9400 gateway, and try to be lightning fast in disabling ClusterXL on the 5000 when the 9400 becomes active.

Any ideas? (They will be highly appreciated.)

Thanks

the_rock · ‎2024-12-12

I would follow the process below. I have done it many times with no issues.

Best,

Andy

https://community.checkpoint.com/t5/Security-Gateways/Replace-Upgrade-Cluster/m-p/157228#M27268

Best,
Andy
"Have a great day and if its not, change it"

melcu · ‎2024-12-12

Hey Andy.

I know about that, but I was thinking about something like "mvc" but for hardware. Believe it or not, the customer doesn't want a single packet or session to be lost 😞 Difficult one, but it is what it is.

I already did this once by messing with the SND cores, but it was a cluster with 7000 gateways doing about 2Mbps with a "peak" at 8Mbps :))) I could afford to have 2 SND cores for everything.

This one is different though .. the 5400s' CPUs are screaming, so I cannot mess with the 9400 SND.

I think I will let them know that there will be a little outage and that's it. Move traffic to the other site and do the hardware upgrade.

Thanks!

the_rock · ‎2024-12-12

Sadly, I doubt anyone can guarantee them that they would not lose a single packet. Last time I did this, no packets were lost, though I always see one timeout when we run a constant ping.

Andy
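One way to put a number on "one timeout" during a failover is to look for gaps in the ICMP sequence numbers of a constant ping. The following is a minimal sketch, assuming Linux-style ping output that contains `icmp_seq=N` on every reply line; the function names and the sample output (with a documentation address) are made up for illustration and are not from the thread.

```python
import re

# Assumes Linux-style reply lines such as:
#   64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=0.44 ms
SEQ_PATTERN = re.compile(r"icmp_seq=(\d+)")


def received_sequences(ping_output):
    """Extract the sequence numbers of all replies found in the ping output."""
    return [int(seq) for seq in SEQ_PATTERN.findall(ping_output)]


def missing_sequences(received):
    """Return sequence numbers absent between the first and the last reply."""
    if not received:
        return []
    seen = set(received)
    return [seq for seq in range(min(seen), max(seen) + 1) if seq not in seen]


# Hypothetical sample: the reply for sequence 3 never arrived.
sample_output = """\
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.41 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.39 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=0.44 ms
"""

lost = missing_sequences(received_sequences(sample_output))
print(f"lost replies: {lost}")
```

This only sees gaps between the first and last reply, so a ping should be started well before the failover and stopped well after it. A single ICMP stream also says nothing about whether established TCP sessions survived.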


AkosBakos · ‎2024-12-12

I agree with Andy; we always promise 99.999% only.

Akos

----------------
\m/_(>_<)_\m/
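For a sense of scale, an availability figure such as 99.999% can be turned into a downtime budget. This is a minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (100 - availability_percent) / 100


budget = allowed_downtime_minutes(99.999)
print(f"99.999% over one year allows about {budget:.2f} minutes of downtime")
```

"Five nines" therefore leaves roughly five minutes of outage per year, which is still not the same thing as zero lost packets.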

the_rock · ‎2024-12-12

I think there is a saying in North America (well, maybe more specifically the USA, not sure here in Canada) that goes "Only 2 things in life are guaranteed...taxes and death". Though, that's true no matter where in the world you go lol

Andy


melcu · ‎2024-12-12

Haha! That's a really good one!

Indeed, I've messed up a whole cluster in the middle of the day with a simple accelerated policy installation. Both members rebooted (kernel panic) at the same time! So .. nothing is guaranteed (beside what you've already indicated 🙂 )

AkosBakos · ‎2024-12-12

What was the version?


melcu · ‎2024-12-12

Some R81 (not R81.x0 .. just R81) .. ancient times 🙂 That was when Accelerated Policy was first implemented.

the_rock · ‎2024-12-12

As another saying goes, "No point crying over spilled milk", which is to say all we can do is learn from our mistakes and not repeat them.

That's it 🙂

Andy


the_rock · ‎2024-12-12

Let's see...I have messed up in the past with Fortinet, Palo Alto, Cisco, Check Point, haha. If life was perfect, none of us would have these jobs lol

Andy


Chris_Atkinson · ‎2024-12-12

It is wise to not guarantee zero downtime for such a swap.

For awareness, the devices also operate with different SecureXL modes by default.

CCSM R77/R80/ELITE

the_rock · ‎2024-12-12

Agree 100% 🙂


JozkoMrkvicka · ‎2024-12-12

What is the current version on the 5000 cluster?

If the 9400 appliance has more cores than the 5000 one, that should be the better way to go. If the names of all configured interfaces match between the old and new member, then you should be able to disconnect the old standby member from the cluster (cpstop and/or shut all ports), connect the new 9400 member, reset SIC, push policy, and it should go into standby state.

You can also disable the check for out-of-state packets in Global Properties during the initial failover.

The best option is to have the new 9400 member configured in advance with new cabling, and not to touch any cables during the change window. You will simply have the new 9400 member cabled, with the ports on the switch (or fw) disabled or enabled depending on which member you want to work with (old vs new). During the replacement itself you just shut all ports on the old member, enable them on the new member, and that's it.

Kind regards,
Jozko Mrkvicka

the_rock · ‎2024-12-12

Since I literally keep all my emails and notes from ages ago, I checked one case with a client back in the R76 days where they asked TAC this same question...how to ensure they would not lose a single packet. The answer from TAC was that no one in Check Point could give them a guarantee for something like that.

I'm 100% positive that even if you opened a case nowadays and asked them this, they would most likely tell you the same.

Best,

Andy


_Val_ · ‎2024-12-13

I see there are lots of opinions already expressed here.

I just want to add a note from my personal experience. Even if you are confident you can perform an upgrade or HW migration with only minimal downtime, announce an extended service interruption window beforehand. The unexpected happens, even to the best of us.

It is always better to tell the business there will be a service interruption and manage the procedure without it than hope for the best and miss it because of a random contingency.

melcu · ‎2024-12-13

Hey Val

Of course! My usual window for this kind of stuff is 30 minutes, and I like to tell my customers that even if I know everything will go smoothly, they still have to be aware that a full outage may occur in this time frame.

I have done lots of replacements, but this is only the second time I have been asked for "no downtime". It worked once 🙂 by messing with SND cores, but now it's not possible due to the high traffic passing through the gateways.

So in the end customer has to be aware that even a policy installation can go wrong!

the_rock · ‎2024-12-13

I would say 30 mins is a bit too short, maybe at least 60, or even 90 mins if possible.

Andy


AkosBakos · ‎2024-12-13

I agree with Andy, and don't forget the revert process and the time it takes.

If you get stuck somewhere in the process (15 min) -> you start to debug (30 min) -> no success -> decision point (10 min) -> decide to revert -> the revert process (30 min)

Akos

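Adding up the phases above shows why a 30-minute window is tight. Here is a minimal sketch of the sum, using the durations from this post; the phase labels and function names are only illustrative.

```python
# Worst-case path from the post: get stuck, debug, decide, revert (minutes).
WORST_CASE_PHASES = [
    ("stuck in the process", 15),
    ("debugging", 30),
    ("decision point", 10),
    ("revert process", 30),
]


def worst_case_minutes(phases):
    """Total time if every phase on the failure path runs to its estimate."""
    return sum(minutes for _, minutes in phases)


def fits_window(window_minutes, phases):
    """True if the whole failure path, including revert, fits in the window."""
    return worst_case_minutes(phases) <= window_minutes


total = worst_case_minutes(WORST_CASE_PHASES)
print(f"worst case: {total} min")
for window in (30, 60, 90):
    verdict = "fits" if fits_window(window, WORST_CASE_PHASES) else "does not fit"
    print(f"{window}-minute window: {verdict}")
```

With these estimates the failure path alone takes 85 minutes, so only the 90-minute window suggested earlier in the thread covers a stuck change plus a full revert.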
