melcu
Participant

No Downtime (Zero Downtime) hardware refresh

Hi Mates,

I need some advice from you experts!
One of my customers is going to upgrade some 5000-series gateways in a classic HA Active-Passive (A-P) cluster to brand-new 9400 gateways.

How in the world can I do this with zero downtime without messing with the SND cores? There is no way to revert back to 20/24 cores (or however many the 9400 has) without, again, downtime! I mean, I could just lower the number of SND cores on one 9400, join it to the cluster (5x00 + 9400 with the lowered core count) and it would be just fine, but err... my brain is in a boot loop and I can't figure it out!

The only way I see is to remove the standby member of the current 5000 cluster, add the new 9400 gateway, and try to be flash-fast in disabling ClusterXL on the 5000 when the 9400 becomes active.
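For the failover step itself, the command sequence could look roughly like this on the old active member (a sketch only, assuming Gaia with ClusterXL; outputs and member names will differ in your environment):

```shell
# On the old 5000 active member, once the 9400 has joined and is standby:
cphaprob state            # confirm both members are visible and the 9400 is STANDBY
clusterXL_admin down      # take this member administratively down -> traffic fails over
cphaprob state            # verify the 9400 is now ACTIVE
# If something goes wrong, 'clusterXL_admin up' brings the old member back.
```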

 

Any ideas ? (will be highly appreciated).

 

Thanks

1 Solution

Accepted Solutions
the_rock
Legend

Sadly, I doubt anyone can guarantee them they would not lose a single packet. The last time I did this, no packets were lost, though I always see one timeout when we run a constant ping.

Andy
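That "constant ping" check can be turned into a quick drop counter after the cutover. A minimal sketch (the sample log below stands in for real ping output against the cluster VIP):

```shell
# Write a sample ping log; in practice this would be 'ping <VIP> | tee /tmp/ping.log'
cat > /tmp/ping.log <<'EOF'
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.5 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.4 ms
Request timeout for icmp_seq 3
64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.6 ms
EOF

# Count the timeouts -- each one is a probe lost during the failover
drops=$(grep -c 'timeout' /tmp/ping.log)
echo "drops=$drops"
```

One or two timeouts right around the failover, as above, is about the best anyone can honestly promise.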


18 Replies
the_rock
Legend

I would follow the process below. I have done it many times with no issues.

Best,

Andy

https://community.checkpoint.com/t5/Security-Gateways/Replace-Upgrade-Cluster/m-p/157228#M27268

melcu
Participant

Hey Andy.

I know about that, but I was thinking about something like MVC (Multi-Version Cluster), but for hardware. Believe it or not, the customer doesn't want a single packet or session to be lost 😞 A difficult one, but it is what it is.

I already did this once by messing with the SND cores, but that was a cluster of 7000 gateways doing about 2 Mbps with a "peak" at 8 Mbps :))) so I could afford to run 2 SND cores for everything.

This one is different though... the 5400's CPUs are already screaming, so I cannot mess with the 9400's SND cores.

I think I will just let them know that there will be a small outage and that's it: move traffic to the other site and do the hardware upgrade.

 

Thanks!


AkosBakos
Advisor

I agree with Andy; we only ever promise 99.999%.

Akos

----------------
\m/_(>_<)_\m/
the_rock
Legend

I think there is a saying in North America (well, maybe more specifically the USA; not sure about here in Canada): "Only two things in life are guaranteed... taxes and death". Though that's true no matter where in the world you go lol

Andy

melcu
Participant

Haha! That's a really good one!

Indeed, I once messed up a whole cluster in the middle of the day with a simple accelerated policy installation. Both members rebooted (kernel panic) at the same time! So... nothing is guaranteed (besides what you've already indicated 🙂 )

AkosBakos
Advisor

What was the version?

----------------
\m/_(>_<)_\m/
melcu
Participant

Some R81 (not R81.x0, just plain R81)... ancient times 🙂 It was the first time Accelerated Policy Installation was available.

the_rock
Legend

As another saying goes, "No point crying over spilled milk", as in all we can do is learn from our mistakes and not repeat them.

That's it 🙂

Andy

the_rock
Legend

Let's see... I have messed up in the past with Fortinet, Palo Alto, Cisco, and Check Point, haha. If life was perfect, none of us would have these jobs lol

Andy

Chris_Atkinson
Employee

It is wise not to guarantee zero downtime for such a swap.

For awareness, the two appliance families also operate with different SecureXL modes by default.

CCSM R77/R80/ELITE
the_rock
Legend

Agree 100% 🙂

JozkoMrkvicka
Authority

What is the current version on the 5000 cluster?

If the 9400 appliance has more cores than the 5000, that should be the better way to go. If the naming of all configured interfaces matches between the old and new member, then you should be able to disconnect the old standby member from the cluster (cpstop and/or shut all ports), connect the new 9400 member, reset SIC, push policy, and it should come up in standby state.

You can also disable the out-of-state packet check in Global Properties for the duration of the initial failover.

The best option is to have the new 9400 member configured and cabled in advance, so you do not have to play with cables during the change window. The new 9400 member is simply cabled up, with the ports on the switch (or firewall) disabled or enabled depending on which member you want active (old vs. new). During the replacement itself you just shut all ports on the old member, enable them on the new member, and that's it.
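The port flip described above could be done from Gaia clish on the firewalls themselves (a sketch only; eth1/eth2 are placeholders for your actual data interfaces):

```shell
# During the change window, on the OLD member: shut the data ports
clish -c "set interface eth1 state off"
clish -c "set interface eth2 state off"

# On the NEW, pre-cabled 9400 member: bring the same ports up
clish -c "set interface eth1 state on"
clish -c "set interface eth2 state on"
```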

Kind regards,
Jozko Mrkvicka
the_rock
Legend

Since I literally keep all my emails and notes from ages ago, I checked a case with a client back in the R76 days, and they asked TAC this same question: how to ensure they would not lose a single packet. The answer from TAC was that no one at Check Point could give a guarantee for something like that.

I'm 100% positive that even if you opened a case nowadays and asked them this, they would most likely tell you the same.

Best,

Andy

_Val_
Admin

I see there are lots of opinions already expressed here. 

I just want to add a note from personal experience. Even if you are confident you can perform an upgrade or hardware migration with only minimal downtime, announce an extended service window beforehand. The unexpected happens, even to the best of us.

It is always better to tell the business there will be a service interruption and then complete the procedure without one, than to hope for the best and miss because of a random contingency.

melcu
Participant

Hey Val

Of course! My usual window for this kind of work is 30 minutes, and I like to tell my customers that even if I know everything will go smoothly, they still have to be aware that a full outage may occur within that time frame.

I have done lots of replacements, but this is only the second time I have been asked for "no downtime". It worked once 🙂 by messing with the SND cores, but now that is not possible due to the high traffic passing through the gateways.

So in the end the customer has to be aware that even a policy installation can go wrong!

the_rock
Legend

I would say 30 minutes is a bit too short; make it at least 60, or even 90 if possible.

Andy

AkosBakos
Advisor

I agree with Andy, and don't forget the revert process and its time consumption.

If you get stuck somewhere in the process (15 min) -> you start to debug (30 min) -> no success -> decision point (10 min) -> decide to revert -> the revert process (30 min)
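Summed up, that worst case already blows well past a 30-minute window (quick arithmetic, taking the step durations above as rough assumptions):

```shell
# Minutes per phase, per the walkthrough above
stuck=15; debug=30; decision=10; revert=30
total=$((stuck + debug + decision + revert))
echo "worst case until service is restored: ${total} min"
```

Which is why a 60-90 minute window, as Andy suggests, is the safer ask.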

Akos

----------------
\m/_(>_<)_\m/