jbeckner
Participant

Cluster hardware core count change - prestaging or minimal downtime

Is a hardware quick failover possible with core count different between members?

I have a pair of 4800s (4 CPU cores) on R80.40 that are near EOL.

I have a pair of 6600s (6 CPU cores?) to replace them with.
I want to keep the IPs the same to avoid rewriting rules or licensing issues. 

Management is external on VMware, and it is an active/passive HA cluster. 
State Synchronization and Virtual MAC are enabled under Advanced Settings.
Internal networks also use this cluster as a router for network segmentation - the default route for each subnet is the Virtual IP.

My plan is to change/risk as little as possible to minimize downtime - match the Take and Hotfix version on new hardware.
All devices keep the same IP address in the Gateways and Servers page.
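
The "match the Take and Hotfix version, keep the same IP" goal can be written down as a checklist before the window. Below is a minimal sketch, assuming the values are noted by hand from each appliance; the class, field names and sample values are hypothetical and nothing here is read from a Check Point API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberFacts:
    """Values noted by hand for one cluster member (hypothetical field names)."""
    name: str
    version: str      # software version, e.g. "R80.40"
    jumbo_take: int   # Jumbo Hotfix take number
    mgmt_ip: str      # IP address used for SIC and policy installation
    cpu_cores: int


def prestage_mismatches(old: MemberFacts, new: MemberFacts) -> list:
    """List the differences that go against the 'change as little as possible' plan."""
    problems = []
    if old.version != new.version:
        problems.append(f"version differs: {old.version} vs {new.version}")
    if old.jumbo_take != new.jumbo_take:
        problems.append(f"take differs: {old.jumbo_take} vs {new.jumbo_take}")
    if old.mgmt_ip != new.mgmt_ip:
        problems.append(f"IP differs: {old.mgmt_ip} vs {new.mgmt_ip}")
    return problems


def core_count_changed(old: MemberFacts, new: MemberFacts) -> bool:
    """The core count is expected to differ here; report it separately."""
    return old.cpu_cores != new.cpu_cores


# Made-up sample values (take number and documentation-range IP are placeholders).
old_b = MemberFacts("old_B", "R80.40", 120, "192.0.2.2", 4)
new_b = MemberFacts("new_B", "R80.40", 120, "192.0.2.2", 6)
print(prestage_mismatches(old_b, new_b), core_count_changed(old_b, new_b))
```

The core count is kept out of the mismatch list on purpose: in this migration it is a known difference to plan around, not something to fix.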

Reading a few other threads here, I was a little fuzzy on one critical item:
Is new hardware SIC/policy staging possible here, or quick failover possible with a core count change?

The procedure I saw in another thread was:
1) Power down/disconnect old member B
2) Power on/connect new member B
3) Change Cluster Hardware Type to 6600, Establish SIC on B, push policy to only B
     <A keeps traffic this whole time - "Maintain Current Cluster Member Active set" >
4) Power off A
5) B takes the floating IPs and there is minimal downtime, replace A afterward the same way

If I follow this and push policy to B and tell the cluster to keep the "Current Member Active",
do I really have a cluster at that point anyway?
State can't replicate to the new member because the core count is different? 
Does that create a "split brain" where both members want to be active, making things worse?

I really want to minimize downtime, and "power everything old off and figure out if the new hardware 
works correctly during a total outage" seems like if anything is wrong my outage is getting a lot longer.


I assume "B" has to be the real IP in the cluster setup to establish SIC and push policy, I can't do that on
an alternate IP address or anything. So the clock is ticking when I power things off.

Is there a way to get the real IP on member B with different core counts without the cluster going very wrong? 

 


Accepted Solutions
Kaspars_Zibarts
Employee

You can do exactly as you described. In addition, to minimise the outage I normally allow out-of-state connections on both old members just before starting. When you then push policy to the new HW, it will permit the same, which helps maintain ongoing connections when you switch from old to new with a different number of cores. Once both 6000s are in place you can turn out-of-state drops back on.
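
Why allowing out-of-state packets helps can be shown with a toy model: the new member starts with an empty connections table, so a packet from an already established session has no matching entry. This is a deliberately simplified sketch with hypothetical names, not a description of how the firewall kernel is implemented.

```python
def handle_tcp_packet(conn_table: set, conn_id: str, is_syn: bool,
                      rule_base_accepts: bool, drop_out_of_state: bool) -> str:
    """Return "accept" or "drop" for one TCP packet in the toy model."""
    if conn_id in conn_table:
        return "accept"            # packet belongs to a known connection
    if not is_syn and drop_out_of_state:
        return "drop"              # mid-stream packet with no state entry
    if rule_base_accepts:
        conn_table.add(conn_id)    # start tracking the connection from here on
        return "accept"
    return "drop"


# A session that was established through the old member arrives at the new one.
fresh_table = set()
print(handle_tcp_packet(fresh_table, "client->server:443", is_syn=False,
                        rule_base_accepts=True, drop_out_of_state=True))
print(handle_tcp_packet(fresh_table, "client->server:443", is_syn=False,
                        rule_base_accepts=True, drop_out_of_state=False))
```

In the model, the setting only changes what happens to mid-stream packets without state; anything the rule base rejects is still dropped.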

15 Replies
jbeckner
Participant

So disable "Use State Synchronization" a few hours before?

Old firewall A still has active connections, New B is in place...but will not have connections mirrored to it,
and will not go active.

Change member hardware type, push policy to just New B.

Pull the plug on Old A and then New B is active.

 

Willing to give it a try if that sounds like the right plan.

Kaspars_Zibarts
Employee

No, you disable out of state drops 🙂 in global properties

global~2.png

jbeckner
Participant

Perfect, thank you. Going to review this with my VAR and give this a shot next weekend. 

Appreciate the help and advice. 

My main heartburn right now is that the new devices are only showing the "Firewall" blade license,
not the full set of trial licenses or our first year contract license suite.
"cplic print -x" shows everything, but the GUI does not.

Hoping we plug in "new B", establish SIC, and the licenses show up.
Or at the very least the trial kicks in and those show up, before I move traffic to it.

I will check in after next weekend and let you know how it goes.

Timothy_Hall
Legend

Since the version on your old firewalls is at least R80.40, you may want to check out the relatively new Multi Version Cluster (MVC) method described in sk107042: ClusterXL upgrade methods and paths. It can ensure perfect state sync at all times during the upgrade, and even avoid flaps if Dynamic Routing is in use on the firewalls.

Kaspars' advice about disabling out of state checkboxes will work pretty nicely as well, and was how it was done prior to the introduction of MVC, or with firewalls that were R80.30 or earlier which used FCU instead.

jbeckner
Participant

I will take a look at that then. I am not upgrading the version at all, just the hardware, but perhaps it will work the same way.

Kaspars_Zibarts
Employee

I have to admit that I have not read upgrade manual for a while now but have a vague memory that MVC didn't support core change. But I might be wrong hehe

shrestha
Participant

Hi All,

I am planning a change with a similar scenario, where the replacement firewalls have a different core count from the existing ones. The firewall version will change from R80.10 to R80.20 (the SMS is at R80.20). The IP addresses will not change.

The process I was planning to follow is:

1) Fail over to secondary firewall B and make it active
2) Disconnect cables from old member A
3) Power on/connect new member A
4) Change version to R80.20, establish SIC on A, push policy to only A
5) Do some basic checks like fw stat, fw tab -t connections
6) cpstop on B (no power-down, in case a quick rollback is needed)
7) A takes the floating IPs and there is minimal downtime; replace B afterward in the same way
8) Update the license from the SMS


Do I need this to keep B active until cpstop is done?   <A keeps traffic this whole time - "Maintain Current Cluster Member Active set" >.

My main concern is to avoid split brain until I run cpstop on B.

 

_Val_
Admin

Before anything else, how long are you planning to stay on R80.20? Both R80.20 and R80.30 will be out of support in September. You need to plan to go all the way up to the recommended version (R81.10) ASAP. 

For your actual plan: if CoreXL settings are different, cluster members will not sync. Mind the cluster ID to avoid a split-brain situation. If the cluster ID is the same on the new member, it will come up in the Ready state after the policy push and will not process traffic. It will go Active only after cpstop on the older cluster member. All existing connections at this point will be cut and will have to be re-established.
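
The behaviour described here can be summarised as a small decision function. It is a simplified model of this reply and not ClusterXL's actual logic; the function names are illustrative, and the "Standby" branch for identical core counts is an assumption about an ordinary active/passive pair.

```python
def state_sync_possible(old_cores: int, new_cores: int) -> bool:
    """Per the description above, members only sync when CoreXL settings match."""
    return old_cores == new_cores


def expected_new_member_state(old_cores: int, new_cores: int,
                              old_member_running: bool) -> str:
    """Predict the state of the new member after its first policy push.

    - old member stopped (cpstop or cables pulled): new member goes Active
    - old member running, core counts differ: new member waits in Ready
    - old member running, core counts equal: assumed to join as Standby
    """
    if not old_member_running:
        return "Active"
    if state_sync_possible(old_cores, new_cores):
        return "Standby"
    return "Ready"


# Example: a 4-core member replaced by a 6-core member.
print(expected_new_member_state(4, 6, old_member_running=True))
print(expected_new_member_state(4, 6, old_member_running=False))
```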

jbeckner
Participant

My notes from the upgrade, following Kaspars' advice on the "out of state" setting:

48 second downtime, two people - one to manage cables, one to push policy and observe.

  1. Global Properties -> Stateful Inspection -> Drop out of state TCP packets   - turned OFF
  2. Pulled all network cables on Old B (left cp services running - no cpstop so quick panic fallback possible)
  3. Plugged in only main/management NIC on New B to ping test and reset/establish SIC to B
  4. Adjusted topology on management for cable change (sync now dedicated cable new name) and hardware model
  5. Unplugged all network cables on Old A (was only active gateway) 
    ...outage...
  6. Plugged all B cables into New B
  7. Pushed policy (not threat) to cluster but unticked “fail if not all members available” so only really pushed to New B
  8. New B up running policy as sole member - 48 second downtime for the swap and push
  9. Plugged in only main/management NIC on New A, ping and establish SIC
  10. Plugged in all A cables to New A and pushed policy immediately to both with "push to all members or fail"
  11. Verified cluster membership and failover testing by marking each member down
  12. Push threat policy to both 
  13. Secondary round of failover testing with reboots of inactive members in between
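
A downtime figure like the one quoted above can be measured rather than estimated, for example by probing a host through the cluster once per second during the swap. The sketch below assumes the probe results are already collected as (timestamp, success) pairs; how they are collected (ping, TCP connect, and so on) is left out, and the function name is hypothetical.

```python
def longest_outage(samples):
    """Return the longest outage in seconds.

    samples: iterable of (timestamp_seconds, probe_succeeded), sorted by time.
    An outage is measured from the last successful probe before a run of
    failures to the first successful probe after it. Failures before the
    first success or after the last success are not counted, because they
    have no bounding success on one side.
    """
    longest = 0.0
    last_ok = None
    in_outage = False
    for ts, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest


# Made-up probe log: two good probes, two failures, then recovery.
log = [(0, True), (1, True), (2, False), (3, False), (4, True)]
print(longest_outage(log))
```

With one probe per second the result is accurate to about a second either side, which is enough to compare a rehearsal against the real window.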

 

Original plan was several minutes of downtime to pull all cables from both old gateways at once,
then cable all new gateways and proceed. But "cutting off" New B by only plugging in main NIC
prevents split brain. I can't see a cleaner or faster way to do this without getting into exotic
"what if" scenarios disabling services and possibly taking longer.

 

Licensing sorted itself out after initial push. Noted that in the build and migration document for future reference.
Before that the command line showed licensing, but the GUI on each gateway did not.

 

jbeckner
Participant

I agree with Val, you need to upgrade soon. If you can take two separate outage windows, do the move to new hardware first and upgrade later, to keep each step as simple as possible. Fewer variables and what-if scenarios.

shrestha
Participant

Thanks Val. Yes, to limit the number of variables changing at once, we are planning to upgrade the SMS and then the firewalls after this hardware replacement is done.
Because the CoreXL settings differ, it should be fine as long as the new firewall comes up in the Ready state once SIC is established and policy is pushed, and does not process traffic until cpstop is done on the active member. We have a small outage window for all existing connections to be re-established. The firewalls are currently on R80.10, and I couldn't get the cluster ID with the command "cphaconf cluster_id get".

Thanks Jbeckner. It seems you changed your steps: originally you were going to establish SIC and push policy on new B while A was active, then fail over to new B once it was in the Ready state.
Basically you had a minimal outage starting from step 5, when you disconnected the cables from Old A.
Was your concern with the earlier steps a probable split brain due to the hardware (core/CPU) difference once new B had its cables connected and the policy pushed?

jbeckner
Participant

Was your concern not to do the previously mentioned steps because of probable split-brain due to hardware(core/cpu) difference when new B had the cables connected and the policy pushed?

Correct. I only do these things every several years; we are a one-firewall-cluster shop here (plus some smaller standalones). I did not want to get clever, and wanted minimal risk, thinking and steps during the outage. Locking the cluster to not fail over members in the GUI on push was probably going to be fine, but why risk it? It is one more thing to change and then change back, when my actual outage was less than a minute by having "B" ready for push in advance. Fewer steps = less risk.

Once I understood the note from Kaspars that was my step one:
Global Properties -> Stateful Inspection -> Drop out of state TCP packets   - turned OFF
then I decided the quick outage was acceptable and safer than trying to make sure I avoid split brain or any other cluster member or failover issues. If my packets are out of sync then whatever, just move on with the rule processing only, meaning most things will recover quickly.

Clean break - A is up still running traffic, prep B with SIC and test login/ping and you are as ready as you can get. Yank A cables and no possible chance of anything getting confused, only one member of cluster connected to cables/network at a time. No need to cpstop which can take a while. 
Only apply sync cable(s) from B to A when you are ready to make it a real gateway.
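
The "only one hardware generation cabled at a time" idea can be checked on paper before the window. The sketch below replays the cabling steps from the migration notes earlier in the thread and tests that an old and a new member are never fully cabled together. The step encoding and all names are illustrative assumptions; policy pushes and SIC are not modelled.

```python
# Cabling states: "none", "mgmt" (management NIC only), "all" (every interface).
STEPS = [
    ("start",                {"old_A": "all", "old_B": "all"}),
    ("pull cables on old B", {"old_B": "none"}),
    ("new B mgmt NIC only",  {"new_B": "mgmt"}),
    ("pull cables on old A", {"old_A": "none"}),
    ("cable new B fully",    {"new_B": "all"}),
    ("new A mgmt NIC only",  {"new_A": "mgmt"}),
    ("cable new A fully",    {"new_A": "all"}),
]


def replay(steps):
    """Yield (label, cabling) after each step; cabling maps member -> state."""
    cabling = {}
    for label, change in steps:
        cabling.update(change)
        yield label, dict(cabling)


def mixed_generations(cabling):
    """True if an old and a new member are fully cabled at the same time."""
    full = [m for m, state in cabling.items() if state == "all"]
    return (any(m.startswith("old") for m in full)
            and any(m.startswith("new") for m in full))


def no_member_on_network(cabling):
    """True if no member is fully cabled, i.e. nothing can carry traffic."""
    return not any(state == "all" for state in cabling.values())


for label, cabling in replay(STEPS):
    print(label, mixed_generations(cabling), no_member_on_network(cabling))
```

Reordering the list, for example cabling new B fully before pulling old A, makes `mixed_generations` return true for that step, which is the situation the plan is designed to avoid.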

Final words of advice - make sure you have a local path to management and the gateways and the GUI tool installed on whatever system in advance and fully functional. For me that meant a laptop in the data center on the same switch, and I downloaded the GUI from the Management Webpage and...it did not work. GUI opened but touching the objects crashed on multiple laptops. Must be something missing or corrupt in the file the Management GUI provided.
After the cursing and panic as my window neared its end, I grabbed the Smart Dashboard I had downloaded from checkpoint.com directly, put that on the maintenance laptop, and I was good. So think out all your physical and logical steps first and test everything you can in advance. And make sure you understand any topology changes - new interface names, etc. If names are changing, label the cables with the new names in advance - one less thing to worry about when the tension is high. Have a local serial console on all devices in case something goes wrong, and so you can safely cpstop the disconnected devices a few days later once you are sure things are stable.
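
Part of this advice, confirming a working local path to management and the gateways, can be scripted on the maintenance laptop ahead of time. This is a minimal sketch using only the Python standard library; the addresses are documentation-range placeholders and the ports (SSH and HTTPS) are assumptions to be replaced with whatever your environment actually uses.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(targets, probe=tcp_reachable):
    """Return the (name, host, port) targets that could NOT be reached."""
    return [t for t in targets if not probe(t[1], t[2])]


# Hypothetical targets; substitute real addresses and ports.
TARGETS = [
    ("management SSH", "192.0.2.10", 22),
    ("gateway B web UI", "192.0.2.2", 443),
]

if __name__ == "__main__":
    for name, host, port in preflight(TARGETS):
        print(f"NOT reachable: {name} ({host}:{port})")
```

A check like this only proves the network path. It would not have caught the GUI crash described above, so opening the GUI and editing an object on the maintenance laptop in advance is still needed.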

 

shrestha
Participant

Thanks for the detailed answer, Jbeckner. I labbed this up with two appliances with different CPU counts, and the new one comes up in the Ready state (similar to when a version upgrade is done).

I am still not entirely clear about the cluster ID part, as it seems to be pushed from the SMS when I checked with the cphaprob mmagic command. I will read it through again.

_Val_
Admin

"Ready" state means the new cluster member cannot sync. This is a normal situation for any cluster where members' CoreXL settings are not exactly the same. The older cluster member, however, should show Active status. When you stop it, your new cluster members will go to Active/Standby, assuming their CoreXL settings are identical.
