Upgrade, Maestro, R80.20SP -> R81.10

vinceneil666 · ‎2021-09-23

Hi,

Has anyone completed this upgrade successfully yet?

We made one attempt yesterday and ended up having to revert, even though we followed the guide with all the correct patches (Take 315, the upgrade script, etc.).

Has anyone done this successfully and has some tips and tricks to share?

We pretty much got stuck while trying to fetch the policy.

RamGuy239 · ‎2021-09-24

To summarize our attempt:

2x MHO-140 Orchestrators, running R80.20SP Take 295
4x CPAP-SG6500 appliances in Security Group 1, running R80.20SP Take 295 + Memleak Portfix.

Started with upgrading the MHO-140s to R80.20SP Take 317, then upgraded them to R81.10. Went flawlessly, no issues whatsoever.

Then we moved on to the gateways. There are four in total (1_01, 1_02, 1_03 and 1_04), and we decided to start with half of them. We ran clusterXL down on 1_01 and 1_02, removed the memleak portfix, rebooted, installed Take 315, rebooted, installed the R80.20SP upgrade hotfix and rebooted again. Then we upgraded the deployment agent and imported the R81.10SP upgrade package.

All was fine thus far, and the upgrade itself went well. But upon boot after the upgrade, 1_01 and 1_02 tossed 1_03 and 1_04 into ready state and all traffic was lost. It fixed itself after a few minutes. 1_01 and 1_02 were in down state, as expected.

We changed the object in Smart Console from R80.20SP to R81.10, ran the mgmt_cli command and then ran the sp_upgrade script, and this is when we started having issues. The script was unable to fetch policy from the management. It provides no information on what it's doing, or where and why it might be failing, which makes it very frustrating and difficult to troubleshoot.

After several attempts at getting the policy installed on the upgraded gateways we had to roll-back using snapshots.

Afterwards we noticed the existence of sk174844. It almost seems like this SK is mandatory? It claims to be relevant for R80.20SP, R80.30SP and R81SP. In other words, it's mandatory regardless of the environment? We find it strange that there are no references to this SK in the R81.10 Maestro admin guide or in the original SK regarding upgrades (sk173363). The timestamp for sk174844 is 2021-08-03, meaning it was created weeks before the hotfix for R80.20SP, R80.30SP and R81SP that is required for doing an upgrade in the first place.

This makes it quite strange that there are no references to sk174844 in the admin guide or in sk173363. It seems our upgrade was doomed to fail from the get-go: we had no information about sk174844, so the fetch was never going to work?

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME

Tal_Ben_Avraham · ‎2021-09-29

Hi @vinceneil666 / @RamGuy239 , we had many successful upgrades.

Indeed, you are correct and sk174844 should be mentioned in the admin guide. If it's not, that's a bug in our documentation.

Is that the case in your environment, i.e. is a data link used to manage the Security Gateway ("The Management Server that manages this Security Group is connected to the Maestro Orchestrator through an Uplink port.")?

We also have a few customers who followed this procedure successfully and are already running R81.10 in production.

RamGuy239 · ‎2021-09-29

Hi, @Tal_Ben_Avraham.

I suppose the management traffic is going over an uplink port, as the SGMs are reaching the management using its public IP. There is a magg interface and the management is reachable in the magg subnet, but this is not the main IP of the management server, so the sp_upgrade script is using the wrong IP for fetching.

We did try to install the hotfix from sk174844, but it was still unable to fetch using the public IP of the management server. We had to manually edit the sp_upgrade script and override the address with the secondary IP of the management server, the one on the magg subnet. Then it worked.
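To illustrate the selection problem described above (not the logic of the sp_upgrade script itself, which is not shown in this thread), here is a minimal Python sketch that picks whichever management address falls inside the magg subnet. The function name and all addresses are hypothetical; the addresses come from the documentation ranges.

```python
import ipaddress


def pick_fetch_address(candidate_ips, magg_subnet):
    """Return the first candidate address inside the magg subnet, or None."""
    network = ipaddress.ip_network(magg_subnet)
    for ip in candidate_ips:
        if ipaddress.ip_address(ip) in network:
            return ip
    return None


# Hypothetical values: the "main" (public) IP first, the magg-side IP second.
mgmt_ips = ["203.0.113.10", "192.0.2.10"]
print(pick_fetch_address(mgmt_ips, "192.0.2.0/24"))
```

If no candidate lies in the magg subnet the function returns None, which corresponds to the case where the fetch has to work over the uplink instead.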

The upgrade itself was rather messy. There was a lot of unstable traffic during the process. But once all members were running R81.10 + Take 9 it became more stable, and everything is looking good now.

The admin guide doesn't really say much about what instabilities to expect when some members are already running R81.10 while the rest have yet to be upgraded.

RamGuy239 · ‎2021-09-30

After completing the upgrade I have a few questions. We've been told that best practice for Maestro is to have all the management traffic use the magg interface. My question is: how are we intended to design this?

This particular customer has everything ready. Their management server, their dedicated log server and their dedicated smart event server all have a secondary interface within the magg subnet. Maestro is fully capable of reaching all three servers using its magg interface and subnet. But it won't. And the reason why it won't is quite simple. The Main IP in Smart Console for the management server, the log server and the smart event server is using the IP of eth0 and not eth1. eth0 on all three contains the public IP, eth1 contains the magg IP.

As the customer is managing a ton of firewalls using this management they can't have the Main IP being the magg interface as that would result in all their other managed gateways failing and not being able to reach the management server, log server or the smart event server. We also know from experience with another customer that we can't simply remove eth0 and have the magg IP be the only IP and use 1:1 NAT for the public IP. As a result of the correction layer and whatnot, we have been told that the only management traffic that should be using the magg subnet is the management traffic of the Maestro deployment, not the management traffic from other managed gateways.

In other words, we need to have two separate interfaces and subnets on the management installations? Correct?

How are we supposed to ensure that Maestro is using the magg subnet while all others are using the public subnet? I have various tricks up my sleeve that might solve it, but none of them is very sexy. We could jump into GuiDBedit and tell policy push not to override $FWDIR/conf/masters, and then override it on the Maestro gateways so they point directly to the correct IPs within the magg subnet via the masters file. I'm not entirely sure this is supported on Maestro, but it should work. It wouldn't help the sp_upgrade script, though, as that one fetches from the registry and the registry would still contain the public IP?

Another solution would be to follow the old logic from R77.xx by creating dummy objects. We could create a dummy mgmt object. One for the management server and one for the log server and change the settings on the Maestro object in Smart Console pointing it towards the dummy objects containing the correct magg IP addresses instead of using the actual objects containing the public IP. But I've been told that dummy objects should not be used on R8X.XX as it gives many cosmetic errors in Smart Console so this solution isn't very sexy either.

It would be interesting to know how we could achieve a working environment where Maestro is using the magg subnet, whereas all the rest is using the public subnet. If this was achieved and fully working before starting the upgrade it would have worked correctly from the get-go.

We would also like to know why we couldn't fetch using the public IP via uplink even after applying the hotfix from sk174844. It didn't seem to change anything for us. I have another Maestro environment that I'm going to upgrade in a few weeks and this solution doesn't have the management available via magg at all so I couldn't do the same workaround by editing the sp_upgrade script. I would need to have the fetch working via uplink for it all to work. Or else we would need to redesign before doing the upgrade to make sure that I can reach the management via the magg subnet before starting the upgrade.

Last but not least: why did we experience so much instability during the upgrade process? Once 1_01 and 1_02 were running R81.10 and were able to fetch policy and the failover happened, things weren't stable at all. It wasn't until the remaining members were also running R81.10 that it stabilised. Is this to be expected? Neither the admin guide nor sk173363 gives us any pointers on whether we should expect things to be unstable after the failover. The sp_upgrade script even asks us if we want to stop here and continue later. This gives us the impression that things should be stable after the failover, so we could do some connectivity testing before upgrading the remaining members. That was out of the question for us, as things weren't stable, and looking at the traffic logs it seemed like sessions were still heading through 1_03 and 1_04 even though they were still running R80.20SP and had been put in down state by the upgrade script. As soon as the remaining members booted up on R81.10 it got much better.

Simon_Macpherso · ‎2022-07-26

How have your other upgrade experiences been?

Was it better when upgrading an environment that doesn't have a magg bond and uses a bonded uplink for SMO management?

Apparently there is no requirement to make use of a dedicated management interface and you can utilize the bonded external interface to manage the Security Group. The management interface is required when running the Maestro in VSX mode. But in gateway mode any interface can be used as management. There still doesn't appear to be much documentation about this.

Tal_Ben_Avraham · ‎2021-10-03

Hi @RamGuy239 ,

There may be traffic impact, as connections won't survive fail-over between versions.

Other than that there shouldn't be an impact.

In future versions we will also support connection fail-over, and it will be possible to upgrade member by member.

Tal_Ben_Avraham · ‎2021-10-03

As for upgrade scenario.

As this configuration is quite unique we will discuss offline.

Baasanjargal_Ts · ‎2021-12-23

Hello.

I am trying to install the latest Deployment Agent on an R80.30SP Security Group. According to the admin guide, the command below should be used, but it gives an 'Invalid command' error. Is there another command, or what is wrong here?

# update_sp_da /<Full Path>/<Name of CPUSE Deployment Agent Package>

ndoey · ‎2023-06-19

We are in the same boat today with this.

frankcar · ‎2023-06-21

update_sp_da comes with the R80.30SP jumbo take. Once that's installed, you can use the update_sp_da command successfully.

RamGuy239 · ‎2022-01-11

Hi, @Tal_Ben_Avraham.

As you know, I'm quite experienced with upgrading Maestro at this point. I just have a few questions regarding Maestro VSX, as this will be my first attempt at upgrading an environment running R80.20SP with MHO-140 and CPAP-SG6500 where VSX is running on top of Maestro.

I have plenty of experience with both Maestro and VSX. But not so much when they are running in combination with each other. Reading through the R81.10 installation and upgrade guide, the Maestro R81.10 Administration Guide and the Quantum Scalable Platforms VSX R81.10 Administration Guide doesn't give me much.

None of them mentions anything specific about upgrading VSX to R81.10. I suspect this means that the upgrade itself is pretty much the exact same procedure? I know the sp_upgrade script asks if the environment is VSX, so I guess the script will just take this into account and do everything it has to do, making it "business as usual" in terms of an R80.20SP -> R81.10 Maestro upgrade?

In a regular VSX environment, I would normally go for a clean install. As the VSX configuration lives on the management this makes perfect sense for regular VSX deployments. Yet again the various administration guides do not tell me much about how this would work on Maestro?

What would happen if I run a vsx_util upgrade, remove all members from the security group, do a clean install using ISO/USB, add them back into the security group and run a vsx_util reconfigure? I can't see why this shouldn't work, but again it's Maestro so I've learned to not expect things to behave the same way as on regular deployments.

Trevor_Bruss · ‎2021-12-13

I realize this thread is a little old, but I've been struggling with an upgrade myself. I started with the Orchestrators, which had zero issues. Not much to them in either case. It's when I got to the gateways themselves that I got stuck. The documentation mentions Take 315 and a hotfix that goes with it. Problem is, I'm on the latest general Take, which is 317. So does that mean I have to downgrade my install? The reason I wonder is that I can't install the hotfix, clearly because my Take is not the one it was developed on. So, skipping that step, I followed the steps to update the Deployment Agent. That didn't go great initially. It behaved as if one of the gateways didn't get the update. Running show installer status build failed for a while, and then the next day when I ran show installer status build, all the members showed the correct build.

Moving on to the actual upgrade package, I get errors trying to import it on my security group: [ERROR] Failed to retrieve package information. I even noticed that part way through this migration process they've come out with a newer take of this update. No difference, however. The import always fails.

I haven't even gotten to the problems others mentioned in this thread. Looks like I'll need to open a case just to get to the bottom of this. I sure hope upgrading these is worth the headache.

emmap · ‎2021-12-13

Downgrading the JHF and installing the upgrade hotfix is a mandatory step.

Tal_Ben_Avraham · ‎2021-12-13

Indeed, the upgrade HF is mandatory (and the new GA package should be used). Do you have an issue downgrading the JHF? If you have a dual-site environment, downgrading may not be simple (as uninstalling the R80.20SP JHF removes the dual-site support), but if you don't, downgrading shouldn't be problematic.

In any case, a JHF containing all those fixes is expected to be released this month if you would like to avoid downgrading JHF.

Other than that, if any issue is encountered please open a support ticket.

Trevor_Bruss · ‎2021-12-14

No issues downgrading. It's just going to take a little longer and will require multiple reboots. I've been a little shy in the past because we've had issues with upgrades/reboots on these units where they don't come back. Thanks!

Trevor_Bruss · ‎2021-12-14

Downgrade went well. I was able to get all the way to installing the upgrade on one member of the security group using the newer Take 358 package. However, I'm failing during the sp_upgrade portion when it tries to fetch policy. The command I ran on the management server, as instructed by sp_upgrade, started to load the policy from what I could see in SmartUpdate, but it failed to install the policy with the generic error: Policy installation had failed due to an internal error.

It appears that I can't perform a clean sp_upgrade --revert without pulling the unit from the security group, so I'm presently stuck with a security group running with only one active member. As for the support case I originally opened, the engineer pointed me to an SK for installing the deployment agent on the normal versions of the OS, and that SK specifically stated not to run it on Maestro. Needless to say, I did not follow his advice.

I'm at a loss as to why the mgmt_cli command is failing with the generic error, and it may be that sp_upgrade fails at the policy fetch step because of that failure. Who could say?! Just frustrated. I've never had much luck with upgrading these Maestro units. It's always been a struggle for whatever reason.

Tal_Ben_Avraham · ‎2021-12-14

@Trevor_Bruss - mgmt operations shouldn't fail. I will contact you to figure this out.

frankcar · ‎2023-06-21

We had a similar issue when importing packages ("Failed to retrieve package information.") until we licensed our gateways in Maestro. Now the message has cleared and the import works fine.

frankcar · ‎2023-10-24

Hi, I found you get this if you don't copy the files to all SG members before the import, because importing on one member of the SG creates some txt file.

So I normally use asg_cp2blades to put the files on all members before using the import commands, as below:

asg_cp2blades filename and then /var/tmp to send to all members. Then you don't get the "Failed to retrieve package information" error.

Sorry if you already know this.

Simon_Macpherso · ‎2022-07-26

Hello all,

I also ran into this today while upgrading the following environment:

2 x MHO 140 R80.20SP take 327
2 x SGM 6400 R80.30SP take 97

We are using image take 338 for deployment, which per sk176388 includes the fix for the policy fetch issue.

Instead of upgrading both MHOs and then moving on to the SGMs, we are following TAC advice to upgrade one MHO and one SGM at a time.

The first MHO upgraded successfully to R81.10 with the latest GA take 66.

The first SGM upgraded to R81.10 fine, but ran into the same policy fetch issue when running through the sp_upgrade procedure.

Similar to @RamGuy239 above, we also have an uplink port for internet traffic and a magg bond (magg0) for management traffic, which is the main IP of the SMO in Smart Console.

I haven't rolled back the MHO and gateway at this stage - we still have a TAC case open.

@Anatoly have you observed this before, and are you aware of a fix?

Regards,

Simon

Tal_Ben_Avraham · ‎2022-07-26

Hi Simon,

It is recommended to do the upgrade in 2 phases (and possibly 2 windows):

1. MHOs upgrade.

2. SGMs upgrade

If you are currently running with 2 MHOs on 2 different versions active, I would highly recommend getting both upgraded to R81.10 ASAP.

Regarding the policy fetch error: is it possible that your Security Group is connected to the management through an LACP link? If that is the case, it explains the failure, as this is a limitation (see sk178045). To solve this you can contact support and ask them to copy the policy files manually (if they aren't aware of such a procedure, please refer them to sk178045; they should have the info on how to do so there).

Thanks,

Tal

Simon_Macpherso · ‎2022-07-26

Hi @Tal_Ben_Avraham

Thanks for your response.

Odd that you've stated the recommended upgrade method is 2 phases, starting with both MHOs.

This conflicts with what the TAC recommended after they checked internally.

I agree that running the MHOs on 2 different versions is not good. We aren't experiencing any observable traffic issues this morning; however, I intend to follow your suggestion re the 2 phases and will be upgrading the other MHO today.

Re the policy fetch error, the setting on the magg1 bond is XOR not 802.3ad (see attached).

Regards,

Simon

Simon_Macpherso · ‎2022-07-26

Odd, in the web UI it shows magg1 configured as XOR.

But if I run cat /proc/net/bonding/magg1 on the SMO, the bonding mode is LACP.

I wonder if there is a requirement to configure this via the CLI.

Our external switches only support LACP and PAgP, so we'll need to get TAC to provide the procedure to copy the policy files manually.
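For anyone who wants to check this across several members without eyeballing the output, here is a minimal Python sketch that reads the "Bonding Mode:" line from the text of /proc/net/bonding/<bond>. The helper names are hypothetical, and the sample strings are the mode descriptions the Linux bonding driver normally prints for 802.3ad and XOR. Whether Python is usable on a given member is an assumption, so the text can equally be copied off the box and parsed elsewhere.

```python
import re


def bonding_mode(proc_text):
    """Extract the value of the 'Bonding Mode:' line, or None if absent."""
    match = re.search(r"^Bonding Mode:\s*(.+)$", proc_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def is_lacp(proc_text):
    """True when the reported bonding mode is 802.3ad (LACP)."""
    mode = bonding_mode(proc_text)
    return mode is not None and "802.3ad" in mode


# Sample excerpts as the bonding driver typically reports them.
SAMPLE_LACP = "Bonding Mode: IEEE 802.3ad Dynamic link aggregation\nMII Status: up\n"
SAMPLE_XOR = "Bonding Mode: load balancing (xor)\nMII Status: up\n"

for name, text in (("lacp sample", SAMPLE_LACP), ("xor sample", SAMPLE_XOR)):
    print(name, "->", bonding_mode(text), "| LACP:", is_lacp(text))
```

On a live system the input would be the contents of /proc/net/bonding/magg1 rather than the sample strings.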

Simon_Macpherso · ‎2022-07-27

Hi @Tal_Ben_Avraham

Is it also possible that we need to upgrade SGM_1 first, i.e. the master?

When changing the version of the Smart Console object to R81.10 and publishing, it seems to retain version R80.30SP and displays a yellow exclamation mark next to the version. When I open the SMO object, it displays a message saying 'new version detected. Changing R81.10 to R80.30SP'.

Could it be pulling this info from SGM_1 (master), as that SGM is still on R80.30SP? SGM_2 was upgraded.

Regards,

Simon

Simon_Macpherso · ‎2022-07-27

So the policy.info file in $FWDIR/state/<gw_name>/ actually shows 6.0.5.2, which is compiled for R81.10. It just isn't reflected in the SMO object version in Smart Console.

We also discovered there is an issue with the pre-upgrade verifier script. The script is supposed to detect whether the mode on the bond that is used to route to the management server is LACP. If it detects that it is, it reports a failure. Its bash logic did not detect the mode as expected, and in this case it reported that LACP wasn't enabled. R&D have acknowledged the issue and will provide a fix.
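The first half of that check, finding which interface is used to route to the management server, can be sketched roughly as follows. This is not the verifier's actual logic (which is not shown in this thread); it assumes the output of the standard iproute2 command `ip route get <management IP>` is available as text, and the function name and sample line are hypothetical, using documentation addresses.

```python
import re


def egress_device(route_output):
    """Return the interface named after 'dev' in `ip route get` output, or None."""
    match = re.search(r"\bdev\s+(\S+)", route_output)
    return match.group(1) if match else None


# Hypothetical sample of what `ip route get 192.0.2.10` might print.
SAMPLE = "192.0.2.10 dev magg1 src 192.0.2.21 uid 0\n    cache\n"
print(egress_device(SAMPLE))
```

The returned name would then be the bond whose mode needs to be inspected.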

RamGuy239 · ‎2023-06-23

Luckily, the most challenging part about Maestro upgrades is getting from R80.20SP or R80.30SP to R81.10+. From there, Maestro is no longer this awkward thing with all its strange limitations. I have upgraded multiple Maestro installations from R80.20SP to R81.10, and I did my two first upgrades before the initial R81.10 Maestro upgrade images got pulled. That was an experience, and I had to edit the upgrade script to get the upgrade completed manually.

I have done one upgrade from R81.10 to R81.20, and it was pretty much identical to regular upgrades, besides the fact that you still have to utilise the sp_upgrade script. But there are no more hoops, and it utilises multi-version clustering etc. It makes it a much better experience.

I don't think XOR is a requirement anymore. I recently changed an R81.10 Single-Site Maestro VSX deployment into an R81.20 Dual-Site Maestro VSX environment, and we changed magg from XOR to LACP. We also created LACP for external SYNC as well. It didn't give us any issues, and the environment has been running fine since the change.

One issue I ran into with the change from Single-Site to Dual-Site was related to the mix-match of appliances. Site 1 is running 3x CPAP-SG6600 appliances, while Site 2 is running 2x CPAP-SG6500 appliances, so I had to disable CoreXL Split / CoreXL Dynamic Balancing. It's supposed to work and be supported on R81.20 Maestro, but Site 2 would be stuck in down state unless I disabled this feature. This is most likely an issue with running mix-match, with CoreXL not enjoying the fact that the appliances on each site have different CPUs and numbers of cores.

AmitShmuel · ‎2023-06-28

Hi,

Dynamic Balancing should work in this mode.

If this wasn't resolved yet, feel free to contact me offline at amitshm@checkpoint.com

Thanks,
Amit

Dario_Perez · ‎2023-06-28

Dynamic Balancing for Maestro is not supported in R81.10; support starts with R81.20.

RamGuy239 · ‎2023-06-28

This is a mix-match Maestro VSX running R81.20.
