vinceneil666
Advisor

Upgrade, Maestro, R80.20SP -> R81.10

Hi,

Has anyone completed this upgrade successfully yet?

We made one attempt yesterday and ended up having to revert. We followed the guide with all the correct patches (the .315, the upgrade script, etc.).

Has anyone done this successfully and can share some tips and tricks?

 

We pretty much stalled while trying to fetch the policy.

6 Replies
RamGuy239
Advisor

To summarize our attempt:

2x MHO-140 Orchestrators, running R80.20SP Take 295
4x CPAP-SG6500 appliances in Security Group 1, running R80.20SP Take 295 + Memleak Portfix.

We started by upgrading the MHO-140s to R80.20SP Take 317, then upgraded them to R81.10. That went flawlessly, with no issues whatsoever.


Then we moved to the gateways. There are four in total (1_01, 1_02, 1_03 and 1_04), and we decided to start with half of them. We ran clusterXL down on 1_01 and 1_02, removed the memleak portfix, rebooted, installed Take 315, rebooted, installed the R80.20SP upgrade hotfix, and rebooted again. We then upgraded the deployment agent and imported the R81.10SP upgrade package.

All was fine thus far, and the upgrade itself went well. But upon booting after the upgrade, 1_01 and 1_02 tossed 1_03 and 1_04 into ready state, and all traffic was lost. It fixed itself after a few minutes. 1_01 and 1_02 were in down state, as expected.


We changed the object in Smart Console from R80.20SP to R81.10, ran the mgmt_cli command, and then ran the sp_upgrade script, which is when we started having issues. The script was unable to fetch policy from the management server. It provides no information about what it is doing, or where and why it might be failing, which makes it very frustrating and difficult to troubleshoot.

 

After several attempts at getting the policy installed on the upgraded gateways, we had to roll back using snapshots.

 

Afterwards we noticed the existence of sk174844. It almost seems like this SK is mandatory? It claims to be relevant for R80.20SP, R80.30SP and R81SP. In other words, is it mandatory regardless of the environment? We find it strange that there are no references to this SK in the R81.10 Maestro admin guide or in the original SK regarding upgrades (sk173363). The timestamp on sk174844 is 2021-08-03, meaning it was created weeks before the hotfix for R80.20SP, R80.30SP and R81SP that is required for doing an upgrade in the first place.

This makes it quite strange that neither the admin guide nor sk173363 references sk174844. It seems our upgrade was doomed to fail from the get-go: we had no information about sk174844, so the fetch was never going to work?

Tal_Ben_Avraham
Employee

Hi @vinceneil666 / @RamGuy239, we have had many successful upgrades.

You are correct, sk174844 should be mentioned in the admin guide. If it's not, that's a bug in our documentation.

Does the scenario from that SK apply to your environment, i.e. is a data link used to manage the Security GW ("The Management Server that manages this Security Group is connected to the Maestro Orchestrator through an Uplink port.")?

We also have a few customers who followed this procedure successfully and are already running R81.10 in production.

RamGuy239
Advisor

Hi, @Tal_Ben_Avraham.

I suppose the management traffic is going over an uplink port, as the SGMs reach the management server using its public IP. There is a magg interface, and the management server is reachable in the magg subnet, but that is not the main IP of the management server, so the sp_upgrade script uses the wrong IP for fetching.

We did try to install the hotfix from sk174844, but the script was still unable to fetch using the public IP of the management server. We had to manually edit the sp_upgrade script, overriding the address so that it used the secondary IP of the management server (the one on the magg subnet), and then it worked.

The upgrade itself was rather messy, with a lot of unstable traffic during the process. But once all members were running R81.10 + Take 9 it became more stable, and everything is looking good now.

The admin guide doesn't really say much about what instabilities to expect while some members are already running R81.10 and the rest have yet to be upgraded.
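To illustrate the address choice we ended up making by hand, here is a minimal Python sketch. It is not part of sp_upgrade and not Check Point tooling; the function name `pick_fetch_address` and all addresses (taken from the documentation ranges) are hypothetical. It only shows the selection logic: out of the management server's known IPs, pick the one that sits inside the magg subnet.

```python
import ipaddress


def pick_fetch_address(candidates, magg_subnet):
    """Return the first candidate IP that lies inside magg_subnet, or None."""
    network = ipaddress.ip_network(magg_subnet, strict=False)
    for candidate in candidates:
        if ipaddress.ip_address(candidate) in network:
            return candidate
    return None


# Hypothetical values: the "public" main IP and the secondary magg-side IP.
management_ips = ["203.0.113.10", "192.0.2.10"]
magg_subnet = "192.0.2.0/24"

print(pick_fetch_address(management_ips, magg_subnet))
```

With these example values the secondary address on the magg subnet is selected; if no candidate is inside the subnet the function returns `None`, which corresponds to an environment where the management server is only reachable via the uplink.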

RamGuy239
Advisor

After completing the upgrade I have a few questions. We've been told that best practice for Maestro is to have all the management traffic use the magg interface. My question is: how are we intended to design this?

This particular customer has everything ready. Their management server, their dedicated log server and their dedicated smart event server all have a secondary interface within the magg subnet. Maestro is fully capable of reaching all three servers using its magg interface and subnet. But it won't. And the reason why it won't is quite simple. The Main IP in Smart Console for the management server, the log server and the smart event server is using the IP of eth0 and not eth1. eth0 on all three contains the public IP, eth1 contains the magg IP.

As the customer is managing a ton of firewalls using this management they can't have the Main IP being the magg interface as that would result in all their other managed gateways failing and not being able to reach the management server, log server or the smart event server. We also know from experience with another customer that we can't simply remove eth0 and have the magg IP be the only IP and use 1:1 NAT for the public IP. As a result of the correction layer and whatnot, we have been told that the only management traffic that should be using the magg subnet is the management traffic of the Maestro deployment, not the management traffic from other managed gateways.

In other words, we need to have two separate interfaces and subnets on the management installations? Correct?

How are we supposed to ensure that Maestro is using the magg subnet while all others are using the public subnet? I have various tricks up my sleeve that might solve it, but none of them is very sexy. We could jump into GuiDBedit and tell policy push not to override $FWDIR/conf/masters, then override that file on the Maestro gateways so they point directly to the correct IPs within the magg subnet. I'm not entirely sure whether this is supported on Maestro. It should work, but it wouldn't help the sp_upgrade script, as that one fetches from the registry, and the registry would still contain the public IP?

Another solution would be to follow the old logic from R77.xx by creating dummy objects. We could create dummy mgmt objects, one for the management server and one for the log server, and change the settings on the Maestro object in Smart Console to point at the dummy objects containing the correct magg IP addresses instead of the actual objects containing the public IP. But I've been told that dummy objects should not be used on R8X.XX, as they cause many cosmetic errors in Smart Console, so this solution isn't very sexy either.

It would be interesting to know how we could achieve a working environment where Maestro is using the magg subnet, whereas all the rest is using the public subnet. If this was achieved and fully working before starting the upgrade it would have worked correctly from the get-go.


We would also like to know why we couldn't fetch using the public IP via the uplink even after applying the hotfix from sk174844. It didn't seem to change anything for us. I have another Maestro environment that I'm going to upgrade in a few weeks, and in that environment the management server isn't reachable via magg at all, so I couldn't apply the same workaround of editing the sp_upgrade script. I would need the fetch to work via the uplink for it all to work. Otherwise we would need to redesign before the upgrade so that the management server is reachable via the magg subnet.

Last but not least: why did we experience so much instability during the upgrade process? Once 1_01 and 1_02 were running R81.10, were able to fetch policy, and the failover had happened, things weren't stable at all. It wasn't until the remaining members were also running R81.10 that it stabilised. Is this to be expected? Neither the admin guide nor sk173363 gives us any pointers on whether we should expect things to be unstable after the failover. The sp_upgrade script even asks whether we want to stop here and continue later. This gives the impression that things should be stable after the failover, so that we could do some connectivity testing before upgrading the remaining members. That was out of the question for us: things weren't stable, and looking at the traffic logs it seemed like sessions were still heading through 1_03 and 1_04 even though they were still running R80.20SP and had been put in down state by the upgrade script. As soon as the remaining members booted up on R81.10 it got much better.
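Before the next attempt, a simple pre-check of which management addresses are reachable at all could save a rollback. Below is a minimal sketch in plain Python, assuming Python is available on the machine you test from; it is not Check Point tooling. The function names are hypothetical, and the port must be supplied by you: TCP 18191 (CPD) is my assumption for policy fetch and should be verified against Check Point's port documentation. A successful TCP connect only shows that the port is reachable along that path, not that the fetch itself will succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def reachable_addresses(candidates, port, timeout=3.0):
    """Return the candidate addresses that accept a TCP connection on port."""
    return [host for host in candidates if can_connect(host, port, timeout)]
```

For example, `reachable_addresses(["203.0.113.10", "192.0.2.10"], 18191)` (hypothetical public and magg-side addresses, assumed port) would return only the addresses that answered, which tells you up front whether a fetch via the uplink has any chance of working.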

Tal_Ben_Avraham
Employee

Hi @RamGuy239 ,

There may be traffic impact, as connections won't survive failover between versions.

Other than that, there shouldn't be any impact.

In future versions we will also support connection failover, which will make it possible to upgrade member by member.

 

Tal_Ben_Avraham
Employee

As for the upgrade scenario: as this configuration is quite unique, we will discuss it offline.
