80.40 upgrade findings

Tommy_Forrest · 2020-08-21

Howdy everyone.

Past few weeks I've been prepping for this weekend's 80.40 upgrade.

Upgrading 2 MDS servers (3150's) from 80.30 to 80.40. Taking our 5150 SmartEvent and 5150 Log server along for the ride.

I've completed two full upgrade tests and a partial (primary MDS only) test in VMware Workstation.

Here's how it works:

Back up the primary and backup MDS servers (mds_backup -l).
Load up 80.30 JHF Take 155 in a VM.
Import the backups (SCP gtar, gzip, mds_restore and the backup file).
Test the upgrade.

Here are some findings that I want to get out there into the Googleplex in case anyone else runs into these issues.

With 80.40, you've got to install the Upgrade Tools package wrapper that matches the version of Gaia you want to upgrade to. See SK135172 for more details. MAKE SURE YOU USE INTERNET EXPLORER TO DOWNLOAD THIS FILE. I yelled about that requirement, but I got my rear end kicked until I followed it. I'd been using Chrome and Firefox and the file just would not work. TAC came back with the IE requirement.

After getting it downloaded correctly, I still struggled with this on my lab primary MDS. The installer kept reporting that a newer version was already installed, even though none had been installed.

To solve this issue we had to reset the database (got this from TAC):

cpprod_util CPPROD_SetValue CPupgrade-tools-R80.40 BuildNumber 1 1 1

After issuing that command in BASH, I was able to successfully import the upgrade tools (I had this issue repeatedly on each test pass).

Also make sure you're using the latest GA Deployment Agent.

The only other issue I ran into in the lab was a validation error after the upgrade completed and I started up SmartDashboard.

The error was related to the value Network defined by routes -> Set default update time value (seconds). In my production 80.30 environment that value is 0. 80.40 doesn't like that one bit.

To fix it (I strongly recommend doing this BEFORE upgrading to 80.40 - you must make sure your assignments are up to date prior to the upgrader being run):

From your MDS console click on Preferences -> Network defined by routes -> Set default update time value (seconds) to something that is not zero.

Publish your changes.

Open your global CMA and click Manage & Settings -> Preferences -> Network defined by routes -> Set default update time value (seconds) to something that is not zero.

Publish your changes.

Go back to your MDS console -> Global Assignments

Assign Drop down -> Reassign

This worked like a champ on my 3rd pass of testing the upgrade.

Hopefully there are no weird gotchas, like the bad optical NIC drivers in 80.30 that bit me, and we have smooth sailing come Saturday's production upgrade time!

Tal_Paz-Fridman · 2020-09-02

Update to the forum:

The issue encountered with the installation of Upgrade Tools is indeed an unfortunate one and we apologize for that.

Using mds_restore overrides registry values of the installed version. As a result, the Upgrade Tools installer believes the most up-to-date version was already installed.

This is something we are aware of and have an internal task for. It should be solved in an upcoming JHF.

When using an online installation (where the machine is connected to the Internet) the most up-to-date Upgrade Tools are downloaded and installed automatically.

Regarding the browser issue, there is no known limitation that we are aware of, and the instruction to use Internet Explorer should have been offered as a suggestion.

If for some reason using a specific browser fails to work, it should be investigated separately.

Tommy_Forrest · 2020-09-02

Thanks for the reply.

When it came time to upgrade production, I found the tools had already been updated, which was nice.

But something caused the import of the database to fail and we determined that there was not enough disk space marked free for the upgrade to continue. TAC + R&D have been looking at it.

It looks like I'll have to completely rebuild the box this next go around unless they can confirm a solution to free up more disk space.

I will say, the system did a FANTASTIC job of reverting the machine back. I had to step away for a few minutes and during that time the system failed the upgrade and reverted back. When I came back, I was scratching my head wondering if I had just had a dream where I'd started the upgrade. Investigation would soon reveal I had.

Tal_Paz-Fridman · 2020-09-02

Thank you for the update.

Our various tools should have warned if there wasn't enough disk space.

I'll take it up with the relevant owners.

Best wishes

Tal

Tommy_Forrest · 2020-09-02

There was plenty of space in /boot and /var/log. There wasn't enough space marked free.

The tools didn't pick up on that. Unfortunately. Though I'm not entirely sure that was the ultimate problem. In our case it is a problem because we don't have enough free space to snapshot the box.

I've mentioned this to that group, but I'll mention it here.

Apparently, per TAC, it is Check Point's recommendation to reserve enough space for 2 snapshots to be taken.

I'd like to see a world where the disk partitioner that comes up at Gaia install time is smart enough to figure out whether you're about to step on your own toes by over-allocating space to the required file systems, leaving too little free space to meet the recommendation for 2 snaps.

So, for example - you have 10TB usable. You tell the partitioner you want 8TB for logs and 1.5TB for everything else. That does not leave enough free space to meet the 2 snapshot best practice.

The partitioner should warn or refuse to partition the system with any space configuration that doesn't meet the $SYSTEM_SIZE*2 metric.
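To make the arithmetic concrete, here is a rough Bash sketch of the check described above. It is not a Check Point tool: the function names are made up for illustration, sizes are whole gigabytes with 1TB taken as 1000GB, and it assumes one snapshot needs roughly as much space as the non-log ("everything else") allocation, which is only a reading of the $SYSTEM_SIZE*2 idea.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not part of Gaia.
# All sizes are whole gigabytes.

# Space left unallocated after carving out logs and system partitions.
unallocated_gb() {
    local total=$1 logs=$2 system=$3
    echo $(( total - logs - system ))
}

# Returns 0 only if the unallocated space can hold two snapshots.
# Assumption: one snapshot is about the size of the system allocation.
meets_two_snapshot_rule() {
    local total=$1 logs=$2 system=$3
    local free required
    free=$(unallocated_gb "$total" "$logs" "$system")
    required=$(( system * 2 ))
    if (( free >= required )); then
        echo "OK: ${free}GB unallocated, ${required}GB needed for 2 snapshots"
        return 0
    fi
    echo "WARN: ${free}GB unallocated, ${required}GB needed for 2 snapshots"
    return 1
}

# The example above: 10TB usable, 8TB for logs, 1.5TB for everything else.
meets_two_snapshot_rule 10000 8000 1500 || true
```

With the numbers from the example, only 500GB stays unallocated against 3000GB wanted for two snapshots, so the layout would be flagged.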

Lior_Manor · 2020-09-02

Hi Tommy,

If you run verify on the system now, does it still fail for lack of space?

Thanks,

Lior

Tommy_Forrest · 2020-09-02

Hi Lior.

No, the verifier has never failed due to lack of space.

Now, to be fair, I did delete the partial snapshot that was taken during the failed upgrade. I've got 26GB free and need 50GB to take a snapshot (there are no stored snaps currently). I did not try to run the verifier with that failed snapshot in place.

I just ran the verifier again and it is permitting fresh and upgrade installs.

Lior_Manor · 2020-09-02

Thanks Tommy. From a disk space perspective, if the verify succeeds, you have enough space to safely upgrade.

