adina
Participant

R80.40 cluster Blink upgrade gone wrong

I performed an R80.40 upgrade on an R80.30 ClusterXL the other day using a Blink package, as the Major Versions package wasn’t available for download. I followed the steps in sk92449, but upon upgrading the first gateway I started noticing some issues. I began with the standby member, and after the upgrade it came back as Active, conflicting with the other gateway, which was also Active.
Is this standard behavior or have I missed something? 

Thanks

11 Replies
PhoneBoy
Admin

If you just did the CPUSE upgrade and didn’t take any additional steps, I can see how you’d run into what you did.
There are a few different things you can do prior to the upgrade to ensure a much smoother transition, depending on the level of downtime that is acceptable.
See: https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

adina
Participant

Thank you for your quick response. This was one of the articles I referred to when planning my upgrade. I can't see anything in there that suggests I need to take any additional steps to prevent an ‘Active - Active’ state. I have also tested this in a lab environment both before and after the change and couldn’t replicate the issue. 
I’m trying to work out if I have made a mistake as I am struggling to understand where I’ve made it, based on the documentation I have read and the test results.

Alex_Shpilman
Collaborator

Hi @adina ,

Did you perform an upgrade or fresh install using Blink?

Did you enable MVC?

Did you push policy for the upgraded member to obtain the interfaces, topology, clusterXL, etc?

To prevent active/active, I typically:
- shut down the upstream switch ports, leaving just the Sync and management ports up;
- perform the upgrade/fresh install in the desired mode;
- establish SIC (if required) and install the license (if required);
- push policy;
- enable MVC (if supported for that particular upgrade path);
- check the cluster status and re-enable the upstream switch ports if all looks healthy.
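As a rough CLI sketch of the gateway-side checks in that sequence (the switch-port shutdown is vendor-specific and done on the switch; the commands below are standard ClusterXL CLI, and `cphaconf mvc on` applies to R80.40 and later only):

```
# On the member being upgraded, after the upgrade/fresh install:
cphaprob stat          # cluster state as this member sees it
cphaprob -a if         # verify cluster interfaces, incl. Sync, are recognized

# After pushing policy from the management server:
cphaconf mvc on        # enable Multi-Version Cluster (R80.40+)
cphaprob mvc           # confirm MVC status

# If everything reports healthy, re-enable the upstream switch ports.
```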

 

Cheers.

 

adina
Participant

Hi @Alex_Shpilman,

I did an upgrade, not a clean install. Normally I would have used the Major Versions package, but for the gateways it wasn’t available for download. I considered downloading and importing it manually, but in the end I decided to go with the Blink image.

These are the steps that I did:
- Snapshot the appliance and export the snapshots to a secure external location.
- Take a backup of the Gaia configuration.
- Check for updates.
- Download the Blink image on the standby gateway (which is also the gateway with the lowest priority in the cluster).
- Verify the package.
- Start the upgrade.
- After the new version finished installing and the appliance rebooted, it came back as Active. I didn’t get the chance to push the policy or do anything else.
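The steps above roughly correspond to the following Gaia clish commands (a sketch; the snapshot name is made up here, and the CPUSE package number comes from the `show installer packages` output and will differ per box):

```
# Gaia clish, on the standby member:
add snapshot pre_r8040_upgrade desc "Before R80.40 Blink upgrade"
add backup local

show installer packages available-for-download
installer download <package-number>
installer verify <package-number>
installer install <package-number>   # reboots when done
```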

To fix it I just ran cphastop on the upgraded gateway, pushed the policy, enabled MVC, and continued with the upgrade.
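That recovery, as a command sketch (the policy push is normally done from SmartConsole; the `mgmt_cli` line is one possible equivalent, with the policy package and gateway name as placeholders):

```
# On the upgraded member that came up Active against the old member:
cphastop                 # take this member out of the cluster

# Push policy from the management server, e.g.:
# mgmt_cli install-policy policy-package "Standard" targets.1 "<gw-name>"

cphaconf mvc on          # let R80.30/R80.40 members work together
cphastart                # rejoin the cluster
cphaprob stat            # this member should no longer fight for Active
```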

In my lab I followed the same steps, but after the reboot the firewall came back as ‘Ready’ every single time I tried it.

the_rock
Authority

Hm... those steps do make sense, but here is what I have a problem with. The guide below outlines the exact steps (see especially the Zero Downtime upgrade section, page 133); I have done this many times and it has never failed.


https://dl3.checkpoint.com/paid/bf/bf5b38d9c193fca29b572bd4f77fa07e/CP_R80.10_Installation_and_Upgra...


Is that what you followed in your lab?

Andy

adina
Participant

Thanks for the response, Andy. I didn’t follow steps 1 or 2 in your referenced guide in either the lab or the production environment. However, I have never followed those steps and I have never had this problem before.

The rest of the procedure looks pretty much spot on and is how I would normally do this. As soon as I did step 3, the behaviour was not what I believe it should have been.

Alex_Shpilman
Collaborator

Hi @adina,

I believe your sequence followed the outlined MVC steps; however, the official documentation (sk92449) specifies an upgrade/clean install and does not mention Blink.

As a precaution, I usually shut down the data interfaces to prevent active/active, in case the upgraded member comes up with no ClusterXL configuration.

The fact that stopping ClusterXL and installing policy fixed the issue suggests that the Blink-upgraded member came up with no ClusterXL settings, which were restored once the policy was installed.


Boaz_Orshav
Employee

Hi

I'd like to clarify two issues:

1. sk92449 does not mention Blink because Blink is just another CPUSE package. There is no special treatment and there are no specific installation instructions for Blink packages.

2. CPUSE (DA - Deployment Agent) installs a package on the local machine. Hence it does not have any awareness of the other cluster members' state. sk92449 refers to local machine installation.


Alex_Shpilman
Collaborator

Hi @Boaz_Orshav,

 

1. True, unless the Blink utility is used, which is not clear in this case.

2. True again, but my point was that in most cases after an upgrade, the ClusterXL membership is retained and the upgraded member comes up as "Ready".

Not in this case, though. I have had a few of these cases before, and that's why I suggested shutting down the data ports until policy is installed and MVC is enabled on the upgraded member.


adina
Participant

Thanks @Alex_Shpilman and @Boaz_Orshav for the responses. I can confirm this was an upgrade not using the Blink utility.

With regards to the second point: the DA might only be aware of what is happening locally, but this does not explain why, when the firewall rebooted on its upgraded version, CCP did not detect the counterpart firewall and move to a Ready state.

Just to confirm the experience:

  1. The upgrade was run from CPUSE.
  2. The firewall rebooted with an initial policy (expected and replicated in the lab).
  3. The firewall went Active as soon as it had rebooted (not seen in the 4 other times I have tested this in the lab).
  4. I then had to run cphastop to stop them fighting over the VIP.
  5. I could then get back into the environment and push policy to the upgraded firewall.
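For a sequence like this, each step's state can be checked from the gateway CLI; a rough sketch of the relevant commands:

```
cphaprob stat      # cluster state as seen by this member
cphaprob -a if     # are the cluster interfaces (incl. Sync) recognized?
fw stat            # "InitialPolicy" here means no policy pushed yet
cphaprob syncstat  # sync statistics once both members see each other
```

If CCP were seeing the peer as expected, `cphaprob stat` after the reboot would normally show "Ready" rather than "Active" while the versions still differ.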
Yair_Shahar
Employee

Hi,

It would be good if you could share the following, with the timestamp of the occurrence, so we can try to figure out what happened in this specific case.

We need from both members:

/var/log/messages

$FWDIR/log/fwk.elg (in the VSX or USFW case).
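One way to gather those files from each member (a sketch run from the expert shell; the archive name is arbitrary, and the fwk.elg line only applies on VSX/USFW):

```
# On each cluster member, in expert mode:
tar czf /var/log/cluster_debug_$(hostname).tgz \
    /var/log/messages* \
    $FWDIR/log/fwk.elg*      # VSX / USFW only

# Quick look at cluster-related events around the failover time:
grep -iE "cluster|fullsync|state" /var/log/messages | less
```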
