Solved: Issues with MVC mode during R80.40 Take 173 to R81.20 Take 53 upgrade

Joe_Kanaszka · ‎2024-05-06

Hey mates!

We have a two-node active/standby cluster running on a pair of 5100 appliances. One security gateway node is on R81.20 Take 53 and the other node is on R80.40 Take 173.

MVC is enabled on the R81.20 node. We enabled MVC to allow us to test the new R81.20 OS in our environment before upgrading the other node. We had only planned to be in MVC mode for a short while - upgrading the remaining node the next week.

This upgrade initially occurred at the beginning of April. Due to an emergency medical issue, I needed to take a month off and am just now getting back to work.

Shortly after the upgrade, we noticed the R81.20 node was rebooting on its own and afterward the cluster would roll over to the R80.40 node. After this happened a few times we decided to just leave the R80.40 node as the active node and I would take a look at the issue when I came back to the office.

After opening a ticket with TAC and having them examine my kernel crash files in /var/log/crash for the R81.20 node, they made this determination:

******************************************************************************

Primarily, the issue is happening because of the members being on different versions. The stack trace generated matched a previous internal task, TM-63720, which identifies the crash as a result of a change in table format.

The kernel table in question is SEP_my_IKE_packet_gtid.
The table values were changed from {local ip, peer ip} to {local id, peer ip} between versions R80.40 and R81.20.

The issue occurs when trying to sync the table between R80.40 and R81.20:
we get the owner member as an IP and not as an ID.

Recommendation:

Please make sure that both members are running with R81.20 and that will resolve the issue.

********************************************************************
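To make the TAC explanation concrete, here is a toy Python model of that kind of format mismatch. It is only an illustration under assumptions: the real table lives in the gateway kernel, the names `MEMBER_IDS`, `old_format_entry`, `new_format_entry`, and `resolve_owner` are hypothetical, and the IP addresses are documentation-range placeholders. It shows why a receiver that expects a small member ID in the first field cannot make sense of an entry whose first field is an IP address.

```python
import ipaddress

# Hypothetical member IDs known to the newer cluster member.
MEMBER_IDS = {1: "member-A", 2: "member-B"}


def old_format_entry(local_ip: str, peer_ip: str) -> tuple[int, int]:
    """Older layout per the TAC note: {local ip, peer ip}, both as 32-bit integers."""
    return int(ipaddress.IPv4Address(local_ip)), int(ipaddress.IPv4Address(peer_ip))


def new_format_entry(local_id: int, peer_ip: str) -> tuple[int, int]:
    """Newer layout per the TAC note: {local id, peer ip}."""
    return local_id, int(ipaddress.IPv4Address(peer_ip))


def resolve_owner(entry: tuple[int, int]) -> str:
    """Receiver that assumes the newer layout: the first field must be a member ID."""
    owner = entry[0]
    if owner not in MEMBER_IDS:
        raise ValueError(f"owner field {owner} is not a known member ID")
    return MEMBER_IDS[owner]


good = new_format_entry(1, "198.51.100.7")
bad = old_format_entry("192.0.2.10", "198.51.100.7")

print(resolve_owner(good))
try:
    resolve_owner(bad)
except ValueError as err:
    print("sync entry rejected:", err)
```

In this toy version the mismatched entry is rejected cleanly. Per the TAC analysis, the real gateway crashed instead.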

So I get this and agree.

However, the whole point of MVC mode in this case was to be able to test the new OS, which I really can't do. The reboots are too frequent (sometimes daily). The R81.20 node reboots even while in standby mode.

What are our options if we want to test before upgrading the other node?

Can we disable MVC? Will this break the cluster and cause weirder things to happen?

Can we manually edit the above-referenced table causing the issue on the R80.40 node, thereby getting rid of the sync issue?

I suppose we can break the cluster and simply run on the new node, but this makes me nervous. I really would rather not do this.

Thanks, everyone! Interested in hearing your thoughts.

Joe

the_rock · ‎2024-05-06

Hey brother,

First off, hope you are OKAY! Health always comes first. Now, here is what I can tell you having dealt with this exact situation with a customer 2 years ago.

So, here is what we did...we upgraded one member to R81.10 and left MVC on there, and left it off on the R80.40 member. After 3 days or so, we turned off MVC on the upgraded member, enabled it on the R80.40 member, upgraded that one, and all was fine.

So, that's what I would do in your case. Disable MVC on the R81.20 cluster member (no need to push policy), enable MVC on the non-upgraded member, upgrade, reboot, test.

Andy

Joe_Kanaszka · ‎2024-05-06

Thanks Andy! I'm doing fine - just had to take it easy for a bit to recuperate.

So talk to me more about this solution you're proposing:

I was under the impression that MVC mode needed to be enabled on the upgraded node?

So you're saying that we can turn off MVC on the R81.20 node, then enable on the R80.40 node and we'll be able to test? We can run on the R81.20 active node and keep the other node on R80.40 with MVC enabled while we test?

Hopefully, we can upgrade the other node in the next couple of weeks.

Thanks Andy!

the_rock · ‎2024-05-06

Sorry, I said it wrong...you are 100% right...KEEP it enabled on the upgraded member. Once you are ready to upgrade the other one, enable it there as well, upgrade, reboot. Then you can turn it off on both.

https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_Installation_and_Upgrade_Gui...

Andy

Joe_Kanaszka · ‎2024-05-06

OK - Thanks Andy!

But my issue is that currently I cannot test the R81.20 node because the node is rebooting on its own, sometimes as often as every day. We still need to run on this new node and make sure it is running properly.

How do I test my upgraded node and not have it reboot every day?

Thanks again!

the_rock · ‎2024-05-06

Ok, few questions...

1) Was this happening BEFORE the upgrade?

2) Any crash files in /var/log/crash dir?

3) What is cpu usage? memory?

4) Can you send output of below?

[Expert@CP-FW-01:0]# fw tab -t connections | grep limit
dynamic, id 8158, num ents 10, load factor 0.0, attributes: keep, sync, aggressive aging, kbufs 21 22 23 24 25 26 27 28 29 30 31 32 33 34, expires 25, refresh, , hashsize 2097152, unlimited
[Expert@CP-FW-01:0]#

[Expert@CP-FW-01:0]# fw tab -t connections -s
HOST                  NAME                               ID #VALS #PEAK #SLINKS
localhost             connections                      8158    21  1574      39
[Expert@CP-FW-01:0]#
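
For anyone collecting the same data from several gateways, here is a small Python sketch for pulling the numbers out of `fw tab -t connections -s` output. It assumes the six-column layout shown in the sample above (HOST, NAME, ID, #VALS, #PEAK, #SLINKS); `parse_fw_tab_summary` is a hypothetical helper name, not a Check Point tool.

```python
# Sample text copied from the output above.
SAMPLE = """\
HOST NAME ID #VALS #PEAK #SLINKS
localhost connections 8158 21 1574 39
"""


def parse_fw_tab_summary(text: str) -> list[dict]:
    """Parse `fw tab -t <table> -s` output into one dict per table row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything without exactly six columns.
        if len(fields) != 6 or fields[0] == "HOST":
            continue
        host, name, table_id, vals, peak, slinks = fields
        rows.append({
            "host": host, "name": name, "id": int(table_id),
            "vals": int(vals), "peak": int(peak), "slinks": int(slinks),
        })
    return rows


summary = parse_fw_tab_summary(SAMPLE)
for row in summary:
    print(f"{row['name']}: current={row['vals']} peak={row['peak']}")
```

Splitting on whitespace keeps the parser indifferent to how the columns are padded, so it should also cope with the wider, aligned form of this output.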

Joe_Kanaszka · ‎2024-05-06

Thanks Andy!

Let me get all this information rounded up for you. I have a Zoom in 20 minutes but I'll try to have it for you by tomorrow. Thanks again Andy!

-Joe

the_rock · ‎2024-05-06

10-4 mate

Andy

the_rock · ‎2024-05-06

Another suggestion...maybe it would not be a bad idea to open a TAC case as well, since a firewall rebooting this frequently is definitely a major issue, I would say.

Andy

Joe_Kanaszka · ‎2024-05-06

I agree. I opened a new SR last week. Here is what they say:

*********************

Primarily, the issue is happening because of the members being on different versions. The stack trace generated matched a previous internal task, TM-63720, which identifies the crash as a result of a change in table format.

The kernel table in question is SEP_my_IKE_packet_gtid.
The table values were changed from {local ip, peer ip} to {local id, peer ip} between versions R80.40 and R81.20.

The issue occurs when trying to sync the table between R80.40 and R81.20:
we get the owner member as an IP and not as an ID.

Recommendation:

Please make sure that both members are running with R81.20 and that will resolve the issue.

********************

So this response from TAC is what led me to check with you guys.

How can we test in MVC mode if the newly upgraded node that we want to run on is rebooting?

TAC is saying the node is rebooting because of a table discrepancy between the two versions. (See above).

So in this scenario is MVC mode not an option?

the_rock · ‎2024-05-06

They do have a point brother. I would definitely upgrade the other one as soon as you can.

Andy

emmap · ‎2024-05-06

MVC is designed to let gateways cluster nicely for a short time while on two different versions, inside a change window. It's not designed to run like that for long periods, so that's not something we QA test, and we can't guarantee that it'll work longer term. The general recommended procedure is to upgrade one member, enable MVC to get them clustered and sync'd, fail over to the upgraded member, test your critical apps to ensure that the important stuff works, and then upgrade the other member. If you end up with issues longer term that didn't come up in the window, either fix forward with TAC assistance or roll back to the snapshot that was automatically taken when you did the upgrade, one member at a time (enable MVC on the member you haven't rolled back yet to downgrade gracefully).
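
The sequence described above can be written down as a simple ordered checklist. This is a sketch of the steps as stated in this post, not an official procedure; the `Step` class and `upgrade_plan` function are hypothetical names, and `gw-a` / `gw-b` are placeholder member names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    member: str
    action: str


def upgrade_plan(first: str, second: str) -> list[Step]:
    """Ordered steps for a two-member cluster, all inside one change window."""
    return [
        Step(first, "upgrade to the new version"),
        Step(first, "enable MVC so it clusters and syncs with the older member"),
        Step(first, "fail over to this member"),
        Step(first, "test critical applications"),
        Step(second, "upgrade to the new version"),
    ]


for number, step in enumerate(upgrade_plan("gw-b", "gw-a"), start=1):
    print(f"{number}. [{step.member}] {step.action}")
```

Keeping the plan as data makes the key point visible: every step before the last one touches only the first member, and the mixed-version state ends as soon as the final step is done.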

emmap · ‎2024-05-06

And in breaking news, this specific issue has been located and patched; the fix should be in a future JHF take. Look for PRJ-54611. If you need it before then, contact TAC and they can ask for a port of it.

Joe_Kanaszka · ‎2024-05-07

Thank you so much emmap!

So currently it appears that the upgraded node is rebooting on average about every day. Over the past month, the max time the upgraded node stayed up before rebooting was about 8 days. It is rebooting while in active and standby state. After the upgrade on Friday April 5th, we were able to test while on R81.20 and all of our systems worked that night. We wanted to give it a week and upgrade the other node the following Friday. (We do major upgrades like this after close of business on Fridays.) Then I ran into a medical emergency and was forced to take off until recently.

So basically we need to upgrade the other node to get the cluster stable. Correct?

Thank you again!

the_rock · ‎2024-05-07

I would do it as soon as you can Joe.

Andy

emmap · ‎2024-05-06

Just a note, you never need to enable MVC on the lower version cluster member. Its function is to enable a higher version to cluster with a lower version, not the other way around.
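
That rule can be expressed as a short Python sketch: given the version on each member, MVC is only relevant on members running something higher than the lowest version in the cluster. This is an illustration of the statement above, not a Check Point tool; `release_key` and `members_needing_mvc` are hypothetical helpers, and the version parsing assumes strings shaped like `R81.20`.

```python
import re


def release_key(version: str) -> tuple[int, ...]:
    """Turn a release string such as 'R81.20' into a comparable tuple, e.g. (81, 20)."""
    match = re.fullmatch(r"R(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def members_needing_mvc(versions: dict[str, str]) -> list[str]:
    """Members running a higher release than the lowest one in the cluster."""
    lowest = min(release_key(v) for v in versions.values())
    return sorted(name for name, v in versions.items() if release_key(v) > lowest)


# The cluster from this thread: only the upgraded node qualifies.
print(members_needing_mvc({"node1": "R81.20", "node2": "R80.40"}))  # ['node1']
```

Once every member is on the same version the function returns an empty list, which matches the idea that MVC is a temporary, upgrade-time setting.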

the_rock · ‎2024-05-07

That's not quite true according to the official CP documentation. I was told the exact same thing by TAC as well. Unless the example given in the document is slightly different, as it covers a case with 3 members, not 2...

Andy

https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_Installation_and_Upgrade_Gui...

the_rock · ‎2024-05-07

@emmap I see what you are saying now...I read the doc again and it makes sense. There are 3 members in that example, and MVC is to be enabled on M3 and M2, but NOT on M1, so your statement is totally logical.

Thanks for pointing that out, I honestly had no idea that was the case, but will remember from now on it only needs to be enabled for the time being on the upgraded member.

Cheers,

Andy

emmap · ‎2024-05-07

It can get confusing, especially when the R81.20 JHF enables it for you.

the_rock · ‎2024-05-07

But thank you very much for clarifying, I honestly had no idea that was the case. I always thought it had to be enabled on all members, but it's good to know it's only done on whichever member is upgraded first (usually the backup).

Best,

Andy

Joe_Kanaszka · ‎2024-05-07

Wait - it does? MVC is automatically enabled?

the_rock · ‎2024-05-07

That's another thing I was not aware of...

Joe_Kanaszka · ‎2024-05-07

Hi Emmap!

Is MVC enabled by default when you upgrade a node to R81.20?

Also, can you please explain this excerpt from the R81.20 ClusterXL Administration Guide?

R81.20 ClusterXL Administration Guide

If a specific scenario requires you to disable the MVC Mechanism before the first start of an R81.20 Cluster Member (for example, immediately after an upgrade to R81.20), then disable it before the first policy installation on this Cluster Member.

Not sure I understand this.

Thank you again!

emmap · ‎2024-05-07

The R81.20 JHF from take 14 up will enable MVC when you install it. It's in the Important Notes section of the JHF documentation. Apparently it's required due to one of the changes made in that JHF, but I don't have additional details. GA R81.20 does not have MVC enabled.

I'm not sure what kind of scenario that statement is talking about, but it might be warning that if MVC is enabled before you do the initial policy install, the gateway will immediately join the cluster and may take over the active role if the cluster is configured that way. Disabling MVC would prevent this takeover if the other cluster member is still on a lower version.
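
The Take 14 rule above is easy to encode as a check. A minimal sketch, covering only what is stated in this post for R81.20; `mvc_enabled_by_jhf` is a hypothetical helper, and it deliberately refuses to answer for other versions because nothing here says how they behave.

```python
from typing import Optional


def mvc_enabled_by_jhf(version: str, jhf_take: Optional[int]) -> bool:
    """True if, per the note above, installing this JHF take turns MVC on.

    jhf_take=None stands for a GA install with no Jumbo Hotfix.
    """
    if version != "R81.20":
        # Assumption: only the R81.20 behaviour is described in this thread.
        raise ValueError("this sketch only encodes the R81.20 statement")
    return jhf_take is not None and jhf_take >= 14


print(mvc_enabled_by_jhf("R81.20", 53))    # the upgraded node in this thread
print(mvc_enabled_by_jhf("R81.20", None))  # GA, no JHF
```

For the upgraded node discussed here (R81.20 Take 53) the check comes back true, which is consistent with MVC being on after the JHF install.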

Joe_Kanaszka · ‎2024-05-08

Ah ok. Thank you again!

the_rock · ‎2024-05-08

Everything okay now brother?

Joe_Kanaszka · ‎2024-05-08

All good. We've coordinated with Check Point TAC and our head office for the upgrade of the secondary node next Friday.

Thanks again Andy!

the_rock · ‎2024-05-08

You can thank @emmap . She is obviously way smarter than I am haha 🙂

I learned something new, so definitely will keep that in mind next time I do cluster upgrade.

Best,

Andy

Joe_Kanaszka · ‎2024-05-08

Take it easy man!

-Joe

the_rock · ‎2024-05-08

I am, haha.
