Tips or Tricks for reverting Cluster firmware version from R81.20 back to R81.10

Patrick_Taphorn · ‎2023-09-22

Several weeks ago we upgraded our active/passive cluster appliances (7000 model) from R81.10 to R81.20 + JHF T24. We started experiencing issues where web conferencing applications such as Teams, Zoom, and WebEx would randomly lose audio for 10-20 seconds. Reviewing the logs in SmartEvent, I noticed that at the exact times people reported the web conferencing apps dropping their audio connections, the Sync interface on the standby cluster node went from “down” back to “up” within that same 10-20 second window. Both symptoms started the day after the upgrade to R81.20. I opened TAC cases with support back on August 21st, and after several debug collections, log reviews, and TCP dumps, Check Point support and R&D are no further along in providing me with a resolution. I am planning to revert to R81.10 so as not to impact my business partners any further.

Looking for any recommendations, tips, or tricks on how to revert both cluster nodes back to R81.10 using the snapshot while keeping the downtime to a minimum.

PhoneBoy · ‎2023-09-22

On the management side, it's merely an issue of changing the version to R81.10 and pushing policy.
You can't do that until you revert the gateways to the snapshot in question, though.

It's possible you might be able to use MVC functionality "in reverse" (revert one member of the cluster, enable MVC).
However, I'm not 100% sure that's supported.
In any case, I would plan for downtime here.

the_rock · ‎2023-09-23

We actually had a customer experience the exact same issue after an upgrade. Sync was fine and there was never a failover after the upgrade, but they complained about the Zoom and MS Teams problem, exactly as you described.

What we did was add a rule in the internal inline layer allowing everyone to those apps (src any, dst any, apps: anything Zoom and MS Teams), and that seemed to fix the issue. By the way, it made no difference whether MVC was on or off; nothing really changed until we added the rule. Personally, I would not bother reverting, as I have always found R81.20 much faster and better performing than R81.10.

Hope that helps.

Andy

the_rock · ‎2023-09-23

By the way, this is the rule I was talking about that we added AFTER the upgrade to R81.20. It's not really a security concern, as it only applies to the internal zone. I really have a gut feeling it's related to R81.20, but sadly I don't have concrete proof, as the customer never complained about it back in R80.40 or R81.10.

Andy

Patrick_Taphorn · ‎2023-09-25

Thanks for the info the_rock,

Because of a couple of unresolved node crashes as well as the problems I mentioned above, I ended up deciding to roll the cluster back to R81.10 instead of trying the rule you mentioned. Since rolling back we have not seen the issue with our Sync interface or heard any reports of dropped audio in Teams/Zoom/WebEx. Support and R&D said they will continue to look into it, and hopefully they find something so we can make the jump back to R81.20 and take advantage of some of the new features.

For anyone interested, the steps I took to downgrade back to R81.10 were:

1. From Expert mode on each cluster member, enable MVC: # cphaconf mvc on
2. Using the Gaia Web UI, revert the standby node and wait for the reboot.
3. Verify that the standby node (now running R81.10) has joined the cluster again.
4. Using the Gaia Web UI, revert the primary node and wait for the reboot.
5. In SmartConsole, set the cluster object back to R81.10 and install policy on both nodes.
6. Turn off MVC on both nodes: # cphaconf mvc off
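
The Expert-mode parts of the steps above can be sketched as a small helper script. This is a minimal sketch, not a tested or supported procedure. Only `cphaconf mvc on` and `cphaconf mvc off` come from the steps in this post. The `cphaprob state` and `fw ver` checks are additions assumed here as a way to confirm cluster state and the running version, and the `rollback_phase` name and its `pre`/`check`/`post` phases are hypothetical. The Gaia Web UI reverts and the SmartConsole change stay manual. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch only: prints (or, with DRY_RUN=0, runs) Expert-mode commands around
# a manual cluster rollback. The phase names are hypothetical; the revert
# itself is still done by hand in the Gaia Web UI.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Echo the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

rollback_phase() {
    case "$1" in
        pre)
            # Before any revert: enable MVC on this member, then view state.
            run cphaconf mvc on
            run cphaprob state
            ;;
        check)
            # After a member reboots: confirm its version and cluster state.
            # (fw ver / cphaprob state are assumed checks, not from the post.)
            run fw ver
            run cphaprob state
            ;;
        post)
            # After policy is installed on both members: disable MVC.
            run cphaconf mvc off
            run cphaprob state
            ;;
        *)
            echo "usage: rollback_phase pre|check|post" >&2
            return 1
            ;;
    esac
}

# Allow direct use, e.g.: sh rollback.sh pre
if [ "$#" -gt 0 ]; then
    rollback_phase "$1"
fi
```

Run on each member, the order would be `pre` on both nodes, `check` on each node after its revert and reboot, and `post` on both nodes once policy has been installed from SmartConsole.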

the_rock · ‎2023-09-25

Glad you got it sorted out. I think we won't roll back our client, because R81.20 is a good version, so we will work with TAC on a newly opened case to solve any issues that come up. It's unfortunate this happened, but all the other R81.20 upgrades I have handled for clients went just fine, with no issues at all.

Andy

Feridun_ÖZTOK · ‎2023-09-27

We have the same issue and rolled our customer back to R81. Customers are unhappy and will not tolerate these lost connections. We've reverted 10 customers so far, and it looks like we will continue. The next upgrade will now be to R82; they don't want to go back to R81.20.

the_rock · ‎2023-09-27

As I stated in my previous post, I truly believe R81.20 is a solid version, especially given the amount of testing I have done on it since November of 2022.

Andy

Feridun_ÖZTOK · ‎2023-09-27

Everything has been fine since the beginning of the year, except for a few minor problems. But we could not respond quickly enough to the sudden problems. For the customer, what matters is that it works, not the version. The way to continue working with Check Point was to revert to a working version.

Patrick_Taphorn · ‎2023-09-27

It's reassuring to hear that others in the community are experiencing similar issues and that it's not just a one-off for our company. Hopefully R&D has gathered enough information from me and other customers to find a permanent fix. I have no doubt R81.20 WILL be a solid release, but personally I don’t think it was ready for the “Recommended Version” title quite yet.

the_rock · ‎2023-09-27

You made a very good point.

Andy

genisis__ · ‎2023-09-27

This goes back to what seems to be a theme with all vendors: QA does not appear to be as good as it should be. Please keep in mind this is not just Check Point.

I have said this before: I would prefer less frequent updates that are far more stable than new feature updates, i.e. a Long Life / New Feature train may be an idea.

All this said, I've not gone to R81.20, so I cannot speak from personal experience. However, I know "the_rock" is damn thorough in his testing, which gives me confidence that we are talking about very specific issues that are potentially not widespread.

Your experiences are invaluable feedback, and you have the correct approach: technology is there to serve a business requirement, and the moment it becomes unstable it translates directly into financial loss, which should not be tolerated.

Matlu · ‎2023-09-27

Based on the comments here, I am already "suspicious" about bringing our products to version R81.20.

Maybe these are "isolated" problems, but having already had a few bad experiences of the same kind, this becomes something to think through carefully, since in looking for an improvement we can often end up complicating what we have now.

I think it would be good if Check Point could comment on this, because at this moment I feel that suggesting an upgrade to R81.20 is not the best option, for now.

the_rock · ‎2023-09-27

It's a tricky situation, but honestly I have found no issue that can't be fixed. I get that every customer's situation is different, but I strongly believe the version is solid, based on all the testing I have done since November 2022.

Andy

the_rock · ‎2023-09-27

Btw bro, we fixed all the issues for the customer (MS Teams, Zoom); now it's just a question of figuring out RADIUS server failover. We have a case with the TAC escalation team and were given 2 good SKs about it, so I think we should be good, but we may need to verify the next time there is a primary RADIUS server failure.

Andy

sk102557 - explains all the various timeout parameters and what they mean

sk42449 - explains the two most important timeouts for RADIUS server failovers

