Active Cluster member Freeze and failed to failover to standby R80.10

Mrigen_Sane · 2020-04-17

Hello All,

Two days ago we upgraded the members of our Check Point 12400 cluster from R77.30 to R80.10 Take 272. Early this morning the Active cluster member froze: all traffic was dropped, and we could not reach the Active device over either SSH or the console. (On the Standby member, cphaprob stat still reported Active/Standby.)

ClusterXL is enabled for High Availability. Firewall acceleration and CoreXL are also enabled.

When the device was hard rebooted, it came back up for a few minutes, then froze again and started rebooting itself.

This happened multiple times, and to avoid further outages we had to isolate the device from the environment.

After several attempts we generated a cpinfo offline and provided it to Check Point support. (We are still waiting for a response.)

NOTE: The same issue occurred when the device was running R77.30, which is why we upgraded the cluster to R80.10 Take 272.

I would appreciate any input.

Regards,

Mrigen Sane

Timothy_Hall · 2020-04-17

Sure sounds like hardware to me, since the behavior followed you across major software releases. Check out sk97251: Using the Check Point Appliance Hardware Diagnostic Tool. Could be a flaky power supply or a dead CPU fan that is causing an overheat condition and subsequent CPU downclocking into oblivion.

Also, if you have it up and stable long enough, run cpstat -f sensors os and cpstat -f power_supply os a couple of times over a 30-minute period or so. Any nonzero values reported?
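For anyone who wants to collect those samples unattended, here is a minimal sketch of a polling loop. The helper names poll_cmd and sample_hw and the POLL_COUNT and POLL_INTERVAL variables are hypothetical, made up for illustration; only the two cpstat commands come from the post above. It assumes a POSIX shell on the appliance and does nothing on a machine where cpstat is not installed.

```shell
#!/bin/sh
# Hypothetical helper (not a Check Point tool): run a command COUNT times,
# INTERVAL seconds apart, printing a timestamped header before each run.
poll_cmd() {
    count=$1
    interval=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== run $i of $count at $(date '+%Y-%m-%d %H:%M:%S') ==="
        "$@"
        # Sleep between runs, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Hypothetical wrapper around the two commands suggested in the post.
sample_hw() {
    cpstat -f sensors os
    cpstat -f power_supply os
}

# Assumed defaults: 6 samples, 300 seconds apart (about 25 minutes end to end).
# Only runs where cpstat is actually available.
if command -v cpstat >/dev/null 2>&1; then
    poll_cmd "${POLL_COUNT:-6}" "${POLL_INTERVAL:-300}" sample_hw
fi
```

Redirecting the output to a file on a separate host or USB stick may be worth considering, since a box that freezes and reboots could lose anything written locally just before the hang.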

You aren't just using one power supply with the second power supply bay empty, are you? sk107199: Power Supply sensors are shown as "Up" or "Down" when using one Power Supply and one Power...


Mrigen_Sane · 2020-04-22

We ran the Check Point hardware diagnostic tool on the faulty cluster member and found a hard drive issue.
We have sent the .tgz file to the Check Point hardware team. Since we had two hard drives in sync, a failure of one should have left the device running on the other rather than freezing it (please correct me, or share more information on this), or the cluster should have failed over to the secondary member.

Regards

MS

PhoneBoy · 2020-04-17

Why did you upgrade to R80.10 and not R80.30, which is the widely recommended release at this point?

Mrigen_Sane · 2020-04-22

We have many clusters in our environment, and this was the last one to be upgraded from R77.30 to R80.10. We hit this issue two days after upgrading to the latest take.

Plus, we need to stay in sync with our DR region, so from an organizational point of view we upgraded to R80.10 rather than R80.30.

How about R80.40? Is it a stable version, and has it reached general availability?

Regards

MS

PhoneBoy · 2020-04-22

R80.40 is a GA, stable version in general, yes, and we have around 1,000 customers using it so far, including some larger Multi-Domain environments.
The feedback on R80.40 has been very positive so far.
Having said that, we only consider something "widely recommended" (and offered via CPUSE) once it meets certain benchmarks, including the availability of a GA Jumbo Hotfix, which hasn't happened yet.
Hopefully, that will be the case in the near future.
