Re: Unexpected Reboots

Matlu · ‎2025-03-26

Hello, Mates.

I have a CP model 23800 box.
This device is part of a VSX Cluster, but for some months now it has had recurrent "reboot" problems: it restarts suddenly for no apparent reason or, in more extreme cases, it crashes and comes back up a couple of hours later without any intervention from our side.

The case is escalated with the TAC, but they still can't find a reason for this event.

In this scenario, is there anything I can check?
Any particular file, any trace or evidence of why the box is recurrently restarting?

For example, the last reboot occurred between 15:00 and 19:00 yesterday, and the TAC still can't find a possible root cause of the problem.

Thanks for your comments.

Tal_Paz-Fridman · ‎2025-03-26

Are there any Core Dumps or Kernel Crashes from that time frame?

I would also look at the messages files from that time frame for any potential causes.
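A rough sketch of how the messages files could be narrowed down to a reboot window. It assumes the usual syslog line format ("Mar 25 16:02:11 host ...") and that the rotated files are plain text; the function name lines_in_window and the dates in the usage comment are placeholders.

```shell
# Print syslog-style lines for one day and an inclusive hour range.
lines_in_window() {
    # $1 = month abbreviation, $2 = day of month,
    # $3 = first hour, $4 = last hour (inclusive), rest = files to search
    local mon="$1" day="$2" from="$3" to="$4"
    shift 4
    awk -v mon="$mon" -v day="$day" -v from="$from" -v to="$to" '
        $1 == mon && $2 + 0 == day + 0 {
            split($3, t, ":")              # t[1] is the hour of the timestamp
            if (t[1] + 0 >= from + 0 && t[1] + 0 <= to + 0)
                print FILENAME ": " $0
        }' "$@"
}

# Example (placeholder date and hours):
# lines_in_window Mar 25 15 18 /var/log/messages /var/log/messages.*
```

The last lines written before the gap in timestamps are usually the interesting ones, so it can help to widen the range by an hour on each side.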

Matlu · ‎2025-03-26

Hi,

In which path on the gateway are Core Dumps and Kernel Crashes stored?

Is reviewing the messages files still an option in this scenario?

Cheers

Tal_Paz-Fridman · ‎2025-03-26

Core Dumps -> /var/log/dump/usermode

Kernel Crashes -> /var/log/crash/

Messages -> /var/log/ -> messages and all the rotated files after it (messages.*)
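A minimal sketch for listing only the dump and crash files written around a given reboot. It assumes GNU find (for -newermt) is available on the box; the function name recent_files and the dates in the usage comment are placeholders.

```shell
# List files modified inside a time window in the given directories.
recent_files() {
    # $1 = window start, $2 = window end (any date format GNU find accepts),
    # rest = directories to search
    local start="$1" end="$2"
    shift 2
    local dir
    for dir in "$@"; do
        [ -d "$dir" ] || continue    # silently skip directories that do not exist
        find "$dir" -type f -newermt "$start" ! -newermt "$end" -print
    done
}

# Example (placeholder dates), using the paths from this post:
# recent_files "2025-03-25 15:00" "2025-03-25 19:00" /var/log/dump/usermode /var/log/crash
```

An empty result for the window is useful information too: a reboot with no core dump and no kernel crash file points more towards power or hardware than towards software.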

the_rock · ‎2025-03-26

Hey bro,

Just check what @Tal_Paz-Fridman provided, and if you see anything relevant there, upload it to the TAC case via the sftp account so they can analyze it.

Best,
Andy

Lesley · ‎2025-03-26

Run hcp -r all on the problem unit. It will also show core dumps etc. It is an easy way to run diagnostics.

What version and take are you running? Check with cpinfo -y all

If the version is OK and there are no core dumps etc., maybe a hardware diagnostic is needed:

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_Gaia_AdminGuide/Content/Topics-GAG...

Maybe the unit is running hot and shutting down, etc. It is not always a software issue.

-------
Please press "Accept as Solution" if my post solved it 🙂

the_rock · ‎2025-03-26

That's an excellent point, @Lesley.

Best,
Andy

AkosBakos · ‎2025-03-27

Hi @Matlu

What do you see in fwk.elg and fwk_wd.elg?

e.g.: grep "FWK crashed" /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg
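If there are several log files to check, a small loop around the same grep gives a quick count per file. This is only a sketch: the function name count_fwk_crashes is made up, and which files to pass in (for example per-VS copies of the logs) is left to the caller.

```shell
# Count "FWK crashed" lines in each file passed as an argument.
count_fwk_crashes() {
    local f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "$f: not readable" >&2
            continue
        fi
        # grep -c prints 0 when nothing matches, so every file gets a line
        printf '%s: %s\n' "$f" "$(grep -c 'FWK crashed' "$f")"
    done
}

# Example, with the path from the command above:
# count_fwk_crashes /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg
```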

Ask TAC to raise the ticket priority to Critical. They will join shortly and do an on-the-fly investigation.

Akos

----------------
\m/_(>_<)_\m/

MichaelOZ · ‎2025-03-27

I have reached out by private message for more information. I'm happy to look into this for you.

the_rock · ‎2025-03-29

Hey bro,

Do you have any update on this issue?

Best,
Andy

Matlu · ‎2025-03-29

Hello,

The problem is still being reviewed by the TAC.

They can't find the error.

Today the box went down again at 00:00 and did not come back up. We had to restart it manually, and when it powered on, it came up with errors: the VSX Cluster came up broken.

It really is a headache.

The TAC is checking Core Dumps, Crash Files, Hardware Diagnostic and CPinfo, and still can't find a concrete answer.

Let's hope for a miracle. 🥲

the_rock · ‎2025-03-29

That's unfortunate... let's hope for the best.

Best,
Andy

Chris_Atkinson · ‎2025-03-29

Could you please share the version/JHF level of the system?

CCSM R77/R80/ELITE

Matlu · ‎2025-03-30

Hi,

R81.20 with JHF Take 84

The device restarts unexpectedly from time to time.
Sometimes it comes back up quickly, and other times it takes many hours, and we have to force its recovery manually.

Cheers

Lesley · ‎2025-03-30

Does it always crash at a specific time? Right at 00:00 is a bit suspicious. Are there any cron jobs at that moment? Or IPS/AV/AB updates etc.?
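A quick sketch for spotting jobs scheduled at midnight. It only understands standard crontab syntax (a literal 0 in the minute and hour fields, or the @daily / @midnight shortcuts), and it is an assumption that any relevant jobs on the box are visible through crontab -l or /etc/crontab; the function name midnight_jobs is made up.

```shell
# Read crontab-format lines on stdin and print the ones that run at 00:00.
midnight_jobs() {
    awk '
        /^[ \t]*#/ { next }                                  # skip comments
        $1 == "@daily" || $1 == "@midnight" { print; next }  # shortcuts for 00:00
        NF >= 6 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $1 + 0 == 0 && $2 + 0 == 0 { print }
    '
}

# Examples:
# crontab -l | midnight_jobs
# cat /etc/crontab | midnight_jobs
```

Entries with ranges or steps in the minute or hour field (for example */5) are deliberately not matched, since they are not specific to midnight.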

the_rock · ‎2025-03-30

Excellent point Lesley!

Best,
Andy

Matlu · ‎2025-03-30

Hello,

It restarts at any time; there is no fixed time.
What is certain is that the box goes down when we least expect it, and sometimes we have to power it back on manually.
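To put some data behind "at any time", the boot history can be turned into a simple timeline. The sketch below reads the output of last -x (newest entry first) and marks each boot as clean when the next older record is a shutdown, or unclean when it is not. It is an assumption that the box records shutdown entries in wtmp the way a standard Linux system does; the function name classify_reboots is made up.

```shell
# stdin: output of `last -x`, newest entry first.
classify_reboots() {
    awk '
        $1 != "reboot" && $1 != "shutdown" { next }   # ignore logins, runlevel changes
        pending != "" {
            # The record just older than a boot tells us how the previous run ended
            label = ($1 == "shutdown") ? "clean:   " : "UNCLEAN: "
            print label pending
            pending = ""
        }
        $1 == "reboot" { pending = $0 }
        END { if (pending != "") print "unknown: " pending }   # oldest boot in the file
    '
}

# Example:
# last -x | classify_reboots
```

Boots marked UNCLEAN are the ones worth lining up against the dump, crash and messages timestamps.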

Cheers.

the_rock · ‎2025-03-30

Hey bro,

Honestly, if I were you, I would install jumbo 99. At this point, it can't make it worse, only better.

Best,
Andy

Chris_Atkinson · ‎2025-03-30

Potentially relevant fixes in the JHF takes released since then include:

PRJ-56673, PRHF-35637: Memory corruption occurs when a bond interface is configured, leading to a Security Gateway crash with a vmcore or a boot loop.

PRJ-56480, PMTR-107271: In some scenarios, the VSX cluster can take extra time to boot up and activate the Virtual Systems.

the_rock · ‎2025-03-30

Yes Chris, sorry, forgot to include those.

This is what comes up when searching for "crash" in take 99 fixes.

PRJ-59058, PRHF-37185	Security Management	When using SmartWorkflow on a Security Management Server with more than 200 administrators, requests may stall or cause SmartConsole crashes during submission.
PRJ-59118, PMTR-110235	Security Gateway	In a rare scenario, the RAD daemon may crash during large memory allocation operations.
PRJ-58286, PMTR-109114	Anti-Virus	In a rare scenario, when the Anti-Virus blade is enabled, the Security Gateway may crash during traffic inspection.
PRJ-58275, PMTR-110096	SecureXL	SecureXL User Mode crashes if an acceleration card interface has an MTU above 9000 and receives frames larger than 9234 bytes.
PRJ-60103, PMTR-106961	SecureXL	Security Gateway may crash with a vmcore during next hop routing table lookups.
PRJ-59310, PRHF-27173	VoIP	High volumes of VoIP/ SIP traffic may trigger a Security Gateway crash.
PRJ-57472, PRHF-36424	Scalable Platforms	In rare scenarios, Interface Active check may cause a Security Gateway crash when probing a local network.

Best,
Andy

Matlu · ‎2025-03-30

Buddy,

I am not a VSX expert, and I would like to know whether upgrading the JHF on the VSX Cluster to try to correct this problem requires 'breaking' the VSX Cluster, doing the upgrade first on the member that is in Standby and then on the one that is Active.

Or is it not necessary to break the Cluster?

Thank you.

the_rock · ‎2025-03-30

Correct... the method is the same no matter the vendor, my friend. It can be Cisco, PAN, Fortinet, Sophos, whatever... you ALWAYS upgrade the backup member, reboot, then do the master, reboot. I would not bother failing back to the original master member, just leave it as is.

Best,
Andy

Matlu · ‎2025-03-30

Is it possible to do the Hotfix upgrade on the STANDBY member of the VSX Cluster, without the need to ‘break the cluster’ with the clusterXL_admin down command?

Or is it mandatory to always ‘break’ the cluster?

I see the JHF upgrade of the Cluster as the last way to test whether this device can be fixed.

the_rock · ‎2025-03-30

Yes, you can do that, but it's better if both members are on the same jumbo.

Best,
Andy

the_rock · ‎2025-03-30

Btw, if you do that, do not leave it like that for more than a day or two, just my personal opinion.

Best,
Andy

the_rock · ‎2025-03-30

Not sure if it might be worth installing jumbo 99 if you are on R81.20...

Best,
Andy
