Re: Unexpected Reboots

Matlu

Hello, Mates.

I have a CP model 23800 box.
This device is part of a VSX Cluster, but for some months now it has been rebooting repeatedly. It restarts from one moment to the next for no apparent reason, or in more extreme cases it crashes and only comes back up a couple of hours later, without any intervention from our side.

The case has been escalated to TAC, but they still can't find the cause.

In this scenario, is there anything I can check?
Any particular file, trace, or other evidence of why the box keeps restarting?

For example, the last reboot occurred between 15:00 and 19:00 yesterday, and TAC still can't find a possible root cause.

Thanks for your comments.

Tal_Paz-Fridman

Are there any Core Dumps or Kernel Crashes from that time frame?

I would also look at the messages files from that time frame for any potential causes.

Matlu

Hi,

Which paths on the gateway hold the Core Dumps and Kernel Crashes?

Is reviewing the messages files still an option in this scenario?

Cheers

Tal_Paz-Fridman

Core Dumps -> /var/log/dump/usermode

Kernel Crashes -> /var/log/crash/

Messages -> /var/log/ -> messages and all the rotated files after it (messages.*)
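
To narrow down which files in those locations were written around a reboot, a small shell helper along these lines may help. This is only a sketch: files_in_window is a hypothetical helper name, not a Check Point tool, and it assumes GNU find (with the -newermt test) is available in expert mode. The 15:00 to 19:00 window in the usage comment is the one mentioned earlier in the thread.

```shell
# files_in_window START END DIR...
# Print regular files under each DIR whose modification time is
# after START and not after END. Missing directories are skipped.
# NOTE: hypothetical helper for illustration; assumes GNU find.
files_in_window() {
    local start="$1" end="$2"
    shift 2
    local dir
    for dir in "$@"; do
        # Skip paths that do not exist on this system.
        [ -d "$dir" ] || continue
        find "$dir" -type f -newermt "$start" ! -newermt "$end" -print
    done
}

# Example usage (yesterday, 15:00 to 19:00):
#   day=$(date -d yesterday +%F)
#   files_in_window "$day 15:00" "$day 19:00" \
#       /var/log/dump/usermode /var/log/crash /var/log
```

Anything it lists is a candidate to upload to the TAC case; an empty result for the dump and crash directories suggests the reboot left no core or vmcore behind.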

the_rock

Hey bro,

Just check what @Tal_Paz-Fridman provided and if you see anything relevant there, upload to TAC case via sftp account and they can analyze.

Andy

Lesley

Run hcp -r all on the problem unit. This will also show core dumps etc. It is an easy way to run diagnostics.

Which version and take are you running? cpinfo -y all

If the version is OK and there are no core dumps etc., maybe a hardware diagnostic is needed:

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_Gaia_AdminGuide/Content/Topics-GAG...

Maybe the unit is running hot and shutting down. It is not always a software issue.

-------
If you like this post please give a thumbs up(kudo)! 🙂

the_rock

That's an excellent point @Lesley

Andy

AkosBakos

Hi @Matlu

What do you see in fwk.elg and fwk_wd.elg?

e.g.: grep "FWK crashed" /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg
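
If older entries matter too, the same search can be extended to rotated copies of the log. The sketch below is an assumption-laden illustration: grep_rotated is a hypothetical helper, and it assumes any rotated copies sit next to the live file with an extra suffix (for example fwk_wd.elg.0), which should be verified on the box.

```shell
# grep_rotated PATTERN FILE
# Search FILE and any sibling files named FILE.<suffix> for PATTERN,
# prefixing each hit with the file it came from.
# NOTE: hypothetical helper; the rotation naming is an assumption.
grep_rotated() {
    local pattern="$1" base="$2"
    local f
    for f in "$base" "$base".*; do
        # An unmatched glob stays literal, so check the file exists.
        [ -f "$f" ] || continue
        grep -H -- "$pattern" "$f"
    done
    return 0
}

# Example usage with the path from the post above:
#   grep_rotated "FWK crashed" /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg
```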

Ask TAC to raise the ticket priority to Critical. They will join shortly and do an on-the-fly investigation.

Akos

----------------
\m/_(>_<)_\m/

MichaelOZ

I have reached out by private message for more information. I'm happy to look into this for you.

the_rock

Hey bro,

Do you have any update on this issue?

Andy

Matlu

Hello,

The problem is still being reviewed by TAC.

They can't find the error.

Today the box went down again at 00:00 and did not come back up. We had to restart it manually, and when it powered on it came up with errors and the VSX Cluster came up broken.

It really is a headache.

TAC is checking Core Dumps, Crash Files, Hardware Diagnostics, and CPinfo, and still can't find a concrete answer.

Let's hope for a miracle. 🥲

the_rock

That's unfortunate... let's hope for the best.

Andy

Chris_Atkinson

Could you please share the version/JHF level of the system?

CCSM R77/R80/ELITE

Matlu

Hi,

R81.20 with JHF Take 84

The device restarts unexpectedly from time to time.
Sometimes it comes back up quickly; other times it takes many hours, and we have to force its recovery manually.

Cheers

Lesley

Does it always crash at a specific time? Right at 00:00 is a bit suspicious. Any cron jobs at that moment? Or IPS/AV/AB updates, etc.?
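
One rough way to check the cron angle is to filter crontab-style files for entries pinned to a given minute and hour. This is a sketch only: cron_at is a hypothetical helper, the file locations in the usage comment are the usual Linux ones and may differ on Gaia, and it matches only literal minute/hour fields, so wildcard or step schedules such as "*" or "*/5" are not reported.

```shell
# cron_at MIN HOUR FILE...
# Print entries from crontab-style FILEs whose first two fields
# (minute, hour) equal MIN and HOUR. Comments and blank lines are
# skipped; wildcard, range, and step schedules are NOT matched.
# NOTE: hypothetical helper for illustration.
cron_at() {
    local min="$1" hour="$2"
    shift 2
    awk -v m="$min" -v h="$hour" '
        /^[ \t]*(#|$)/     { next }                       # comment or blank
        $1 == m && $2 == h { print FILENAME ": " $0 }    # pinned to MIN HOUR
    ' "$@"
}

# Example usage (jobs scheduled for exactly 00:00):
#   cron_at 0 0 /etc/crontab /etc/cron.d/*
```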


the_rock

Excellent point Lesley!

Matlu

Hello,

It restarts at any time; there is no fixed schedule.
What is certain is that the box goes down when least expected, and sometimes we have to power it back on manually.

Cheers.

the_rock

Hey bro,

Honestly, if I were you, I would install jumbo 99. At this point, it can't make it worse, only better.

Andy

Chris_Atkinson

Potentially relevant fixes in JHF takes released since then include:

PRJ-56673, PRHF-35637: Memory corruption occurs when a bond interface is configured, leading to a Security Gateway crash with a vmcore or a boot loop.

PRJ-56480, PMTR-107271: In some scenarios, the VSX cluster can take extra time to boot up and activate the Virtual Systems.

CCSM R77/R80/ELITE

the_rock

Yes Chris, sorry, forgot to include those.

This is what comes up when searching for "crash" in take 99 fixes.

Andy

PRJ-59058, PRHF-37185	Security Management	When using SmartWorkflow on a Security Management Server with more than 200 administrators, requests may stall or cause SmartConsole crashes during submission.
PRJ-59118, PMTR-110235	Security Gateway	In a rare scenario, the RAD daemon may crash during large memory allocation operations.
PRJ-58286, PMTR-109114	Anti-Virus	In a rare scenario, when the Anti-Virus blade is enabled, the Security Gateway may crash during traffic inspection.
PRJ-58275, PMTR-110096	SecureXL	SecureXL User Mode crashes if an acceleration card interface has an MTU above 9000 and receives frames larger than 9234 bytes.
PRJ-60103, PMTR-106961	SecureXL	Security Gateway may crash with a vmcore during next hop routing table lookups.
PRJ-59310, PRHF-27173	VoIP	High volumes of VoIP/ SIP traffic may trigger a Security Gateway crash.
PRJ-57472, PRHF-36424	Scalable Platforms	In rare scenarios, Interface Active check may cause a Security Gateway crash when probing a local network.

Matlu

Buddy,

I am not a VSX expert, and I would like to know whether upgrading the JHF on the VSX Cluster to try to correct this problem requires 'breaking' the VSX Cluster, working first on the member that is in Standby and then on the one that is Active.

Or is it not necessary to break the Cluster?

Thank you.

the_rock

Correct... the method is the same no matter the vendor, my friend. It can be Cisco, PAN, Fortinet, Sophos, whatever... you ALWAYS upgrade the backup member, reboot, then do the master, reboot. I would not bother failing back to the original master member, just leave it as is.

Andy

Matlu

Is it possible to do the Hotfix upgrade on the STANDBY member of the VSX Cluster, without the need to ‘break the cluster’ with the clusterXL_admin down command?

Or is it mandatory to always ‘break’ the cluster?

As I see it, upgrading the JHF on the Cluster is the last thing left to try to fix this device.

the_rock

Yes, you can do that, but it's better if both members are on the same jumbo.

Andy

the_rock

Btw, if you do that, do not leave it like that for more than a day or two, just my personal opinion.

Andy

the_rock

Not sure if it might be worth installing jumbo 99 if you are on R81.20...

Andy
