Matlu
Advisor

Unexpected Reboots

Hello, Mates.

I have a CP model 23800 box.
This device is part of a VSX Cluster, but for some months now it has been rebooting repeatedly. It restarts from one moment to the next without any apparent reason, or in more extreme cases it crashes and comes back up a couple of hours later without any intervention from our side.

The case has been escalated to TAC, but they still can't find a reason for this behavior.

In this scenario, is there anything I can check?
Is there any particular file, trace, or other evidence that would show why the box keeps restarting?

For example, the last reboot occurred between 15:00 and 19:00 yesterday, and TAC still can't find a possible root cause.

Thanks for your comments.

0 Kudos
25 Replies
Tal_Paz-Fridman
Employee

Are there any Core Dumps or Kernel Crashes from that time frame?

I would also look at the messages files from that time frame for any potential causes.

0 Kudos
Matlu
Advisor

Hi,

In which path on the gateway are the core dumps and kernel crash files stored?

Is reviewing the messages files still an option in this scenario?

Cheers

0 Kudos
Tal_Paz-Fridman
Employee

Core Dumps -> /var/log/dump/usermode

Kernel Crashes -> /var/log/crash/

Messages -> /var/log/messages and the rotated files /var/log/messages.*
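
A minimal sketch of how those three locations could be checked in one pass from expert mode. The helper names `recent_crash_artifacts` and `scan_messages`, the two-day window, and the default search keywords are illustrative assumptions, not Check Point tooling; only the paths come from this post.

```shell
#!/bin/sh
# Hypothetical helpers; only the default paths come from the post above.

# Print files under the given directories modified within the last N days.
# Usage: recent_crash_artifacts [days] [dir ...]
recent_crash_artifacts() {
    days="${1:-2}"
    if [ "$#" -gt 0 ]; then shift; fi
    # Fall back to the dump and crash directories when none are given.
    [ "$#" -gt 0 ] || set -- /var/log/dump/usermode /var/log/crash
    for dir in "$@"; do
        [ -d "$dir" ] || continue   # skip directories that do not exist
        find "$dir" -type f -mtime "-$days" -print
    done
}

# Search messages and its rotated copies for a pattern (case-insensitive).
# Usage: scan_messages [dir] [extended-regex]
# The default keywords are generic Linux terms chosen for illustration.
scan_messages() {
    dir="${1:-/var/log}"
    pattern="${2:-panic|watchdog|out of memory|thermal}"
    for f in "$dir"/messages "$dir"/messages.*; do
        [ -f "$f" ] || continue     # unmatched glob or missing file
        grep -H -i -E "$pattern" "$f" || true   # grep exits 1 on no match
    done
}
```

For example, `recent_crash_artifacts 3` followed by `scan_messages /var/log 'panic|reboot'` narrows the search to files and log lines around the last reboot window.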

the_rock
Legend

Hey bro,

Just check what @Tal_Paz-Fridman provided, and if you see anything relevant there, upload it to the TAC case via the SFTP account so they can analyze it.

Andy

0 Kudos
Lesley
Mentor

Run hcp -r all on the problem unit. This will also show core dumps etc. It is an easy way to run diagnostics.

Which version and take are you running? Check with cpinfo -y all

If the version is OK and there are no core dumps etc., a hardware diagnostic may be needed:

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_Gaia_AdminGuide/Content/Topics-GAG...

The unit may be running hot and shutting down, for example. It is not always a software issue.

-------
If you like this post please give a thumbs up(kudo)! 🙂
the_rock
Legend

That's an excellent point @Lesley

Andy

0 Kudos
AkosBakos
Mentor

Hi @Matlu 

What do you see in fwk.elg and fwk_wd.elg?

e.g.: grep "FWK crashed" /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg
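
To see whether that string shows up repeatedly, here is a small sketch that counts matches per file. The helper name `count_fwk_crashes` and the `fwk_wd.elg*` glob for rotated copies are assumptions; the directory and search string are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: count "FWK crashed" lines in fwk_wd.elg and any
# rotated copies. The rotated-file naming (fwk_wd.elg*) is an assumption.
# Usage: count_fwk_crashes [logdir]
count_fwk_crashes() {
    logdir="${1:-/var/log/opt/CPsuite-R81.20/fw1/log}"
    for f in "$logdir"/fwk_wd.elg*; do
        [ -f "$f" ] || continue                  # glob matched nothing
        # -c prints the number of matching lines, -H prefixes the file name
        grep -c -H "FWK crashed" "$f" || true    # grep exits 1 on zero matches
    done
}
```

Running `count_fwk_crashes` with no argument uses the R81.20 log directory from the grep above; pass another directory if the logs were copied elsewhere for analysis.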

Ask TAC to raise the ticket priority to Critical. They will join shortly and do an on-the-fly investigation.

Akos

----------------
\m/_(>_<)_\m/
0 Kudos
MichaelOZ
Employee

I have reached out by private message for more information. I'm happy to look into this for you.

0 Kudos
the_rock
Legend

Hey bro,

Do you have any update on this issue?

Andy

0 Kudos
Matlu
Advisor

Hello,

The problem is still being reviewed by TAC.

They can't find the error.

Today the box went down again at 00:00 and did not come back up. We had to restart it manually, and when it powered on it came up with errors, and the VSX Cluster came up in a broken state.

It really is a headache.

TAC is checking the core dumps, crash files, hardware diagnostics, and CPinfo, and still can't find a concrete answer.

Let's hope for a miracle. 🥲

0 Kudos
the_rock
Legend

That's unfortunate... let's hope for the best.

Andy

0 Kudos
Chris_Atkinson
Employee

Could you please share the version/JHF level of the system?

CCSM R77/R80/ELITE
0 Kudos
Matlu
Advisor

Hi,

R81.20 with JHF Take 84

The device restarts unexpectedly from time to time.
Sometimes it comes back up quickly, and other times it stays down for many hours and we have to force its recovery manually.

Cheers

0 Kudos
Lesley
Mentor

Does it always crash at a specific time? Right at 00:00 is a bit suspicious. Are there any cron jobs at that moment? Or IPS/AV/AB updates etc.?
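
One way to answer the timing question is to tally past reboots by hour of day. The sketch below assumes the reboot timestamps have already been collected one per line as `YYYY-MM-DD HH:MM` (for instance by hand from `last -x reboot` output or from the messages files); that input format and the helper name `reboot_hour_histogram` are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read "YYYY-MM-DD HH:MM" lines on stdin and print
# "<hour> <count>" for each hour of day that appears, sorted by hour.
reboot_hour_histogram() {
    awk '
        NF >= 2 {
            split($2, t, ":")      # t[1] is the hour field
            count[t[1]]++
        }
        END { for (h in count) print h, count[h] }
    ' | sort
}
```

A single hour dominating the output would point toward a scheduled task, in which case `crontab -l` and the files under /etc/cron.d (on a standard Linux layout) are worth a look. An even spread makes a scheduled job less likely.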

the_rock
Legend

Excellent point Lesley!

0 Kudos
Matlu
Advisor

Hello,

It restarts at any time; there is no fixed time.
What is certain is that the box goes down when we least expect it, and sometimes we have to power it back on manually.

Cheers.

0 Kudos
the_rock
Legend

Hey bro,

Honestly, if I were you, I would install jumbo 99. At this point, it can't make it worse, only better.

Andy

0 Kudos
Chris_Atkinson
Employee

Potentially relevant fixes in the JHF takes released since then include:

PRJ-56673, PRHF-35637: Memory corruption occurs when a bond interface is configured, leading to a Security Gateway crash with a vmcore or a boot loop.

PRJ-56480, PMTR-107271: In some scenarios, the VSX cluster can take extra time to boot up and activate the Virtual Systems.

the_rock
Legend

Yes Chris, sorry, forgot to include those.

This is what comes up when searching for "crash" in take 99 fixes.

Andy

PRJ-59058, PRHF-37185 (Security Management): When using SmartWorkflow on a Security Management Server with more than 200 administrators, requests may stall or cause SmartConsole crashes during submission.

PRJ-59118, PMTR-110235 (Security Gateway): In a rare scenario, the RAD daemon may crash during large memory allocation operations.

PRJ-58286, PMTR-109114 (Anti-Virus): In a rare scenario, when the Anti-Virus blade is enabled, the Security Gateway may crash during traffic inspection.

PRJ-58275, PMTR-110096 (SecureXL): SecureXL User Mode crashes if an acceleration card interface has an MTU above 9000 and receives frames larger than 9234 bytes.

PRJ-60103, PMTR-106961 (SecureXL): Security Gateway may crash with a vmcore during next hop routing table lookups.

PRJ-59310, PRHF-27173 (VoIP): High volumes of VoIP/SIP traffic may trigger a Security Gateway crash.

PRJ-57472, PRHF-36424 (Scalable Platforms): In rare scenarios, Interface Active check may cause a Security Gateway crash when probing a local network.

0 Kudos
Matlu
Advisor

Buddy,


I am not a VSX expert, and I would like to know whether upgrading the JHF on the VSX Cluster to try to correct this problem requires 'breaking' the VSX Cluster, doing the upgrade first on the member that is in Standby and then on the one that is Active.

Or is it not necessary to break the Cluster?

Thank you.

0 Kudos
the_rock
Legend

Correct...so method is the same, no matter the vendor, my friend. Can be Cisco, PAN, Fortinet, Sophos, whatever...you ALWAYS upgrade backup member, reboot, then do master, reboot. I would not bother flipping over to original master member, just leave it as is.

Andy

0 Kudos
Matlu
Advisor

Is it possible to do the Hotfix upgrade on the STANDBY member of the VSX Cluster, without the need to ‘break the cluster’ with the clusterXL_admin down command?

Or is it mandatory to always ‘break’ the cluster?

I see this as the last thing left to try: checking whether the JHF upgrade on the Cluster fixes this device.

0 Kudos
the_rock
Legend

Yes, you can do that, but it's better if both members are on the same jumbo.

Andy

0 Kudos
the_rock
Legend

Btw, if you do that, do not leave it like that for more than a day or two, just my personal opinion.

Andy

0 Kudos
the_rock
Legend

Not sure if it might be worth installing jumbo 99 if you are on R81.20...

Andy

0 Kudos
