Unexpected Reboots
Hello, Mates.
I have a CP model 23800 box.
This device is part of a VSX cluster, but for the past few months it has been rebooting repeatedly. It restarts without warning and for no apparent reason, or in more extreme cases it crashes and comes back up a couple of hours later, without any intervention on our side.
The case has been escalated to TAC, but they still can't find a reason for this behavior.
In this scenario, is there anything I can check?
Any particular file, any trace or evidence of why the box is recurrently restarting?
For example, the last reboot occurred between 15:00 and 19:00 yesterday, and TAC still can't find a possible root cause.
Thanks for your comments.
Are there any Core Dumps or Kernel Crashes from that time frame?
I would also look at the messages files from that time frame for any potential causes.
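To pull only the lines from a given window out of the messages files, a small filter like the one below may help. This is a minimal sketch, not an official tool: `messages_between` is a hypothetical helper name, the date in the usage comment is a placeholder, and it assumes the traditional syslog timestamp layout ("Mon DD HH:MM:SS host ...").

```shell
#!/bin/bash
# Print syslog lines from one day that fall inside a time window.
# Usage: messages_between MON DAY START END FILE...
# Assumes the traditional "Mon DD HH:MM:SS host ..." timestamp layout.
messages_between() {
    local mon="$1" day="$2" start="$3" end="$4"
    shift 4
    # HH:MM:SS strings compare correctly as plain strings.
    awk -v mon="$mon" -v day="$day" -v start="$start" -v end="$end" \
        '$1 == mon && $2 == day && $3 >= start && $3 <= end' "$@"
}

# Example with a placeholder date: everything logged on Jan 5 between 15:00 and 19:00.
# messages_between Jan 5 15:00:00 19:00:00 /var/log/messages*
```

Lines just before the gap in timestamps (the last entries written before the box went down) are usually the most interesting ones.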
Hi,
In which paths on the gateway are core dumps and kernel crashes stored?
Is reviewing the messages files still an option in this scenario?
Cheers
Core Dumps -> /var/log/dump/usermode
Kernel Crashes -> /var/log/crash/
Messages -> /var/log/messages and the rotated files /var/log/messages.*
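To narrow those directories down to files written around the last reboot, something like the following may help. It is a minimal sketch: `recent_files` is a hypothetical helper name, the 1440-minute (24-hour) window is an arbitrary choice, and it assumes GNU `find` is available on the gateway.

```shell
#!/bin/bash
# List files under a directory that were modified in the last N minutes
# (default: 1440 = 24 h). Prints nothing if the directory does not exist.
recent_files() {
    local dir="$1" minutes="${2:-1440}"
    [ -d "$dir" ] || return 0
    find "$dir" -type f -mmin "-$minutes" 2>/dev/null
}

# Dump and crash paths as given in this thread.
for d in /var/log/dump/usermode /var/log/crash; do
    echo "== $d =="
    recent_files "$d" 1440
done
```

An empty result for both directories suggests the box went down without writing a dump, which points more toward power or hardware than toward a process or kernel crash.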
Hey bro,
Just check the locations @Tal_Paz-Fridman listed, and if you see anything relevant there, upload it to the TAC case via the SFTP account so they can analyze it.
Andy
Run hcp -r all on the problem unit. It will also report core dumps and similar findings, so it is an easy way to run diagnostics.
What version and take are you running? Check with cpinfo -y all.
If the version is OK and there are no core dumps, a hardware diagnostic may be needed.
The unit may be running hot and shutting itself down, for example. It is not always a software issue.
That's an excellent point, @Lesley.
Andy
Hi @Matlu
What do you see in fwk.elg and fwk_wd.elg?
e.g.: grep "FWK crashed" /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg
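If the log has been rotated, the same search can be run across all copies at once. The sketch below is only an illustration: `search_logs` is a hypothetical helper name, and the trailing `*` used to pick up rotated files is an assumption about how they are named.

```shell
#!/bin/bash
# Print every line containing a fixed string, prefixed with file name and line number.
# Usage: search_logs PATTERN FILE...
search_logs() {
    local pattern="$1"
    shift
    grep -HnF -- "$pattern" "$@" 2>/dev/null
}

# Path taken from the post above; the trailing * (rotated copies) is an assumption.
search_logs "FWK crashed" /var/log/opt/CPsuite-R81.20/fw1/log/fwk_wd.elg* \
    || echo "no matches found"
```

The file name and line number in the output make it easier to line up each hit with the reboot times.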
Ask TAC to raise the ticket priority to Critical. They will join shortly and investigate live.
Akos
\m/_(>_<)_\m/
I have reached out by private message for more information. I'm happy to look into this for you.
Hey bro,
Do you have any update on this issue?
Andy
Hello,
The problem is still being reviewed by TAC.
They can't find the error.
Today the box went down again at 00:00 and did not come back up. We had to restart it manually, and when it powered on it came up with errors and the VSX cluster came up broken.
It really is a headache.
TAC is checking core dumps, crash files, hardware diagnostics, and CPinfo, and still can't find a concrete answer.
Let's hope for a miracle. 🥲
That's unfortunate... let's hope for the best.
Andy
Could you please share the version/JHF level of the system?
Hi,
R81.20 with JHF Take 84
The device restarts unexpectedly from time to time.
Sometimes it comes back up quickly, and other times it takes many hours, and we have to force the recovery manually.
Cheers
Does it always crash at a specific time? Right at 00:00 is a bit suspicious. Are any cron jobs scheduled at that moment, or IPS/AV/AB updates?
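A quick way to look for jobs that fire at exactly 00:00 is sketched below. It is an assumption-laden example: `midnight_jobs` is a hypothetical helper name, it only reads the standard Linux cron locations (jobs scheduled through other mechanisms will not show up), and it only matches literal `0`/`00` minute and hour fields.

```shell
#!/bin/bash
# Read crontab-format lines on stdin and print entries scheduled for exactly 00:00.
# Only literal "0"/"00" minute and hour fields are matched; ranges, lists,
# steps, and wildcards that also cover midnight are not detected.
midnight_jobs() {
    awk '!/^[[:space:]]*#/ && ($1 == "0" || $1 == "00") && ($2 == "0" || $2 == "00")'
}

# Current user's crontab plus system-wide cron files, where present.
{ crontab -l 2>/dev/null; cat /etc/crontab /etc/cron.d/* 2>/dev/null; } | midnight_jobs
```

Blade update schedules (IPS/AV/AB) are configured separately, so an empty result here does not rule those out.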
Excellent point Lesley!
Hello,
It restarts at random times; there is no fixed schedule.
What is certain is that the box goes down when we least expect it, and sometimes we have to power it back on manually.
Cheers.
Hey bro,
Honestly, if I were you, I would install jumbo take 99. At this point, it can't make things worse, only better.
Andy
Potentially relevant fixes in JHF takes released since then include:
PRJ-56673, PRHF-35637: Memory corruption occurs when a bond interface is configured, leading to a Security Gateway crash with a vmcore or a boot loop.
PRJ-56480, PMTR-107271: In some scenarios, the VSX cluster can take extra time to boot up and activate the Virtual Systems.
Yes Chris, sorry, forgot to include those.
This is what comes up when searching for "crash" in take 99 fixes.
Andy
PRJ-59058 | Security Management | When using SmartWorkflow on a Security Management Server with more than 200 administrators, requests may stall or cause SmartConsole crashes during submission.
PRJ-59118 | Security Gateway | In a rare scenario, the RAD daemon may crash during large memory allocation operations.
PRJ-58286 | Anti-Virus | In a rare scenario, when the Anti-Virus blade is enabled, the Security Gateway may crash during traffic inspection.
PRJ-58275 | SecureXL | SecureXL User Mode crashes if an acceleration card interface has an MTU above 9000 and receives frames larger than 9234 bytes.
PRJ-60103, PMTR-106961 | SecureXL | Security Gateway may crash with a vmcore during next hop routing table lookups.
PRJ-59310 | VoIP | High volumes of VoIP/SIP traffic may trigger a Security Gateway crash.
PRJ-57472 | Scalable Platforms | In rare scenarios, Interface Active check may cause a Security Gateway crash when probing a local network.
Buddy,
I am not a VSX expert. To upgrade the JHF on the VSX cluster and try to correct this problem, is it necessary to 'break' the cluster, upgrading first the member that is in Standby and then the one that is Active?
Or is it not necessary to break the Cluster?
Thank you.
Correct. The method is the same no matter the vendor, my friend: Cisco, PAN, Fortinet, Sophos, whatever. You ALWAYS upgrade the backup member and reboot it, then do the master and reboot it. I would not bother failing back to the original master member afterwards; just leave it as is.
Andy
Is it possible to install the hotfix on the Standby member of the VSX cluster without 'breaking' the cluster with the clusterXL_admin down command?
Or is it mandatory to always 'break' the cluster?
I see this as the last option to test whether a JHF upgrade on the cluster fixes this device.
Yes, you can do that, but it's better if both members are on the same jumbo.
Andy
Btw, if you do that, do not leave it like that for more than a day or two. Just my personal opinion.
Andy
Not sure if it might be worth installing jumbo 99 if you are on R81.20...
Andy
