Chinmaya_Naik
Advisor

Check Point Security Gateway freezes, crashes, or reboots randomly, core dump files are not created

CheckMates Admin Note: This document is an extract of sk31511 SecureKnowledge Article. Please refer to the SK for more details and full view of the procedure


For various reasons, a Security Gateway may freeze or crash, sometimes during policy installation, and in such cases no relevant information can be found in the system logs.

Gaia OS can be configured to dump a core file.

A core dump file (core file) records the memory image of a running process together with its process status, such as register values.

Normally, a core dump file is generated automatically whenever a failure happens.

But sometimes the failure is so severe that no core dump file is generated.

To find the root cause in that case, we need to extract the necessary information from the operating system (the memory stack).

NOTE :- The procedure below does not impact performance, because it only changes the Linux kernel configuration.

I demonstrate the procedure step by step only for Gaia OS (R75.40 and above), running in 32-bit or 64-bit mode.

Please refer to sk31511 for the platforms below.

  1. NGX SecurePlatform 2.6 (R70 and above)
  2. NGX SecurePlatform 2.6 (R65 ENFv26)
  3. NGX SecurePlatform 2.4 (R60 - R65)
  4. VSX NGX R67 SecurePlatform 2.6
  5. VSX NGX R65 SecurePlatform 2.4
  6. VSX NGX Scalability Pack SecurePlatform 2.4
  7. VSX NGX SecurePlatform 2.4
  8. NG AI SecurePlatform 2.4

NOTE :- This procedure is supported only on Gaia and SecurePlatform OS.

Scenario 01 :- Sometimes the issue occurs on both gateways.

Scenario 02 :- Sometimes the issue occurs on only one gateway.

REQUIREMENTS:

  1. Serial cable (RS-232)
  2. Laptop with adapter
  3. SSH/terminal client for console access (PuTTY / SecureCRT / HyperTerminal / Tera Term, etc.)

NOTE :- Before running the procedure below, I recommend running the Hardware Diagnostics on the problematic gateway.

Refer to sk97251 for details. I also recommend running the interface test as part of the Hardware Diagnostics (a loopback cable is required for the interface test).

If the Hardware Diagnostics complete successfully and all tests pass, proceed with the procedure below.

IMP NOTE: 1. Entering debug mode requires a reboot of the problematic Security Gateway, and that gateway becomes Active once it comes back up.

                    2. If the procedure is done on the Standby member, the Standby becomes Active.

 

STEP 01 :- Check which serial terminal is in use (ttyS0 or ttyS1)

  1. Connect to the gateway console with the serial cable and run the "w" command in Expert mode. The output shows the terminal as either ttyS0 or ttyS1.
  2. Note down the serial terminal shown in the output.
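
To avoid misreading the "w" output in STEP 01, the TTY column can be filtered for serial terminals. This is only a minimal sketch and not part of sk31511: the function name serial_ttys is made up for illustration, and it assumes the usual "w" layout in which the terminal name is the second column.

```shell
# Print the serial terminals (ttyS0, ttyS1, ...) found in the TTY column
# of "w" output read from standard input. Header lines are skipped
# automatically because their second field never looks like ttyS<number>.
serial_ttys() {
    awk '$2 ~ /^ttyS[0-9]+$/ { print $2 }' | sort -u
}

# On the gateway, in Expert mode, over the console session:
#   w | serial_ttys
```

If the function prints nothing, the session is probably not a serial console session (for example, an SSH session shows up as pts/N instead).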

 

STEP 02 :- Check the Gaia OS edition (32-bit or 64-bit)

  1. Connect to the problematic gateway over SSH.
  2. In CLISH, run "show version all". The output shows whether the OS edition is 32-bit or 64-bit.
  3. Note down the output.

 

STEP 03 :- Back up the "grub.conf" file.

  1. Connect to the problematic gateway over SSH.
  2. Go to Expert mode and run "cp /boot/grub/grub.conf /boot/grub/grub.conf_backup"
  3. Verify that the backup file exists by running "ls" in "/boot/grub/".
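
The backup and the check from STEP 03 can also be done in one go. The snippet below is a hedged sketch, not taken from the SK: backup_conf is a hypothetical helper name, and comparing the copy with "cmp" is my own addition on top of the "ls" check.

```shell
# Copy a file to <file>_backup and confirm the copy is byte-identical.
# Usage: backup_conf [path]   (path defaults to /boot/grub/grub.conf)
backup_conf() {
    src="${1:-/boot/grub/grub.conf}"
    dst="${src}_backup"
    cp -p "$src" "$dst" || return 1      # -p keeps mode and timestamps
    cmp -s "$src" "$dst" || return 1     # fail if the copy differs
    ls -l "$dst"                         # show the backup, as in STEP 03
}

# In Expert mode on the gateway:
#   backup_conf
```

A non-zero return code means the backup should not be trusted and the file should not be edited yet.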

 

STEP 04 :- Modify the value of the "console=" parameter by editing the "grub.conf" file.

  1. Example: "console=ttyS0" or "console=ttyS1"
  2. Depending on the OS edition (32-bit or 64-bit), you will find an entry like the one below when editing the "grub.conf" file.
  3. Example:-

For Gaia OS 64-bit (R75.40 and above)

title Start in 64bit online debug mode

    root (hd0,0)

    kernel /vmlinuz-x86_64 ro  root=/dev/vg_splat/lv_current vmalloc=256M  panic=15 console=CURRENT kdb=on crashkernel=128M@16M 3

    initrd /initrd-x86_64

  4. Now change the console parameter to "console=ttyS0" or "console=ttyS1", matching the terminal noted in STEP 01.

For Example:-

title Start in 64bit online debug mode

    root (hd0,0)

    kernel /vmlinuz-x86_64 ro  root=/dev/vg_splat/lv_current vmalloc=256M  panic=15 console=ttyS0 kdb=on crashkernel=128M@16M 3

    initrd /initrd-x86_64

NOTE :- For the 32-bit OS edition, edit the entry "title Start in 32bit online debug mode" instead; the rest of the process is the same as above.

  5. Save the "grub.conf" file and exit the editor by typing ":wq!"
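
For those who prefer to preview the change from STEP 04 before touching the file, the edit can be sketched with sed. This is an illustration under assumptions, not the SK's method: set_debug_console is a hypothetical helper, it only prints the modified text to standard output, and it assumes the 64-bit entry title is spelled exactly as in the example above.

```shell
# Print a copy of a grub.conf-style file in which the console= value is
# changed only inside the "64bit online debug mode" entry. The file itself
# is not modified.
# Usage: set_debug_console <file> <tty>   e.g. set_debug_console grub.conf ttyS0
set_debug_console() {
    file="$1"
    tty="$2"
    # The address range starts at the entry title and ends at its initrd line,
    # so console= values in other boot entries are left alone.
    sed "/^title Start in 64bit online debug mode/,/initrd/ s/console=[^ ]*/console=${tty}/" "$file"
}

# Preview against the real file, then compare with the original:
#   set_debug_console /boot/grub/grub.conf ttyS0 | diff /boot/grub/grub.conf -
```

The diff should show exactly one changed kernel line. For the 32-bit edition the title in the pattern would need to be changed accordingly.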

NOTE :- In some cases, an "Oops" message appears when running in KDB mode, because USB drivers can conflict with KDB.

In that case, add the "nousb" parameter to the kernel line.

Example:-

title Start in 64bit online debug mode

    root (hd0,0)

    kernel /vmlinuz-x86_64 ro  root=/dev/vg_splat/lv_current vmalloc=256M  panic=15 console=ttyS0 kdb=on nousb crashkernel=128M@16M 3

    initrd /initrd-x86_64

 

STEP 05 :- Connect the laptop to the Security Gateway using the serial cable (RS-232).

  1. Open the terminal client (PuTTY / SecureCRT / HyperTerminal / Tera Term, etc.)
  2. Configure the serial connection with the settings below before opening the session.
  3. Data bits: 8
  4. Baud rate: default 9600; depends on the hardware (follow sk108095)
  5. Parity: None
  6. Stop bits: 1
  7. Flow control (enable)

       7.1. DTR/DSR

       7.2. RTS/CTS

       7.3. XON/XOFF

 

STEP 06 :- Increase the "Scrollback buffer" size so that more output can be recorded. "Scrollback buffer = 32000"

 

STEP 07 :- Configure the session to save its output to a log file before you start the session.

For Example:-

Using PuTTY :- Go to Session ---- Logging ----- select "All session output" (and choose the location to save the log)

 

NOTE :- You can also save the logs later, after running all the relevant commands, but I recommend doing STEP 07 before running any command.

 

STEP 08 :- Reboot the Security Gateway.

  1. In Expert mode, run the command "reboot" and type "y" to confirm.

 

STEP 09 :- Enter the boot menu.

  1. After the reboot starts, the following prompt appears after some time:

"Press any key to see the boot menu..." [Booting in 5 seconds]

Press any key at this point to open the boot menu.

 

STEP 10 :- Start the machine in debugging mode.

  1. The boot menu shows several options.
  2. Select the option that matches the entry modified in "grub.conf" [refer to STEP 04].

       In STEP 04, I changed the entry "title Start in 64bit online debug mode" to use console=ttyS0, so I start debug mode by selecting the option "Start in 64bit online debug mode".

   3. After selecting debug mode, wait for the system to load. When the login prompt appears, log in with your username and password.

   4. After login, the problematic Security Gateway becomes Active (regardless of whether it was previously Active or Standby).

 

STEP 11 :- Enable the on-line kernel debugger with the "echo" command.

  1. [Expert@HostName]# echo 1 > /proc/sys/kernel/kdb
  2. [Expert@HostName]# cat /proc/sys/kernel/kdb (the output should be "1")
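
The write-then-read-back pattern from STEP 11 (and its reverse in STEP 20) can be wrapped in a small helper so that a failed write is noticed immediately. This is a sketch under assumptions: set_kdb is a hypothetical name, and the optional second argument exists only so the helper can be tried against an ordinary file.

```shell
# Write 1 (enable) or 0 (disable) to the kdb switch and read it back.
# Usage: set_kdb <0|1> [path]   (path defaults to /proc/sys/kernel/kdb)
set_kdb() {
    value="$1"
    path="${2:-/proc/sys/kernel/kdb}"
    case "$value" in
        0|1) ;;
        *) echo "usage: set_kdb <0|1> [path]" >&2; return 2 ;;
    esac
    [ -e "$path" ] || { echo "$path not found" >&2; return 1; }
    echo "$value" > "$path" || return 1
    # Succeeds only if the value read back matches what was written.
    [ "$(cat "$path")" = "$value" ]
}

# In Expert mode on the gateway:
#   set_kdb 1 && echo "kdb enabled"
```

The same helper with "set_kdb 0" covers the disable step at the end of the procedure.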

 

STEP 12 :- Test that you can communicate with the kernel debugger.

  1. To enter the kernel debugger prompt, use one of the key sequences below.

      "Ctrl + A" , "Ctrl + C" , "Ctrl + AA" , "ESC + K + D + B" , or send a Break signal

NOTE:- Make sure the keyboard layout is English, otherwise the gateway will freeze.

   2. The kernel debugger prompt "kdb>" now appears, together with a message like the one below.

   "Entering kdb (current=0xXXXXXXXX, pid 0) due to Keyboard Entry"

 

NOTE :- Make sure that no other process is running while you are at the "kdb" prompt.

 

STEP 13 :- Check that the memory stack is displayed.

  1. Run the command "bt".

For Example :-

EBP        EIP         Function(args)

0x8194beac 0x8011eca1 context_switch+0x81 (0x803e3980, 0x8194a000, 0x803b4000, 0x43fc, 0x8194bef4)

                               kernel .text 0x80100000 0x8011ec20 0x8011ecf0 

Or like this:

 EBP        EIP         Function(args)

0x807adf80 0x80405deb apic_timer_interrupt+0x1f (0x0, 0x807ac000, 0x80403e10)

 

STEP 14 :- Exit the on-line kernel debugger and return to the regular prompt.

  1. Run the command "kdb>go".
  2. The regular prompt "[Expert@HostName]#" appears again.
  3. At this stage, all functionality on the problematic Security Gateway is normal.

 

STEP 15 :- Wait for the next freeze or crash to happen.

:- Try installing the policy several times to check whether the gateway freezes.

 

IMP NOTE :- Keep the laptop connected to the Security Gateway through the console port.

:- This is required so that, if the gateway freezes, you can enter "kdb" mode and run the required commands directly.

:- During the issue there is only limited time to run the required commands, so it is better to have the laptop already connected.

:- If no laptop is connected, you first need to get console access during the issue before you can run the commands, and that takes time. I therefore recommend leaving the laptop connected until the next occurrence.

 

STEP 16 :- Once the gateway freezes or crashes, run the commands below.

  1. kdb>ps (complete list of the processes running on the gateway)
  2. kdb>dmesg (displays the syslog buffer) (NOTE: keep pressing "ENTER" until the end of the output, because the complete output must be captured).
  3. kdb>summary (displays a summary of the kernel version and memory)

NOTE: If the gateway has multiple CPU cores, run the "bt" command in the context of each CPU core, as described in STEP 17.

STEP 17 :-  kdb>cpu  (shows the available CPU contexts)

For Example :- With 4 CPU cores:

kdb>cpu

Output :- Currently on cpu 0
        Available cpus: 0 , 1 , 2 , 3

Since we are already on CPU 0, run the commands below for the remaining CPU cores.

kdb>cpu 1

 (NOTE : After entering "cpu 1", a message like "Entering kdb (current=0xb7c4e530, pid 0) on processor 1 due to cpu switch" appears)

kdb>bt

-----------------------------

kdb>cpu 2

kdb>bt

-----------------------------

kdb>cpu 3

kdb>bt

-----------------------------
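
Because time is short during a freeze, it can help to write out the per-core sequence from STEP 17 in advance and keep it next to the console window. The snippet below is just a sketch: kdb_cpu_checklist is a hypothetical helper that runs in any ordinary shell (for example on the laptop), not inside kdb, and it only prints the commands to type.

```shell
# Print the kdb commands to type for the CPU cores other than CPU 0,
# given the total number of cores reported by "kdb>cpu".
# Usage: kdb_cpu_checklist <number_of_cpus>
kdb_cpu_checklist() {
    ncpu="$1"
    i=1
    while [ "$i" -lt "$ncpu" ]; do
        echo "cpu $i"    # switch to the context of core i
        echo "bt"        # backtrace in that context
        i=$((i + 1))
    done
}

# Example for the 4-core gateway above:
#   kdb_cpu_checklist 4
```

For 4 cores this lists "cpu 1", "bt", "cpu 2", "bt", "cpu 3", "bt", matching the sequence shown in STEP 17.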

STEP 18 :- Copy all the output and save it in a text editor (e.g. Notepad) to be on the safe side, even though the session is already configured to log to a file (STEP 07).

 

STEP 19 :- Return to the normal shell

  1. kdb>go

 

IMP NOTE :- Sometimes you may get output like the following:

"Catastrophic error detected"

"kdb_continue_catastrophic=0, type go a second time if you really want to continue"

"kdb_continue_catastrophic=0, attempting to continue"

"Kernel panic - not syncing"

:- In this case the KDB prompt may be stuck or may have crashed. Reboot the Security Gateway manually, or wait for the gateway to crash and reboot automatically.

:- If the "bt" command does not run, return to the normal prompt, enter the kernel debugger again (follow STEP 12), and run the "bt" command again.

 

STEP 20 :- Disable the on-line kernel debugger.

[Expert@HostName]# echo 0 > /proc/sys/kernel/kdb

[Expert@HostName]# cat /proc/sys/kernel/kdb

NOTE :- The output should be "0"

 

STEP 21 :- Reboot the Security Gateway to start the machine in normal mode.

  1. [Expert@HostName]# reboot (type "y")

 

STEP 22 :- Follow STEP 09 to enter the boot menu.

Select "Start in Normal Mode" to boot the problematic gateway in normal mode.

 

FINAL STEP :- Analyze the collected output (ps, bt and dmesg commands) to find the root cause.

 

 #Chinmaya Naik

Network Security Engineer , QOS Technology PVT LTD. , INDIA

13 Replies
Jerry
Mentor

if I may

I will be totally honest with you, not necessarily helpful, sorry for that in advance

1. upgrade it all to R80.10, as fighting this will cost you more time than building everything from scratch with Gaia R80.10

2. t-shooting it depends on many factors, way too many to state here: this or that. Honestly a tough one.

3. assessing what you've just written would be extremely difficult for anyone knowing nothing about your infrastructure and surrounding environment, hence I'd rather pass than give you hints.

4. been there, done that, and I know that running an OS which is essentially 10 years old is not really a secure and reliable option.

that was my 5 cents, and sorry, I didn't mean to criticize or negate your ideas. just wanted to advise that your best bet would be to UPGRADE IT ALL in stages and have peace of mind afterwards :)

--jerry--

Jerry
Chinmaya_Naik
Advisor

Dear Jerry, let me clarify:

  1. "Due to some various cause , the security gateway may freeze or crash or during the policy installation also gateway may freeze or crash and also in such cases no such relevant information can be found on system logs."
  2. This kind of issue can happen on R75, R76, R77, R77.10, R77.20, R77.30 and R80.10 as well.
  3. The document also mentions R75.40 and above.
  4. I have already done this activity in a customer environment running R77.30 Security Gateways.

Please go through the document. It looks really long, but this activity usually takes no more than half an hour; the time may vary because a reboot is required.

Looking forward to your positive response, Jerry :)

#Chinmaya Naik :)

Jerry
Mentor

alright, let's try again though:

1. what is the HW platform? is it appliance based or a VM (VMware or HyperXX)? what is that?

3. show me the output of "df -ahls" please

4. include dmesg from the "affected device"

5. show me the cpinfo -y all

6. show me cplic print -x

7. and do let me know whether you have any SR with Check Point Support; did you raise this with anyone from the vendor or not yet?

Jerry
Chinmaya_Naik
Advisor

Dear Jerry, thanks for the update, and thank you for your valuable time in this discussion as well.
We face this issue with some of our customers; below are some details that we collected from one customer environment.
NOTE: The issue is now partially resolved and under monitoring (after replacing the primary gateway).
We had already been working with GTAC for 2 months, and finally we went through the procedure in the document that I shared.
We collected the logs as well and are waiting for R&D to analyze them.
Dmesg:-  No relevant information
License and contract are valid.
******************************************************************************************************************************************************************************************************
******************************************************************************************************************************************************************************************************
Model : Check Point 12600
_____________________________
Gateway Blade:
1.Firewall
2.Application Control
3.URL Filtering
4.Mobile access
5.IPS
6.Anti-BOT
7.Anti-Virus
8.ClusterXL
____________________________
MGMT Model:-
Open Server PowerEdge R630
____________________________
MGMT Blade Enabled:
1.Network Policy Management
2.logging and Status
3.Provisioning
________________________________
Three ISP
1.ISP 1 135 Mbps
2.ISP 2 100 Mbps
3. ISP 3 100 Mbps
___________________________
ClusterXL mode HA(High Availability  Active/Standby)
OS: GAIA R77.30
________________________________________
Per Day Log size : Average size 1.224 Gb
_____________________________________________________________________________
The issue happens during policy installation, when the gateway freezes, but not every time. Most of the time the gateway freezes randomly, on its own.
The gateway freezes happen during non-production hours.
Sometimes the customer faces the freeze on the PRIMARY gateway and sometimes on the BACKUP.
CCP packets use multicast for synchronization.
_______________________________________________________________________________________
Firewall Rule:-
748 rules in total, some of which are temporary rules.
Application Rule
59 rules in total, of which 40 are active and one is a temporary rule.
_________________________________________________________________________________________
SecureXL is ON
SecureXL was disabled by rule no. 248
We moved that rule down; it is now disabled by rule 594
___________________________________________________________________________________________
CoreXL
CoreXL: 12 cores in total.
10 CoreXL FW instances are running in total.
Dynamic Dispatcher is enabled.
_____________________________________________________________________________________________________________________________
DOMAIN OBJECT
Rules no. 681 to 699 are domain object rules; they sit near the bottom of the rulebase, which has 748 rules in total.
Rule No 682 has 43K hits
Rule No 687 has 75k Hits
Rule No 692 has 29k Hits
Rule No 696 has 13k Hits
__________________________________________________________________________________________________
In the Application and URL Filtering section there is a cleanup rule, although that is not usually recommended.
Rule No 59 has 3G hits (34.4%)
Allow rule for the applications below (320 million hits)
1.Outlook
2.One Drive
3.Skype
4.Sharepoint
NOTE: HTTPS Inspection is disabled.
_____________________________________________________________________________________________________
HOTFIX: Jumbo HotFix take_302
In the last tracker logs before the freeze, we could see 2 Anti-Virus logs, and all information in those logs was blank.
#Chinmaya Naik
***********************************************************************************************************************************************************************************
***********************************************************************************************************************************************************************************
Jerry
Mentor

thanks. that brings some new light into the discussion. now, following this, I'd like you to provide some outputs from the shell if you don't mind. please paste the following here:

1. cpwd_admin list

1a. cphaprob -a if (mark up IPs!)

2. top -H (snap)

3. fwaccel stats -s + fwaccel stat

4. cpinfo -y all

5. df -ahls

once we've got this we should be clear whether there is any SK which may help; if not, I see no option other than t-shooting it in a remote session with one of the CP Support SEs.

for me it looks like a performance issue, most likely related to either sim affinity or actually the lack of it (or missing it badly). also it may seem like some processes are corrupted during start-up; that depends on the current build number, but we can see that already (Jumbo HotFix take_302).

please provide as stated above and let's see what else could be advised.

3. w3.

Jerry
Chinmaya_Naik
Advisor

Dear Jerry, thanks for the update.

We have observed the behaviour of the Security Gateway for the last 10 days and it is still OK (after replacing the primary Security Gateway).

I will update you if we face the issue again.

Thanks for your time, Jerry.

Samirshah-1
Explorer

Did you replace the physical hardware, meaning an RMA?

 

I am facing a similar problem with R80.20

Lloyd_Crosby
Contributor

So, why did you literally copy Check Point Security Gateway on SecurePlatform / Gaia freezes, crashes, or reboots randomly, core du...  ? aka sk31511 CHINMAYA NAIK

Also, unless you understand the functions that kdb spits out, this is way overkill. I literally only did this 5 or 6 times when I was at Check Point, and only once was it useful, and that was to identify an issue.

_Val_
Admin

I would very much like to hear the answer to this question as well.

G_W_Albrecht
Legend

Me, too - a reference would have done. And citing Shakespeare would have been even more fun.

CCSE CCTE CCSM SMB Specialist
Chinmaya_Naik
Advisor

Dear Team,

Sorry to everyone for sharing this document on CheckMates.

My intention was only to simplify the process. I had already done this activity in a customer environment, so I shared it.

Yes, I already mentioned sk31511 in the document as well.

If I am violating the rules and regulations, I will remove this post from CheckMates.

As I am new on CheckMates, sorry again if I made any mistake in sharing the document.

_Val_
Admin

CHINMAYA NAIK, do not worry, no rules are violated. We encourage any activity here that can help others in understanding and mastering Check Point technology. Your input here is appreciated.

That said, the title of your article is word for word the SK title, and the content is very close, which may raise questions. I think this is a matter of expressing oneself and sharing experience. I would suggest a rather different approach. Like:

To troubleshoot <my issue> I am using a method described in <SK article>. Specifically for <my problem> I am using such and such commands and techniques from this SK. That part worked for me. That part did not.

This is only a suggestion.

Thanks for your efforts.

Chinmaya_Naik
Advisor

Thank you, sir, for your valuable comment.
