rdegoix
Participant

R80.30 Standby node 100% cpu

Hello everyone, 

Once again, I'm asking for your help with some strange behavior 😉

I have a cluster running Gaia R80.30 (1 core). The primary (active) node is fine, running at around 7-10% CPU (production traffic is flowing well, with no complaints from the customer 😄).

But my standby (passive) node is running at 100% CPU, and no specific process accounts for the high CPU usage.

So far I have compared the following. FWVE-1 is the PASSIVE node and FWVE-2 is the ACTIVE one.

cpview (see screenshot as proof of the strange behavior, top_standby_fw.jpg)

Version (see screenshot version_compare_fw.jpg)

Product version Check Point Gaia R80.30
OS build 200
OS kernel version 2.6.18-92cpx86_64
OS edition 64-bit

cpinfo -y all (see screenshot, cpinfo y all compare fw.jpg)

top (see screenshot top_standby_fw.jpg)

cphaprob -a if and state (see screenshot)

 

 

Thanks in advance for your help, and please let me know if I can provide more information to help you investigate.

Best regards,

 

Robin.

 

 

8 Replies
rdegoix
Participant

After a reboot it started working again. We were not able to find the root cause, but it looks like it's solved!
Timothy_Hall
Champion

Wow, that top output is strange: 78% CPU consumed in user/process space, yet the top CPU-consuming process is only using 2.3%? It is a little unusual to have only one CPU; I assume CoreXL is disabled on both members? I'd theorize that the top output is bugged, but cpview is showing the same thing...
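
One way to cross-check what top and cpview report is to compute the split directly from the kernel counters in /proc/stat. This is only a minimal sketch, assuming the standard Linux field order on the aggregate cpu line (user, nice, system, idle, ...); cpu_split is a hypothetical helper name, not a Check Point command.

```shell
# Hypothetical helper (not a Check Point tool): compute the CPU time split
# between two samples of the aggregate "cpu" line from /proc/stat.
# Fields after the label are assumed to be: user nice system idle iowait ...
cpu_split() {
    # $1 = first "cpu ..." line, $2 = second "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) a[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= NF; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total == 0) { print "no ticks elapsed"; exit }
            # user + nice are reported together as "user"
            printf "user=%d%% system=%d%% idle=%d%%\n",
                   100 * (d[2] + d[3]) / total,
                   100 * d[4] / total,
                   100 * d[5] / total
        }'
}

# Usage on a live system: take two samples a few seconds apart.
#   first=$(grep '^cpu ' /proc/stat); sleep 5; second=$(grep '^cpu ' /proc/stat)
#   cpu_split "$first" "$second"
```

If the user share computed this way also sits near 100% while no single process stands out, that would point away from a display bug in top and toward CPU time being burned by processes too short-lived to be listed.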

The only other thing I can think of is some kind of rapidly dying and respawning process eating CPU that does not show up in top because it never lasts long enough. Are there any process core dumps in /var/log/dump/usermode? Also check the $FWDIR/log/*.elg and $CPDIR/log/*.elg log files for any evidence of a rapidly dying and respawning process...
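
The two checks above could be scripted roughly as follows. This is a sketch under assumptions: count_dumps and scan_elg are hypothetical helper names, and the search pattern is only a guess at what a crash or restart message might contain, not a documented .elg log format.

```shell
# Count regular files in a core-dump directory (prints 0 if it is missing or empty).
count_dumps() {
    dir=${1:-/var/log/dump/usermode}
    [ -d "$dir" ] || { echo 0; return 0; }
    find "$dir" -type f | wc -l | tr -d ' '
}

# For each .elg file in a log directory, print "<count> <file>" when at least
# one line matches a crash-like pattern. The default pattern is an assumption
# for illustration; adjust it to whatever the logs actually contain.
scan_elg() {
    dir=$1
    pattern=${2:-'terminated|restart|core dump|signal'}
    for f in "$dir"/*.elg; do
        [ -f "$f" ] || continue
        n=$(grep -ciE "$pattern" "$f" || true)
        [ "$n" -gt 0 ] && echo "$n $f"
    done
    return 0
}

# Usage on a gateway (Expert mode):
#   count_dumps
#   scan_elg "$FWDIR/log"
#   scan_elg "$CPDIR/log"
```

A file with an unusually high match count, or one whose matches repeat every few seconds, would be the first place to look for a respawning daemon.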

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
rdegoix
Participant

Hey Timothy,

 

I appreciate your feedback on that 😉

Unfortunately, there are no files there (maybe due to the reboot?).

[Expert@PRO-FWVE-EXT-1:0]# pwd
/var/log/dump/usermode
[Expert@-PRO-FWVE-EXT-1:0]# ls -ls
total 0

 

I will take a deeper look at the .elg files in these directories, since there is a lot more data there. I hope I get lucky and understand what happened 😉

 

Thanks again!

 

Best regards,


Robin.

 

Andrea_Manrique
Participant

Hello!

I have R80.20 and my standby firewall is also at 100% CPU. Did you find a solution?

G_W_Albrecht
Legend

Tried a reboot yet?

CCSE CCTE CCSM SMB Specialist
Dale_Lobb
Advisor

We had a similar issue with a multi-core cluster under R80.20.  The standby nodes would look fine immediately after reboot, but would eventually go to 1 CPU used at 100% with no apparent top process.  In addition, there was a very slow memory leak when in this situation.  I had an open case with TAC, but we were unable to find a solution (TAC wanted to do a massive debug on all nodes, but management nixed the idea).  Eventually, we just resorted to rebooting the passive nodes once a month.
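
A slow leak like this is easier to confirm with a trend log than by watching top. The following is a minimal sketch, assuming a /proc/meminfo with MemTotal, MemFree, Buffers and Cached lines (there is no MemAvailable on a 2.6.18 kernel); mem_used_kb is a hypothetical helper name and the log path is only an example.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# used memory in kB, computed as MemTotal - MemFree - Buffers - Cached.
mem_used_kb() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }
        END { print total - free - buffers - cached }'
}

# Usage, for example from an hourly cron job on the standby member:
#   echo "$(date '+%F %T') $(mem_used_kb < /proc/meminfo)" >> /var/log/mem_trend.log
```

A value that climbs steadily between reboots while traffic stays flat would support the leak theory and gives TAC something concrete to look at.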

The issue stopped a couple of months ago.  On the date it went away, we did two things:  1) we updated the cluster nodes to R80.20 HFA take_141 and 2) we upgraded management from R80.20.M2 to R80.40 HFA take_48.  I don't know which one fixed the issue, but it is gone now.

 

Andrea_Manrique
Participant

Hello to all;

 

We have the same scenario. We rebooted the firewall, but that is not the solution. Our cluster is already on take_141, so that did not fix the problem. I will try upgrading the management server and let you know.

G_W_Albrecht
Legend

Installing a Jumbo on the SMS will usually not change memory leaks or strange CPU peaks 😉 I would rather go with R80.20 Jumbo Hotfix - Ongoing Take 173 (23 July 2020), at least for testing. Apart from this, a reboot once a month is a very healthy strategy to keep everything sound and safe; very high uptimes can literally destroy a unit...

CCSE CCTE CCSM SMB Specialist
