R80.30 Standby node 100% cpu

rdegoix · ‎2020-04-06

Hello everyone,

Once again, I'm asking for your help with some strange behavior 😉

I have a cluster running Gaia R80.30 (1 core), where my primary (active) node is fine, running at around 7-10% CPU (production traffic is OK and running well, no complaints from the customer 😄).

But my standby (passive) node is running at 100% CPU, and no specific process accounts for this high CPU usage...

So far I have compared the following things. FWVE-1 is the PASSIVE node and FWVE-2 is the ACTIVE one.

cpview (see screenshot as proof of the strange behavior, top_standby_fw.jpg)

Version (see screenshot version_compare_fw.jpg)

Product version Check Point Gaia R80.30
OS build 200
OS kernel version 2.6.18-92cpx86_64
OS edition 64-bit

cpinfo -y all (see screenshot, cpinfo y all compare fw.jpg)

top (see screenshot top_standby_fw.jpg)

cphaprob -a if and state (see screenshot)

Thanks in advance for your help with this, and please let me know if I can provide more information to help you investigate.

Best regards,

Robin.

rdegoix · ‎2020-04-06

After a reboot, it started to work again. We were not able to find the root cause, but it looks like it's solved!

Timothy_Hall · ‎2020-04-06

Wow that top output is strange, 78% CPU consumed in user/process space yet the top CPU-consuming process is only using 2.3%? It is a little unusual to only have one CPU, I assume CoreXL is disabled on both members? I'd theorize that the top output is bugged, but cpview is showing the same thing...

Only other thing I can think of is some kind of rapidly dying and respawning process eating CPU that is not showing up in top because it never lasts long enough, are there any process core dumps in /var/log/dump/usermode? Also check the $FWDIR/log/*.elg and $CPDIR/log/*.elg log files to see if there is any evidence of a rapidly dying and respawning process...
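
A rough sketch of those checks as shell helpers, in case it is useful. The function names (count_files, recent_elg, snapshot_pids, new_pids) are made up for illustration and are not Check Point tools; the only assumptions are a Linux /proc and a standard find. It does not parse the .elg contents, it only shows which ones were written to recently.

```shell
# Hypothetical helpers, not part of any Check Point tooling.

# Count regular files directly under a directory (e.g. core dumps).
# Prints 0 if the directory does not exist.
count_files() {
    if [ ! -d "$1" ]; then
        echo 0
        return 0
    fi
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# List *.elg files in directory $1 modified within the last $2 minutes.
# A daemon that keeps restarting will usually keep touching its log.
recent_elg() {
    [ -d "$1" ] || return 0
    find "$1" -maxdepth 1 -type f -name '*.elg' -mmin "-$2"
}

# Print the PIDs currently visible in /proc as a space-separated list.
snapshot_pids() {
    for d in /proc/[0-9]*; do
        [ -d "$d" ] || continue
        printf '%s ' "${d#/proc/}"
    done
}

# Print every PID that is in list $2 but not in list $1, one per line.
# Many new PIDs between two close snapshots suggests process churn
# that top is too slow to display.
new_pids() {
    for pid in $2; do
        case " $1 " in
            *" $pid "*) ;;
            *) echo "$pid" ;;
        esac
    done
}
```

For example, count_files /var/log/dump/usermode and recent_elg "$FWDIR/log" 10 cover the dump and log checks, and taking two snapshot_pids results a second apart and passing them to new_pids shows how many processes appeared in between. One or two new PIDs come from the sampling itself, so only a consistently large number would point to a respawning process.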


rdegoix · ‎2020-04-06

Hey Timothy,

appreciate your feedback on that 😉

Unfortunately, there are no files there (maybe due to the reboot?)

[Expert@PRO-FWVE-EXT-1:0]# pwd
/var/log/dump/usermode
[Expert@-PRO-FWVE-EXT-1:0]# ls -ls
total 0

I will take a deeper look at the .elg files in these directories, as there is a lot more data there. I hope I'll have some luck understanding what happened 😉

Thanks again !

Best regards,

Robin.

Andrea_Manrique · ‎2020-07-27

Hello!

I have R80.20 and my standby firewall is also at 100% CPU. Did you find a solution?

G_W_Albrecht · ‎2020-07-27

Tried a reboot yet ?


Dale_Lobb · ‎2020-07-27

We had a similar issue with a multi-core cluster under R80.20. The standby nodes would look fine immediately after reboot, but would eventually go to 1 CPU used at 100% with no apparent top process. In addition, there was a very slow memory leak when in this situation. I had an open case with TAC, but we were unable to find a solution (TAC wanted to do a massive debug on all nodes, but management nixed the idea). Eventually, we just resorted to rebooting the passive nodes once a month.

The issue stopped a couple of months ago. On the date it went away, we did two things: 1) we updated the cluster nodes to R80.20 HFA take_141, and 2) we upgraded management from R80.20.M2 to R80.40 HFA take_48. I don't know which one fixed the issue, but it is gone now.

Andrea_Manrique · ‎2020-07-27

Hello all,

We have the same scenario. We are rebooting the firewall, but that is not a solution. Our cluster is already on take_141, so that did not fix the problem. I will try upgrading the management server and let you know.

G_W_Albrecht · ‎2020-07-27

Installing a Jumbo on the SMS will usually not change memory leaks or strange CPU peaks on the gateways 😉 I would rather go with R80.20 Jumbo HotFix - Ongoing Take 173 (23 July 2020), at least for testing. Apart from this, a reboot once a month is a very healthy strategy to keep everything safe and sound. Very high uptimes can literally destroy a unit...
