Solved: R80.40 100% CPU

ggiordano · ‎2020-11-23

Hi mates

Has anyone run into performance issues after upgrading to R80.40?

At two of my customers I ran into the same performance issue when I upgraded the cluster from R80.xx to the latest R80.40 take.

In both cases I had to downgrade to the previous version, because these are critical environments where I cannot wait for a TAC investigation.

That is why I'm sharing my findings here.

In both cases the active cluster member suddenly shows several CPUs at 100% usage.

When it happens the gateway is unresponsive and the top output shows high CPU usage for the "watchdog" daemons.

After reverting to the previous version, performance on the gateway is as expected.

HeikoAnkenbrand · ‎2020-11-23

Hi @ggiordano,

Could you share the following information?
top (press 1)
fwaccel stats -s
fw ctl affinity -l
cpwd_admin list
more /var/log/messages | grep -B 2 -A 5 error
cpinfo -y all
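
If it helps, the commands above could be collected into one report file in a single pass. This is only a minimal sketch, assuming an sh-compatible shell in expert mode; `collect_diag` and the report path are hypothetical names, not Check Point tools, and top is run in batch mode because the interactive "press 1" view cannot be redirected to a file.

```shell
#!/bin/sh
# Hypothetical helper (not a Check Point tool): run each command line given
# as an argument and append its output to a single report file.
# Usage: collect_diag REPORT_FILE "command line" ["command line" ...]
collect_diag() {
    report="$1"
    shift
    : > "$report"                      # start with an empty report
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$report"
        # The first word of the command line is the binary to look up.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            sh -c "$cmd" >> "$report" 2>&1
        else
            # Keep going so one missing tool does not abort the collection.
            printf 'command not found on this host\n' >> "$report"
        fi
    done
}
```

A possible invocation with the commands from the list (the output path is an assumption):

collect_diag /var/tmp/r8040_diag.txt "top -b -n 1" "fwaccel stats -s" "fw ctl affinity -l" "cpwd_admin list" "grep -B 2 -A 5 error /var/log/messages" "cpinfo -y all"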

Open Server or appliance?

PS:
I am also running many ClusterXL clusters on R80.40 without problems.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips


JackPrendergast · ‎2020-11-23

Hi,

Firstly a few questions.

What hardware are you upgrading?

What version are you upgrading from?

Also, when you say ‘watchdog’ daemons, are you referring to any of the daemons monitored by watchdog? Or are you referring to ‘cpwd’ running at 100%?

Are there any other log files you collected that you could share?

Seems suspicious either way. I’ve upgraded countless clusters to R80.40 without a hitch.

ggiordano · ‎2020-11-24

Hi

The upgrade was performed from R80.10.

In one case the cluster is based on 15600 appliances, and in the other case on 5600 appliances.

In the top output, when the issue occurred, I saw two "watchdog" processes at 100%.

Unfortunately I didn't get any log files.

The messages log file showed errors about GNAT not being able to de-allocate resources. Those errors were mitigated by disabling the GNAT feature, but that didn't fix the CPU issue.

ggiordano · ‎2020-11-24

Unfortunately I cannot provide the output: I downgraded the cluster to R80.30 because the business impact was very high.

Daniel_ · ‎2021-03-04

I also sometimes see watchdog at 100% on R80.40 take 91.

It looks like we have a CIFS connection causing high load in IPS. But why is watchdog at 100%?

Timothy_Hall · ‎2021-03-04

Wow that is strange that the watchdog is eating CPU like that, and at real-time priority no less which will constantly kick other processes (like fw_worker_X) off the CPU and cause astronomical amounts of context switching thus degrading performance. The watchdog is a Gaia/Linux program (not Check Point product code) that ensures the system has not hung by running a series of sanity tests and writing to the /dev/watchdog file (called "kicking the watchdog") at least once a minute. If it fails to perform this write in a timely fashion the system is assumed to be hung, the watchdog barks then bites which forcibly reboots the system from the hardware level.
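
One way to check the real-time priority part of this on an affected box might be to list which processes run in a real-time scheduling class and how much CPU they take. This is just a rough sketch using standard Linux `ps` output columns (`cls` shows FF for SCHED_FIFO and RR for SCHED_RR); `filter_realtime` is a hypothetical helper name, and it assumes the "watchdog" entries seen in top also show up in `ps`.

```shell
#!/bin/sh
# Hypothetical helper: keep the header line plus every process whose
# scheduling class (2nd column) is real-time: FF (SCHED_FIFO) or RR (SCHED_RR).
# Expects input in the format of: ps -eo pid,cls,rtprio,pcpu,comm
filter_realtime() {
    awk 'NR == 1 || $2 == "FF" || $2 == "RR"'
}

# Example use on a live system:
#   ps -eo pid,cls,rtprio,pcpu,comm | filter_realtime
# Context switching can be watched alongside it (the "cs" column):
#   vmstat 1 5
```

If the watchdog entries appear here with a high %CPU while `vmstat` shows a very high `cs` value, that would be consistent with the picture described above, though it does not by itself explain the cause.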

What kind of hardware are you using? I can't see why watchdog would need so much CPU unless there is some kind of hardware issue. You should get TAC involved on this pronto, as the watchdog is most certainly NOT a process you want to have issues with: it can affect the stability of the system or even the ability to gracefully recover from a hard hang. If a system is hard hung and the watchdog does not bark then bite to reset it, the only recourse is physically pulling the power plug.

Perhaps the watchdog is chasing its tail after someone spiked its water dish with Red Bull or something. 🙂

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Daniel_ · ‎2021-03-05

Thanks for this answer.

It's an OpenServer ("IBM System x3650 M4: -[7915J6G]-").

TAC is already involved. I have sent TAC this discussion 😉

Timothy_Hall · ‎2021-03-05

I figured you must have been on open hardware; would be very surprised to see this kind of watchdog issue on a Check Point gateway appliance.

Daniel_ · ‎2021-03-05

The original post was from ggiordano. He has been using appliances.

Fameen · ‎2024-11-16

I am experiencing the same problem, with watchdog eating up 100% CPU, after upgrading my 23800 HA setup from R80.40 to R81.10.

The active member always has intermittent SIC connection problems to the Management Server. Once we fail over and the standby member becomes active, the same SIC issue begins to show up there.

We only push policy to the standby member on its own, then fail over and push the policy package to the other member as well.

It's painstaking.

_Val_ · ‎2024-11-18

@Fameen I suggest you start a new discussion about your issue. The version and symptoms are quite different from this post.
