ggiordano
Participant

R80.40 100% CPU

Hi mates

Has anyone run into performance issues after upgrading to R80.40?

At two of my customers I hit the same performance issue after upgrading the cluster from R80.xx to R80.40 (latest take).

In both cases I had to downgrade to the previous version, because these are critical environments where I could not wait for a TAC investigation.

That is why I'm sharing my findings here.

In both cases the active cluster member suddenly showed several CPUs at 100% usage.

When it happens, the gateway becomes unresponsive and the top output shows high CPU usage for the "watchdog" daemons.

After reverting to the previous version, gateway performance is back to what is expected.

 

JackPrendergast
Collaborator

Hi,

 

First, a few questions.

What hardware are you upgrading?

What version are you upgrading from?

Also, when you say ‘watchdog’ daemons, are you referring to any of the daemons monitored by the watchdog, or to ‘cpwd’ itself running at 100%?

Do you have any other log files you could share?

 

Seems suspicious either way. I’ve upgraded countless clusters to R80.40 without a hitch.

ggiordano
Participant

Hi

The upgrade was performed from R80.10.

In one case the cluster is based on 15600 appliances, and in the other on 5600 appliances.

In the top output, when the issue occurred, I saw 2 "watchdog" processes at 100%.

Unfortunately I didn't get any log files.

The messages log file showed errors about GNAT not being able to de-allocate resources. Those errors were mitigated by disabling the GNAT feature, but that didn't fix the performance issue.

HeikoAnkenbrand
Champion

Hi @ggiordano,

Could you share the output of the following commands?
top (press 1)
fwaccel stats -s
fw ctl affinity -l
cpwd_admin list
more /var/log/messages | grep -B 2 -A 5 error
cpinfo -y all
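
A minimal sketch for capturing the outputs listed above into a single file, for example right before a rollback. The helper names run_section and collect_diagnostics and the output path in the usage note are made up for illustration. top is run in batch mode (-b -n 1), so the per-CPU view you get by pressing 1 interactively would still have to be captured separately.

```shell
#!/bin/sh
# Sketch only: gathers the requested diagnostics into one text file.
# run_section / collect_diagnostics are hypothetical helper names.

run_section() {
    # Usage: run_section <outfile> <command> [args...]
    _rs_out="$1"
    shift
    echo "===== $* =====" >> "$_rs_out"
    if command -v "$1" >/dev/null 2>&1; then
        # Append stdout and stderr so errors are kept next to the output.
        "$@" >> "$_rs_out" 2>&1
    else
        echo "(command not found: $1)" >> "$_rs_out"
    fi
    echo >> "$_rs_out"
}

collect_diagnostics() {
    # Usage: collect_diagnostics <outfile>
    _cd_out="$1"
    : > "$_cd_out"                                   # start with an empty file
    run_section "$_cd_out" top -b -n 1               # one batch-mode snapshot
    run_section "$_cd_out" fwaccel stats -s
    run_section "$_cd_out" fw ctl affinity -l
    run_section "$_cd_out" cpwd_admin list
    run_section "$_cd_out" grep -B 2 -A 5 error /var/log/messages
    run_section "$_cd_out" cpinfo -y all
}
```

Usage, assuming expert mode on the gateway: `collect_diagnostics /var/tmp/diag.txt`. Commands that are not installed are recorded as missing instead of aborting the run.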

Open Server or appliance?

PS:
I am also running many ClusterXL clusters on R80.40 without problems.

ggiordano
Participant

Unfortunately I cannot provide the output: I downgraded the cluster to R80.30 because the business impact was very high.

Daniel_
Collaborator

I also sometimes see watchdog at 100% on R80.40 take 91.

watchdog.png

It looks like we have a CIFS connection causing high load in IPS. But why is watchdog at 100%?

Timothy_Hall
Champion

Wow, it is strange that the watchdog is eating CPU like that, and at real-time priority no less, which will constantly kick other processes (like fw_worker_X) off the CPU and cause astronomical amounts of context switching, thus degrading performance. The watchdog is a Gaia/Linux program (not Check Point product code) that ensures the system has not hung by running a series of sanity tests and writing to the /dev/watchdog file (called "kicking the watchdog") at least once a minute. If it fails to perform this write in a timely fashion, the system is assumed to be hung: the watchdog barks, then bites, which forcibly reboots the system at the hardware level.
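
To make the "kicking" idea concrete, here is a rough sketch of such a loop. It is NOT the actual Gaia/Linux watchdog daemon: the sanity check, the interval and the function names are invented for illustration, and WATCHDOG_DEV defaults to a scratch file so the sketch is safe to run (on a real system, opening /dev/watchdog arms the timer).

```shell
#!/bin/sh
# Illustrative watchdog "kick" loop; all names and values are hypothetical.
WATCHDOG_DEV="${WATCHDOG_DEV:-/tmp/fake_watchdog}"   # scratch file, not the real device
KICK_INTERVAL="${KICK_INTERVAL:-10}"                 # seconds between kicks (assumed)

sanity_checks() {
    # Stand-in health test: can we still create and remove a temp file?
    _probe="${TMPDIR:-/tmp}/.wd_probe.$$"
    : > "$_probe" 2>/dev/null && rm -f "$_probe"
}

kick_watchdog() {
    # Each write counts as one kick and restarts the countdown.
    printf '.' >> "$WATCHDOG_DEV"
}

watchdog_loop() {
    # $1 = number of iterations (a real daemon would loop forever)
    _n="$1"
    while [ "$_n" -gt 0 ]; do
        if sanity_checks; then
            kick_watchdog
        fi
        # On a failed check no kick is written; with real hardware the
        # timer would then expire and reset the machine.
        _n=$((_n - 1))
        if [ "$_n" -gt 0 ]; then
            sleep "$KICK_INTERVAL"
        fi
    done
}
```

A loop like this spends nearly all of its time sleeping, which is why a watchdog pinned at 100% CPU looks so out of place.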

What kind of hardware are you using? I can't see why the watchdog would need so much CPU unless there is some kind of hardware issue. You should get TAC involved on this pronto, as the watchdog is most certainly NOT a process you want to have issues with: it can affect the stability of the system or even its ability to gracefully recover from a hard hang. If a system is hard hung and the watchdog does not bark then bite to reset it, the only recourse is physically pulling the power plug.
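
Something worth attaching to the TAC case: the scheduling class, real-time priority and context-switch counters of the busy process. A small sketch, assuming the procps version of ps and a Linux /proc filesystem; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: inspect scheduling class / RT priority and context-switch counters.
# show_sched and ctxt_switches are hypothetical helper names.

show_sched() {
    # $1 = pattern matched against the command name (e.g. watchdog).
    # CLS column: TS = normal time-sharing, FF/RR = real-time policies.
    ps -eo pid,cls,rtprio,pcpu,comm | awk -v pat="$1" 'NR == 1 || $5 ~ pat'
}

ctxt_switches() {
    # $1 = PID; $2 = optional status file (defaults to /proc/<PID>/status).
    # Prints name=value for the voluntary and nonvoluntary counters.
    _status="${2:-/proc/$1/status}"
    awk -F':[ \t]*' '/^(voluntary|nonvoluntary)_ctxt_switches/ { print $1 "=" $2 }' "$_status"
}
```

Usage: `show_sched watchdog` to list the matching processes, then `ctxt_switches <PID>` twice a few seconds apart; the difference between the two readings gives a rough context-switch rate for that process.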

Perhaps the watchdog is chasing its tail after someone spiked its water dish with Red Bull or something.  🙂

"Max Capture: Know Your Packets" Video Series
now available at http://www.maxpowerfirewalls.com
Daniel_
Collaborator

Thanks for this answer. 

It's an OpenServer ("IBM System x3650 M4: -[7915J6G]-").

TAC is already involved. I have sent TAC this discussion 😉

Timothy_Hall
Champion

I figured you must have been on open hardware; would be very surprised to see this kind of watchdog issue on a Check Point gateway appliance.
Daniel_
Collaborator

The original post was from ggiordano. He has been using appliances.
