High CPU Spikes on 5100 Cluster

joeborg · ‎2023-02-18

Hi,

We've been having trouble with high CPU usage spikes on a 5100 cluster at one of our offices, on and off for a few months. These issues typically happen at the weekend (this office is mostly used at weekends) and the events generally last only a few minutes.

When such issues occur, we typically notice the following:

- High CPU can be observed on CPView (see attached screenshot) on both cores.

- Generally network protocols seem to be affected - for example ISP redundancy is disturbed (to the extent that we've had to disable this as it was causing a snowball effect of issues) and BGP sessions drop.

- There is nothing in the logs leading up to the event that would indicate any problem. I've checked /var/log/messages, dmesg, routed.log and routed_messages (the latter shows the dropped BGP sessions and ISP redundancy flaps but these are an effect of the high CPU, not a cause).

- Whilst the issues generally happen on weekend afternoons, there is no exact/repeatable time at which they occur (which means we cannot link what's happening to any specific process kicking off).

- Leading up to and after such events, the CPU generally sits somewhere between 40% and 60%, so there's no indication of any impending issue.

Would you be able to help me troubleshoot this further as I'm at a bit of a loss as to what I could look at next?

Thanks,

Joe

_Val_ · ‎2023-02-18

What about the regular FW logs? The first thing to look for is an abnormal number of drop logs just before and during the issue.

Chris_Atkinson · ‎2023-02-18

Which version & JHF is this gateway deployed with?

If BFD is enabled for the BGP session, is it configured in the PriorityQ settings per sk105762?

Additionally it may also be worthwhile reviewing the S7 commands output per:

https://community.checkpoint.com/t5/Scripts/S7PAC-Super-Seven-Performance-Assessment-Commands/td-p/4...

CCSM R77/R80/ELITE

joeborg · ‎2023-02-20

Hi,

Thank you both for the prompt replies and for the insightful questions. To answer them:

1. Regular FW logs don't seem to show anything abnormal. We've combed through them a few times but, other than drops which we see during normal operation, there's nothing that caught our eye (e.g. nothing that would suggest a DoS attack or anything of the sort).

2. Version R81.10:

Product version Check Point Gaia R81.10 Take 78
OS build 335
OS kernel version 3.10.0-957.21.3cpx86_64
OS edition 64-bit

3. We're not running BFD.

Thanks again for your assistance on this.

G_W_Albrecht · ‎2023-02-20

Look into the spike logs: sk166454: CPU Spike Detective

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

joeborg · ‎2023-02-20

Nothing in there corresponding to these two events :-(. The last log is from the 3rd of February. These events happened on the 18th.

Chris_Atkinson · ‎2023-02-20

How many routes do you see via BGP: a handful, thousands, or more?

joeborg · ‎2023-02-20

83 BGP routes installed in the route table. Around 472 in total received from all BGP peers.

Chris_Atkinson · ‎2023-02-20

At this stage I can only recommend the following further actions.

1. Update to JHF T87 which will address / eliminate the following (listed resolved from T82):

PRJ-41504,PMTR-75250

Routing - Some invalid nexthop and destination addresses from remote BGP peers may be incorrectly handled, causing lost BGP connection.

2. If the problem persists, investigate further with TAC.

3. Provide the S7 output mentioned in a previous post above.

joeborg · ‎2023-02-20

Hi,

Thanks for this. S7 output attached, though this was taken today so I'm not sure it's still relevant. Unfortunately these incidents generally only last a few minutes and have usually resolved themselves by the time I'm called and log in, so it's near impossible for me to run this command whilst the issue is underway.

Noted re BGP - I doubt it as most routes are our internal routes over MPLS and don't change often. Moreover, other sites receive them too without issues.
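
Since the incidents are over before anyone can log in, one option is to have the box capture its own diagnostics when CPU crosses a threshold. Below is a minimal sketch of that idea, assuming Python 3 is available on the gateway (not confirmed in this thread). The threshold, polling interval, output path and the COMMANDS list are all hypothetical choices for illustration. Only generic Linux tools (top, ps) are used, and any further diagnostics suggested in this thread would need to be added by hand.

```python
import subprocess
import time
from datetime import datetime

# Hypothetical list of diagnostics to run when a spike is seen.
# Only generic Linux tools are listed; append others as needed.
COMMANDS = [
    ["top", "-b", "-n", "1"],
    ["ps", "-eo", "pid,pcpu,pmem,comm", "--sort=-pcpu"],
]


def read_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            # user nice system idle iowait irq softirq steal
            fields = [int(x) for x in line.split()[1:9]]
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            total = sum(fields)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def busy_percent(prev, cur):
    """CPU busy percentage between two (busy, total) samples."""
    d_busy = cur[0] - prev[0]
    d_total = cur[1] - prev[1]
    return 100.0 * d_busy / d_total if d_total > 0 else 0.0


def capture(commands, out_path):
    """Append the output of each command to out_path with a timestamp header."""
    with open(out_path, "a") as out:
        for cmd in commands:
            out.write("\n### {} @ {}\n".format(" ".join(cmd), datetime.now().isoformat()))
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    universal_newlines=True,
                    timeout=30,
                )
                out.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                out.write("failed: {}\n".format(exc))


def watch(threshold=90.0, interval=5, out_path="/var/log/cpu_spike_capture.log"):
    """Poll /proc/stat forever and capture diagnostics while CPU >= threshold."""
    with open("/proc/stat") as f:
        prev = read_cpu_times(f.read())
    while True:
        time.sleep(interval)
        with open("/proc/stat") as f:
            cur = read_cpu_times(f.read())
        if busy_percent(prev, cur) >= threshold:
            capture(COMMANDS, out_path)
        prev = cur


# To start monitoring, call watch() from a background session.
```

Note that the watcher itself costs some CPU, and on a box that is already saturated the captures may be slow to complete, so a fairly long polling interval and a short command list are probably the safer choice.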

Timothy_Hall · ‎2023-02-20

You've got some slight overruns on your NICs that are a very low percentage of overall traffic, but other than that your box appears to be well-tuned; it is simply not powerful enough to do all that you are asking of it. You only have 2 cores, so both cores are pulling double duty in a 2/2 split. When the CPUs get saturated, BGP will destabilize as there simply aren't enough processing resources to go around.

The Spike Detective is not logging anything as the CPUs are generally so busy that there is no single outlier that is consuming an inordinate amount of CPU compared to everything else. Try running fw ctl multik print_heavy_conn to see if there are any detected elephant flows in the last 24 hours, but it will almost certainly have the same issue as the Spike Detective and not show anything. In my opinion there is not much you can do other than get a more powerful box, as the 2-core Celeron G1820 in your 5100 isn't cutting it and you are getting all you will get out of that box.

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization

joeborg · ‎2023-02-21

Hi Timothy,

Many thanks for taking the time to look into it, I very much appreciate it. I'm also in agreement with you that the box is underdimensioned at this stage and probably needs replacement with something more powerful.

In the meantime, we've enabled Fast Acceleration on some flows as per https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut.... We're getting around 30% traffic being fast accelerated during peak and this has lowered CPU by some 20% or so; I'm hoping this buys us some respite until we figure out a replacement plan.

All this being said, the only nagging doubt I still have is due to the sequence of events that occurs. When this issue happens, CPU goes from 50-60% (not low but hardly alarming) to being maxed out for a few minutes and then things go back down to 50-60%. Had the cause been solely load, I would have expected far more linear behaviour in CPU usage (e.g. 60 -> 70 -> 80 -> 90 over our peak usage hours). In my ignorance, what we're experiencing is more indicative of some event suddenly maxing out the CPU. The problem I have is that I've got no clue what this event might be...
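
The distinction drawn above between a gradual ramp and a sudden jump can be checked against recorded CPU samples. The following is a rough sketch only: the function names and the 25-point step threshold are hypothetical, and the sample series are made-up illustrations of the two patterns described in this post, not measurements from the cluster.

```python
def largest_jump(samples):
    """Return (index, delta) of the biggest rise between consecutive samples."""
    best_index, best_delta = 0, 0.0
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta > best_delta:
            best_index, best_delta = i, delta
    return best_index, best_delta


def classify_rise(samples, step_threshold=25.0):
    """Label a series of CPU percentages as 'step', 'ramp' or 'flat'.

    'step': a single interval rises by at least step_threshold points.
    'ramp': no single interval does, but the overall rise from the
            first sample to the peak is at least step_threshold.
    'flat': neither of the above.
    """
    if len(samples) < 2:
        return "flat"
    _, delta = largest_jump(samples)
    if delta >= step_threshold:
        return "step"
    if max(samples) - samples[0] >= step_threshold:
        return "ramp"
    return "flat"


# Illustrative series only (not real data from the gateway).
gradual = [60, 70, 80, 90]          # load building up over peak hours
sudden = [55, 58, 100, 100, 57]     # baseline, maxed out, back to baseline

print(classify_rise(gradual))  # ramp
print(classify_rise(sudden))   # step
```

The result depends heavily on the sampling interval: with coarse samples, a fast ramp and a true step look the same, so this is only useful on data collected every few seconds.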

Alex- · ‎2023-02-21

Watch out for policy installation tasks. I have a cluster of 5200s running R81.10 Take 87 that have the same CPU as the 5100, and even an accelerated policy installation with no changes creates a surge of 30% in CPU usage. I've even seen non-accelerated policy installation on that cluster increase CPU by 40%.

That cluster sees very low usage with CPU at 5% most of the time so it decreases rather quickly after policy installation, but in your active setup such an increase might have a more lasting impact.

joeborg · ‎2023-02-21

Hi Alex,

Many thanks for this. The issues happen over the weekend as this particular office works at the weekend. During the weekend we don't install any policies. Nonetheless, I've gone back and checked and can confirm there is no policy installation going on during the time frame.

I'm with you that these processes tend to cause a spike in CPU usage; the software update check process is another such one.

This being said, during these events, the logs show no such process coinciding.
