Hi all,
We have automated alerts set up on our SNMP monitoring platform, so that if one of our Check Point gateways exceeds 80% CPU utilization for 10 minutes or longer, we receive an alert.
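(For reference, the poll itself is nothing exotic; I believe it's along the lines of the below, using the standard HOST-RESOURCES per-core OID, though your platform may poll a vendor MIB instead. Community string and hostname are placeholders.)

# Per-core load via the standard HOST-RESOURCES MIB (hrProcessorLoad)
snmpwalk -v2c -c <community> <gateway> 1.3.6.1.2.1.25.3.3.1.2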
The alerts trigger quite frequently, but inconsistently: a particular gateway might trigger for several days in a row, then go quiet. There isn't really any pattern.
I've tried to look into the alerts to understand whether this is "normal" operation or something that needs further investigation.
My method has been to use cpview -t to check the historical utilization; this shows some information such as the CPU consumer type (e.g. CoreXL_FW).
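Alongside that, these are the commands I've been running to get a picture of where the load sits (all standard Gaia/Check Point tools; output will obviously vary by version and configuration):

# Historical CPView, as above
cpview -t
# CoreXL worker instances and their core assignment
fw ctl multik stat
# How much traffic is accelerated vs. handled in the firewall path
fwaccel stats -s
# Core affinity of interfaces, SNDs and firewall workers
fw ctl affinity -l -a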
I also check the /var/log/spike_detective logs, but I find the process information doesn't mean much to me, e.g.:
spike info: type: thread, thread id: 86227, thread name: fwk3_3, start time: 21/02/23 05:11:54, spike duration (sec): 29, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
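I gather the fwkN_N threads are CoreXL firewall worker instances, but that alone doesn't tell me the cause. The best I've managed is to aggregate the entries to see which threads spike most often and for how long; a rough one-liner (log path as it appears on our gateways, adjust to suit):

# Count spikes and total spike duration per thread name
grep 'type: thread' /var/log/spike_detective/spike_detective.log \
  | sed -n 's/.*thread name: \([^,]*\),.*spike duration (sec): \([0-9]*\),.*/\1 \2/p' \
  | awk '{count[$1]++; dur[$1]+=$2} END {for (n in count) printf "%-16s %5d spikes %6d sec\n", n, count[n], dur[n]}'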
I wondered whether others have alternative methods for investigating high CPU utilization to understand the cause? Or is it quite normal to see frequent spikes and periods of high CPU during ordinary operation? If so, perhaps we just need to tweak our alert threshold.
Thanks in advance, and I appreciate the answer will vary depending on factors like throughput, active blades, gateway model, etc.