Hello community,
We moved to R80 in February, and since then we occasionally receive alert mails from our SMS (R80.30) saying that it lost the connection to the active gateway of our Check Point cluster (R80.10).
My investigation showed a high CPU load (100%) on all cores, caused by heavy load on the fw_worker processes during these periods.
The issue affected every part of the network that is routed through the firewall. We saw increased latency, and our SMS could not get data from the firewall node. That is why there is a break in the SmartView Monitor graph and why I had to investigate directly on the affected node.
Using the CPVIEW history, I noticed an increased number of inbound packets/sec on our external interface during the affected period. Furthermore, I saw a rise in the packets/sec handled by the slow path (FW). The number of inbound packets on the external interface and the number of packets handled by the slow path are quite close.
I created a CPInfo file with an export of the CPVIEW history for visualization and compared the graphs of fw_inbound packets and system_performance; they correlate.
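To put a number on "they correlate", the two exported series could be compared with a Pearson correlation coefficient. This is only a minimal sketch: the sample values below are placeholders I made up for illustration, not measurements from our gateway, and the variable names are hypothetical. It assumes both series were sampled at the same timestamps.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Placeholder samples for illustration only, NOT real data from the CPVIEW export.
fw_inbound_pps = [9000, 9500, 30000, 31000, 9800]   # slow-path inbound packets/sec
cpu_used_pct = [45, 44, 89, 90, 46]                 # CPU usage of one fw_worker core

r = pearson(fw_inbound_pps, cpu_used_pct)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 would support the visual impression from the graphs; it still would not prove that the slow-path traffic is the cause of the load.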
That leads me to the conclusion that the packets were inspected by the default inspection, but I'm not able to find any information about inspected packets in the logs. Although the dynamic dispatcher is active, no top connections are listed under CPVIEW.CPU.Top-Connections.
I'm not sure how to find the connections that were responsible for this behaviour.
Here are some CPVIEW values from the second before and the second during the high CPU load:
CPVIEW.Overview 25Mar2020 14:17:46
|--------------------------------------------------------|
| Num of CPUs: 6                                         |
| CPU   Used                                             |
| 2     45%                                              |
| 1     44%                                              |
| 3     40%                                              |
|--------------------------------------------------------|
| CPU:                                                   |
| CPU   User    System  Idle    I/O wait  Interrupts     |
| 0     0%      25%     75%     0%        41,234         |
| 1     12%     32%     56%     0%        41,234         |
| 2     17%     28%     55%     0%        41,234         |
| 3     13%     27%     60%     0%        41,234         |
| 4     0%      1%      99%     0%        41,234         |
| 5     0%      0%      100%    0%        41,234         |
|--------------------------------------------------------|
| Traffic Rate:                                          |
|                         Total     FW        PXL       SecureXL   |
| Inbound packets/sec     155K      9,255     1,432     145K       |
| Outbound packets/sec    156K      9,805     1,432     145K       |
| Inbound bits/sec        958M      6,380K    10,354K   941M       |
| Outbound bits/sec       1,002M    33,917K   10,537K   958M       |
CPVIEW.Overview 25Mar2020 14:17:47
|--------------------------------------------------------|
| Num of CPUs: 6                                         |
| CPU   Used                                             |
| 1     89%                                              |
| 2     89%                                              |
| 3     89%                                              |
|--------------------------------------------------------|
| CPU:                                                   |
| CPU   User    System  Idle    I/O wait  Interrupts     |
| 0     0%      29%     71%     0%        39,075         |
| 1     22%     67%     10%     0%        39,075         |
| 2     4%      85%     11%     0%        39,075         |
| 3     3%      86%     11%     0%        39,075         |
| 4     0%      0%      100%    0%        39,075         |
| 5     0%      0%      100%    0%        39,075         |
|--------------------------------------------------------|
| Traffic Rate:                                          |
|                         Total     FW        PXL       SecureXL   |
| Inbound packets/sec     182K      30,459    1,032     150K       |
| Outbound packets/sec    157K      5,877     1,032     150K       |
| Inbound bits/sec        1,010M    12,437K   7,394K    991M       |
| Outbound bits/sec       1,040M    24,715K   7,526K    1,008M     |
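To make the change between the two snapshots easier to see, here is a small sketch that computes the slow-path (FW) share of the inbound packets/sec before and during the spike. The numbers are taken from the two CPVIEW.Overview outputs above; the totals are shown by CPView rounded to thousands ("155K", "182K"), so the results are only approximate. The dictionary layout and function name are my own choice for illustration.

```python
# Inbound packets/sec from the two CPVIEW.Overview snapshots in this post.
# "total" is rounded to thousands by CPView, so all ratios are approximate.
before = {"total": 155_000, "fw": 9_255, "pxl": 1_432, "securexl": 145_000}
during = {"total": 182_000, "fw": 30_459, "pxl": 1_032, "securexl": 150_000}


def slow_path_share(snapshot):
    """Fraction of inbound packets/sec handled by the slow path (FW)."""
    return snapshot["fw"] / snapshot["total"]


share_before = slow_path_share(before)
share_during = slow_path_share(during)
fw_delta = during["fw"] - before["fw"]            # additional slow-path packets/sec
total_delta = during["total"] - before["total"]   # additional inbound packets/sec
fw_growth = during["fw"] / before["fw"]           # growth factor of slow-path packets/sec

print(f"slow-path share before: {share_before:.1%}")
print(f"slow-path share during: {share_during:.1%}")
print(f"slow-path growth factor: {fw_growth:.2f}x")
print(f"extra inbound pps: {total_delta}, of which slow path: {fw_delta}")
```

With these values, the slow-path share of inbound packets rises from about 6% to about 17%, and roughly 21K of the roughly 27K additional inbound packets/sec went through the slow path, while the SecureXL-accelerated rate barely changed.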
I have also attached graphs of the CPU load (one core), fw_inbound, and RX on the external interface.
I did a manual failover to see whether the CPU load is an issue of just one node. You can see it in the graph.
The load suddenly dropped at around 13:15 on 26 March.
I hope you have ideas for further investigation or for preventing this. I thought about creating a custom inspection profile and setting most of the actions to inactive.
Thanks in advance and best regards. Stay healthy
Martin Reppich
System Administrator
Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum GFZ