Hello,
I'd appreciate any tips on identifying the traffic that causes our cluster to fail over for a few minutes.
We use R81.20 with the latest Take (105) on open servers. Every night, between 3 am and 4 am, CPU usage is high for a few minutes, which triggers fail-over to the standby node and back again.
It looks like some sort of scheduled backup, but I can't figure out the IP addresses involved.
I get the following in /var/log/messages:
Jul 24 03:46:10 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (82%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:46:41 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 03:46:43 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (85%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:46:55 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 03:47:00 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:47:19 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 19 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 03:47:40 2025 GW-N02 kernel:[fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
Jul 24 03:47:41 2025 GW-N02 kernel:[fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Jul 24 03:47:49 2025 GW-N02 kernel:[fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (81%) on the remote member 2 increased above the configured threshold (80%).
Jul 24 03:48:01 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 04:04:54 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (85%) on the local member increased above the configured threshold (80%).
Jul 24 04:05:58 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 31 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
Jul 24 04:06:02 2025 GW-N02 kernel:[fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (92%) on the local member increased above the configured threshold (80%).
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Jul 24 04:06:20 2025 GW-N02 kernel:[fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Jul 24 04:06:36 2025 GW-N02 kernel:[fw4_1];CLUS-120202-1: Stopping CUL mode after 18 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
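To compare several nights, the "Starting CUL mode" lines above can be summarized per hour and per member. The following is only a rough Python sketch: the regular expression is derived from the exact message format pasted above and may need adjusting for other builds, and the names `cul_starts` and `summarize` are made up for illustration. It can be run on a copy of /var/log/messages on any machine with Python 3.

```python
import re

# Pattern built from the "Starting CUL mode" lines pasted above (assumed format).
CUL_START = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4}) (?P<host>\S+) kernel:.*?"
    r"CLUS-\d+-\d+: Starting CUL mode because (?P<cpu>CPU\S*) usage "
    r"\((?P<pct>\d+)%\) on the (?P<member>local member|remote member \d+)"
)


def cul_starts(lines):
    """Yield (timestamp, member, cpu_label, percent) for each CUL start line."""
    for line in lines:
        m = CUL_START.search(line)
        if m:
            yield m.group("ts"), m.group("member"), m.group("cpu"), int(m.group("pct"))


def summarize(lines):
    """Return {(hour, member, cpu_label): (event_count, peak_percent)}."""
    summary = {}
    for ts, member, cpu, pct in cul_starts(lines):
        hour = ts.split()[2][:2]  # "03" from "Jul 24 03:46:10 2025"
        key = (hour, member, cpu)
        count, peak = summary.get(key, (0, 0))
        summary[key] = (count + 1, max(peak, pct))
    return summary


def report(path):
    """Print one summary row per (hour, member, CPU) found in the given log file."""
    with open(path, errors="replace") as fh:
        for (hour, member, cpu), (count, peak) in sorted(summarize(fh).items()):
            print(f"{hour}:00  {member:<16} {cpu:<7} events={count} peak={peak}%")
```

Calling `report("/var/log/messages")` prints one row per hour, member and CPU label, which makes it easier to see whether the spikes start on the remote member or on a specific local core such as CPU-02.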
Replaying the history with 'cpview -t' doesn't help much.
At 03:46-03:47 no high CPU is recorded (I think the spikes were too short to show up at the 1-minute interval used by 'cpview -t'):
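One way I could imagine catching spikes shorter than that interval is to sample /proc/stat every second during the 3 am to 4 am window. The sketch below is an assumption-laden example, not a Check Point tool: it relies on the standard Linux /proc/stat layout, assumes a Python 3 interpreter is available on the gateway, and the function names and the 80% default (taken from the CUL threshold in the log) are my own choices.

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} from /proc/stat content."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        ticks = [int(v) for v in fields[1:9]]  # user nice system idle iowait irq softirq steal
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        stats[fields[0]] = (total - idle, total)
    return stats


def busy_percent(before, after):
    """Per-core busy percentage between two parse_proc_stat() snapshots."""
    result = {}
    for core, (busy1, total1) in after.items():
        if core not in before:
            continue
        busy0, total0 = before[core]
        elapsed = total1 - total0
        result[core] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def watch(interval=1.0, threshold=80.0):
    """Loop forever, printing a timestamped line when any core exceeds threshold."""
    with open("/proc/stat") as fh:
        prev = parse_proc_stat(fh.read())
    while True:
        time.sleep(interval)
        with open("/proc/stat") as fh:
            cur = parse_proc_stat(fh.read())
        hot = {c: round(p) for c, p in busy_percent(prev, cur).items() if p >= threshold}
        if hot:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), hot, flush=True)
        prev = cur
```

Running `watch()` in the background with its output redirected to a file would give per-second timestamps that can be lined up against the CUL messages.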

At 04:06 I do see high load, which returns to normal 2 minutes later:



I was hoping to get some stats from CPU > Top-Connections, but nothing was logged:

Thank you in advance.