Solved: FW-1: [cul_load_freeze][CUL - Cluster] CUL should ...

Pedro_Amaral · ‎2018-08-17

Hi Mates,

All of the suden, like two days ago, our R80.10 Cluster with two 13500 appliances start to show these messages along the day. Until now we are not observing any effects on live traffic.

We are following related KB articles to check for performance issues and until now doesn't seem we have any constrains due to the traffic load on the gateways, which by the way, is pretty much more or less the same as before.

Despite the ongoing performance troubleshoot going on, our first question is, since we are not observing pratical impacts in the live traffic, is this behaviour relevant in clusters on Load-Sharing mode?

We found a couple of KBs about cluster flapping issues but they are only applicable on HA mode.

PS: I will post more details with our findings about the performance of the gateways in course due.

Many Thanks,

Pedro_Amaral · ‎2018-08-17

Hi Timothy,

Thanks for your help.

Dynamic Dispatching is indeed enabled in both gateways on the cluster.

fw ctl multik dynamic_dispatching get_mode
Current mode is On

We have CPU usage under close monitoring but since I've post my initial question on this thread, the issue didn't happen since.

Many Thanks to all!

View solution in original post

_Val_ · ‎2018-08-17

Some CPU core is above 80% utilization, and ClusterXL is trying to avoid a failover situation because of heavy load. For better understanding, could you please post the full message? It should mention a specific core and some other details.

_Val_ · ‎2018-08-17

To explain the CUL mechanism better, with both HA and Load Sharing ClusterXL monitors interface status and some other parameter of cluster with CCP messages. Multiple CCP messages are sent every second, and under heavy load some of them could be delayed in sending or receiving, if CPU is too busy with something else.

Usually, if neighbor cluster member does not sent messages anymore, it is considered nonoperational, and failover happens. CUL fail-safe is to prevent the situation of unnecessary failover in the case of heavy CPU utilization. It freezes failover mechanism for a moment of very heavy CPU utilization, to avoid cluster instability.

In your case one of the CPUs goes over 80% triggering CUL. In could be that it only stays over 80% for a short time while other core are practically idle, so you may miss such events when monitoring performance with standard tools.

Pedro_Amaral · ‎2018-08-17

Hi Valeri,

Thanks for your reply.

Please see below a complete chunk of those messages:

Aug 17 09:04:52 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL_CLUSTER is FREEZE_ON and should stay ON, updating time, kernel CPU usage [83%]
Aug 17 09:04:52 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:04:54 2018 XXXXXXXXXXXXX last message repeated 2 times
Aug 17 09:04:55 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL_CLUSTER is FREEZE_ON and should stay ON, updating time, kernel CPU usage [82%]
Aug 17 09:04:55 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:04:56 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 1 sec
onds ago
Aug 17 09:04:57 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:04:58 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:04:59 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 1 sec
onds ago
Aug 17 09:05:00 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:05:01 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:05:02 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 1 sec
onds ago
Aug 17 09:05:03 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:05:04 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:05:05 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 1 sec
onds ago
Aug 17 09:05:06 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 2 sec
onds ago
Aug 17 09:05:07 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 0 sec
onds ago
Aug 17 09:05:08 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 1 sec
onds ago
Aug 17 09:05:09 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 2 sec
onds ago
Aug 17 09:05:10 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 3 sec
onds ago
Aug 17 09:05:11 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 4 sec
onds ago
Aug 17 09:05:12 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 5 sec
onds ago
Aug 17 09:05:13 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 6 sec
onds ago
Aug 17 09:05:14 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 7 sec
onds ago
Aug 17 09:05:15 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 8 sec
onds ago
Aug 17 09:05:16 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF (short timeout of 10 seconds expired) but at least one member reported high CPU usage 9 sec
onds ago
Aug 17 09:05:16 2018 XXXXXXXXXXXXX kernel: [fw4_1];FW-1: [cul_load_freeze][CUL - Cluster] Changing CUL state to OFF, no Member reported high CPU usage for the past 10 seconds, CUL was ON for 71 secon
ds

---------------------------------------------------------------###############--------------------------------------------------------------------

From our investigations from cpview history we found that probably is core #15 reaching over 80%, this core is assigned to Firewall Instance 0

fw ctl affinity -l
Mgmt: CPU 1
Sync: CPU 2
eth1-01: CPU 0
eth2-04: CPU 1
eth3-01: CPU 3
eth3-02: CPU 2
eth3-04: CPU 3
Kernel fw_0: CPU 15
Kernel fw_1: CPU 14
Kernel fw_2: CPU 13
Kernel fw_3: CPU 12
Kernel fw_4: CPU 11
Kernel fw_5: CPU 10
Kernel fw_6: CPU 9
Kernel fw_7: CPU 8
Kernel fw_8: CPU 7
Kernel fw_9: CPU 6
Kernel fw_10: CPU 5
Kernel fw_11: CPU 4
Daemon fwd: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Daemon mpdaemon: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Daemon lpd: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Daemon pepd: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Daemon pdpd: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Daemon cpd: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Daemon cprid: CPU 4 5 6 7 8 9 10 11 12 13 14 15
Interface eth2-01: has multi queue enabled
Interface eth2-02: has multi queue enabled

The problem here is that from 2 days ago until now we don't find anything in our env. that can justify the appearence of these spikes.

We have now a close monitoring on all cores utilization to check if the occurence of these messages are correlated with real core usage, since we had start this close monitoring the issue didn't happen again

Many Thanks,

Timothy_Hall · ‎2018-08-17

Valeri did a good job of describing the Cluster Under Load (CUL) function which is enabled by default in R77.30 and later, and it sounds like it is doing its job properly in your case.

In R77.30 and earlier, the lowest-numbered firewall instance (#0 - CPU 15 in your case) was the only Firewall Worker core that could process all IPSec VPN traffic and VoIP traffic. This could lead to situations where that particular core would be 100% utilized and cause a bottleneck for that type of traffic. However in R80.10 and later IPSec VPN and VoIP processing is automatically distributed among all the Firewall Worker cores. Also the Dynamic Dispatcher which helps balance the Firewall Worker core loads is enabled by default in R80.10, please run fw ctl multik dynamic_dispatching get_mode and make sure it hasn't been switched off.

It is possible in R80.10 that certain other operations can still only be handled on the lowest-numbered firewall instance and cause the imbalance you are seeing. If these CUL messages continue to occur, please run the following commands while it is happening and post the results here for further analysis, you may just need to do a bit of tuning to remediate this situation:

fw ctl multik stat

fwaccel stats -s

fwaccel stat

free -m

enabled_blades

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Attend my online "Be your Own TAC: Part Deux" CheckMates event
March 27th with sessions for both the EMEA and Americas time zones

Pedro_Amaral · ‎2018-08-17

Hi Timothy,

Thanks for your help.

Dynamic Dispatching is indeed enabled in both gateways on the cluster.

fw ctl multik dynamic_dispatching get_mode
Current mode is On

We have CPU usage under close monitoring but since I've post my initial question on this thread, the issue didn't happen since.

Many Thanks to all!

_Val_ · ‎2018-08-17

If it does not happen again, there is no need to be sad 🙂

According to the messages above, core 15 stayed above 80% for more than 70 seconds. You may want to keep an eye on it, just in case.

JozkoMrkvicka · ‎2018-08-17

How can I determinate which core is affected ? Where from those logs I can see that the issue is with core #15 ?

[fw4_1]:FW-1 doesnt mean that the issue is with IPv4 core number 1 ? Or how correctly read this ?

Kind regards,
Jozko Mrkvicka

HeikoAnkenbrand · ‎2018-08-17

Hi Jozko

If you need a clue as to which CPU is overloaded over time, you can use the following command:

# sar -P ALL

PS:

This is a good command, you can also display other statistics over the time for memory "sar -r" and network "sar -n DEV".

Regards

Heiko

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

_Val_ · ‎2018-08-18

It is the first core assigned to FW4

JozkoMrkvicka · ‎2018-08-18

Thanks for confirmation. In our environment, we suspect that CUL is triggered at 00:00 every day by IPv6 connection.

Should I see CUL events: [fw6_<CORE>]:FW-X ?

Can be cpview used to see IPv6 top connections? I never see IPv6 there, only ipv4.

Kind regards,
Jozko Mrkvicka

Are you a member of CheckMates?

FW-1: [cul_load_freeze][CUL - Cluster] CUL should be OFF...