Cluster down and policy installation failure after enabling HTTPS inspection

MladenAntesevic · ‎2021-05-16

Hi,

We recently upgraded our 5600 cluster gateways to R81 with JHF Take 23. The management server runs the same version on a Smart-1 410 appliance. Everything ran fine until we enabled HTTPS inspection for a small group of test computers only (bypassing all others). After a few basic tests (browsing several web pages) we noticed that Internet browsing became randomly unresponsive. We also noticed very slow access policy installation, and several installations failed because the cluster went down when we tried to install the policy. After we disabled HTTPS inspection, policy installation succeeded and the cluster was OK again. I searched the logs and found that some cluster interfaces had been down or partially down because CCP messages were not decrypted correctly. Here is the relevant part of our log:

May 16 11:16:58 2021 CP-FW-1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=23322, Caller=policy change timer (fwha_periodic_policy_state_check))
May 16 11:16:58 2021 CP-FW-1 kernel: [fw4_1];CLUS-120003-1: Cluster policy installation failed (failure event - timeout) - resume the old policy
May 16 11:16:58 2021 CP-FW-1 kernel: [fw4_1];CLUS-220011-1: freeze state on remote member 2 has changed from ON to OFF
May 16 11:17:27 2021 CP-FW-1 kernel: [fw4_1];FW-1: [CUL - Member] Policy Freeze mechanism disabled, Enabling state machine at 4 (time=23615, caller=fwha_hp_periodic_run: FWHA_CUL_POLICY_STATE_FREEZE_TIMEDOUT)
May 16 11:18:01 2021 CP-FW-1 kernel: [fw4_0];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1
May 16 11:18:04 2021 CP-FW-1 xpand[18365]: Configuration changed from localhost by user admin by the service dbset
May 16 11:18:04 2021 CP-FW-1 last message repeated 9 times
May 16 11:18:14 2021 CP-FW-1 kernel: [fw4_1];CLUS-220011-1: freeze state on remote member 2 has changed from OFF to ON
May 16 11:18:14 2021 CP-FW-1 kernel: [fw4_1];FW-1: [cul_policy_freeze][CUL - Member] fwha_cul_policy_freeze_state_change: set Policy Freeze [ON], FREEZING state machine at ACTIVE (time=24084, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE, freeze_timeout=300, freeze_event_timeout=150)
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 4027287719/1286882944)
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=24091, Caller=fwha_check_policy_state, Type=0 State=ACTIVE)
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_0];CLUS-120126-1: Failed to decrypt CCP from member 1 on ifn 13, policy installation required.
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120126-1: Failed to decrypt CCP from member 1 on ifn 2, policy installation required.

the_rock · ‎2021-05-16

I can't say I have seen this yet, as I have not worked with any customers running R81 gateways (only R81 management servers). One thing I would check in your case: when HTTPS inspection is enabled, look for any odd wstlsd logs and see whether that process consumes a lot of memory or CPU. There is also a legacy HTTPS inspection dashboard you can open to confirm that all settings there are up to date.

TAC may have an updated debug procedure for HTTPS inspection.

Best,
Andy
"Have a great day and if its not, change it"

MladenAntesevic · ‎2021-05-17

We noticed CPU spikes caused by wstlsd and fw_full while HTTPS Inspection was enabled:

May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 3, top consumer: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 99, average cpu usage: 99, perf taken: 0
May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 21203, thread name: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 97, average cpu usage: 97, perf taken: 0
May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: fw_full, start time: 16/05/21 10:31:44, spike duration (sec): 40, initial cpu usage: 85, average cpu usage: 88, perf taken: 0
May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 22875, thread name: fw_full, start time: 16/05/21 10:31:50, spike duration (sec): 34, initial cpu usage: 92, average cpu usage: 95, perf taken: 0
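
For anyone who wants to total these spikes per process instead of reading them line by line, here is a minimal Python sketch. It assumes the spike_detective lines always have the field layout shown above; the regular expression and the function names are my own and not part of any Check Point tool.

```python
import re

# Field layout taken from the spike_detective lines quoted in this thread.
# Assumption: other versions may format these lines differently.
SPIKE_RE = re.compile(
    r"spike info: type: (?P<type>cpu|thread), "
    r"(?:cpu core: \d+, top consumer|thread id: \d+, thread name): (?P<name>[^,]+), "
    r"start time: [^,]+, "
    r"spike duration \(sec\): (?P<duration>\d+), "
    r"initial cpu usage: \d+, "
    r"average cpu usage: (?P<average>\d+)"
)

SAMPLE = [
    "May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 3, top consumer: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 99, average cpu usage: 99, perf taken: 0",
    "May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 21203, thread name: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 97, average cpu usage: 97, perf taken: 0",
    "May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: fw_full, start time: 16/05/21 10:31:44, spike duration (sec): 40, initial cpu usage: 85, average cpu usage: 88, perf taken: 0",
]


def parse_spike(line):
    """Return a dict for one spike line, or None if the line does not match."""
    m = SPIKE_RE.search(line)
    if m is None:
        return None
    return {
        "type": m.group("type"),
        "name": m.group("name"),
        "duration": int(m.group("duration")),
        "average": int(m.group("average")),
    }


def cpu_spike_seconds(lines):
    """Sum spike duration per top consumer, counting only 'cpu' type entries."""
    totals = {}
    for line in lines:
        spike = parse_spike(line)
        if spike is not None and spike["type"] == "cpu":
            totals[spike["name"]] = totals.get(spike["name"], 0) + spike["duration"]
    return totals


if __name__ == "__main__":
    print(cpu_spike_seconds(SAMPLE))
```

Only the "cpu" entries are summed so that the matching "thread" entry for the same spike is not counted twice.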

the_rock · ‎2021-05-17

I would definitely recommend you have TAC investigate that...

Best,
Andy
"Have a great day and if its not, change it"

_Val_ · ‎2021-05-17

It seems your active member is running very high CPU after HTTPSi is enabled.

Please make sure you have configured your testing HTTPSi policy according to best practices:

Destination is Internet on all Inspect rules
Source is your test IPs/networks
Services is set to web services only, not Any
The cleanup rule in the HTTPS inspection policy is Any-Any-Any-Bypass
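
To make the effect of these four points concrete, here is a small Python sketch of a first-match rulebase with one scoped Inspect rule and an Any-Any-Any-Bypass cleanup rule. It is only a model: the test network 10.1.1.0/24, the rule names, and the approximation of "Internet" as "anything outside RFC 1918 space" are assumptions for illustration. On a real gateway the Internet object is derived from the topology, and this is not how the gateway is implemented.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network

# Assumption: "Internet" is approximated as any destination outside RFC 1918.
INTERNAL_NETS = [
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
]


def is_internet(dst):
    return not any(ip_address(dst) in net for net in INTERNAL_NETS)


@dataclass
class Rule:
    name: str
    action: str                                   # "Inspect" or "Bypass"
    sources: list = field(default_factory=list)   # empty list means Any
    internet_only: bool = False                   # False means Any destination
    ports: set = field(default_factory=set)       # empty set means Any service

    def matches(self, src, dst, port):
        if self.sources and not any(ip_address(src) in n for n in self.sources):
            return False
        if self.internet_only and not is_internet(dst):
            return False
        if self.ports and port not in self.ports:
            return False
        return True


# Hypothetical test policy following the four points above.
POLICY = [
    Rule("Inspect test clients", "Inspect",
         sources=[ip_network("10.1.1.0/24")], internet_only=True, ports={443}),
    Rule("Cleanup", "Bypass"),  # Any-Any-Any-Bypass
]


def decide(rules, src, dst, port):
    """Return the action of the first matching rule, or None if nothing matches."""
    for rule in rules:
        if rule.matches(src, dst, port):
            return rule.action
    return None


if __name__ == "__main__":
    print(decide(POLICY, "10.1.1.20", "93.184.216.34", 443))  # test client to Internet
    print(decide(POLICY, "10.2.2.5", "93.184.216.34", 443))   # non-test client
```

In this model only the test clients' web traffic to the Internet hits the Inspect rule; internal destinations, other clients, and non-web TLS ports all fall through to the cleanup Bypass.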

MladenAntesevic · ‎2021-05-17

Thanks Val,

It seems my cleanup bypass rule is wrong; this is my current HTTPSi policy. I will change it to ANY-ANY-ANY-Bypass and schedule a maintenance window to test HTTPSi again.

the_rock · ‎2021-05-17

Just my personal opinion... I would not change the predefined bypass rule if I were you. I have seen people do this and it always leads to problems. I am not saying it would not work for you, but I agree with the first 3 points Val made, maybe just not the 4th one : )

Best,
Andy
"Have a great day and if its not, change it"

MladenAntesevic · ‎2021-05-17

Hi @the_rock, what would be the correct bypass rule in your opinion?

_Val_ · ‎2021-05-18

@MladenAntesevic What I said 🙂

You can also take a look here: https://community.checkpoint.com/t5/Security-Gateways/HTTPS-Inspection-Best-Practices-TechTalk-Video...

My recommendation is mentioned there as well, by our world-renowned expert Peter Elmer.

_Val_ · ‎2021-05-18

BTW, your bypass rule is almost okay. But as I said above, it is important to exclude any internal SSL traffic that your GW might try to decrypt, and also any non-web TLS services.

Up to you, of course.

the_rock · ‎2021-05-18

In my opinion, you should leave it as it is by default.

Best,
Andy
"Have a great day and if its not, change it"

_Val_ · ‎2021-05-18

@the_rock there is only one small problem: there is no built-in cleanup rule in the HTTPSi policy.

This is how it looks, when you create it:

If you leave it like that, I can guarantee you very high CPU utilization 🙂

the_rock · ‎2021-05-18

Well, I am not really sure about that, as even the TAC escalations team always recommends leaving the default rule as it is and then, if a bypass is needed, creating it in a similar fashion.

Best,
Andy
"Have a great day and if its not, change it"

_Val_ · ‎2021-05-18

Gimme an SR, I will check. Sounds more than weird.

_Val_ · ‎2021-05-18

If your cleanup rule in HTTPSi says Any-Any-Any-Inspect, or if you do not have one, HTTPSi will try decrypting all SSL traffic. That is the main error leading to CPU saturation.

You want your HTTPSi policy to be economical and to decrypt only the traffic that needs to be inspected. For outbound inspection, that means traffic from the internal client scope to the Internet on TLS web services only, and nothing else.

People frequently miss the fact that HTTPSi is active before your network security rulebase. That means that even if traffic later hits a drop rule, it will be decrypted first. That is a huge waste of effort if the policy is too liberal.
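
The ordering point can be illustrated with a toy Python model. It simply encodes the claim as stated above (the HTTPSi decision happens first, the access rulebase verdict comes afterwards) and counts how many decryptions end up being wasted on traffic that is dropped anyway. The connection fields, both policy functions, and the access rule are hypothetical and do not describe the gateway's real internals.

```python
def liberal_httpsi(conn):
    """Any-Any-Any-Inspect: every TLS connection gets decrypted."""
    return "Inspect"


def scoped_httpsi(conn):
    """Inspect only test clients going to the Internet on port 443."""
    if conn["test_client"] and conn["to_internet"] and conn["port"] == 443:
        return "Inspect"
    return "Bypass"


def access_policy(conn):
    """Hypothetical access rulebase: only port 443 is accepted."""
    return "Accept" if conn["port"] == 443 else "Drop"


def process(conns, httpsi_policy):
    """Return (decrypted, wasted): total decryptions and those later dropped."""
    decrypted = wasted = 0
    for conn in conns:
        # HTTPSi is evaluated first, before the access rulebase verdict.
        if httpsi_policy(conn) == "Inspect":
            decrypted += 1
            if access_policy(conn) == "Drop":
                wasted += 1
    return decrypted, wasted


CONNS = [
    {"test_client": True,  "to_internet": True,  "port": 443},
    {"test_client": False, "to_internet": True,  "port": 443},
    {"test_client": True,  "to_internet": False, "port": 8443},
    {"test_client": False, "to_internet": True,  "port": 993},
]

if __name__ == "__main__":
    print("liberal:", process(CONNS, liberal_httpsi))
    print("scoped: ", process(CONNS, scoped_httpsi))
```

With these four sample connections the liberal policy decrypts all four, two of which are then dropped, while the scoped policy decrypts only the one connection that actually needs inspection.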
