MladenAntesevic
Collaborator

Cluster down and policy installation failure after enabling HTTPS inspection

Hi,

We recently upgraded our 5600 cluster gateways to R81 + JHF Take 23. The management server is on the same version and runs on a Smart-1 410 appliance. Everything ran fine until we enabled HTTPS inspection for a small group of test computers (bypassing all others). After a few basic tests (opening several web pages) we noticed that browsing became unresponsive at random. We also noticed very slow access policy installation, and several installations failed because the cluster went down when we tried to install the policy. After we disabled HTTPS inspection, policy installation succeeded and the cluster was OK. I searched the logs and found that some cluster interfaces had been down or partially down because CCP messages were not decrypted correctly. Here is the relevant part of our log:

May 16 11:16:58 2021 CP-FW-1 kernel: [fw4_1];CLUS-120009-1: Cluster policy installation state freeze OFF (Time=23322, Caller=policy change timer (fwha_periodic_policy_state_check))
May 16 11:16:58 2021 CP-FW-1 kernel: [fw4_1];CLUS-120003-1: Cluster policy installation failed (failure event - timeout) - resume the old policy
May 16 11:16:58 2021 CP-FW-1 kernel: [fw4_1];CLUS-220011-1: freeze state on remote member 2 has changed from ON to OFF
May 16 11:17:27 2021 CP-FW-1 kernel: [fw4_1];FW-1: [CUL - Member] Policy Freeze mechanism disabled, Enabling state machine at 4 (time=23615, caller=fwha_hp_periodic_run: FWHA_CUL_POLICY_STATE_FREEZE_TIMEDOUT)
May 16 11:18:01 2021 CP-FW-1 kernel: [fw4_0];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1
May 16 11:18:04 2021 CP-FW-1 xpand[18365]: Configuration changed from localhost by user admin by the service dbset
May 16 11:18:04 2021 CP-FW-1 last message repeated 9 times
May 16 11:18:14 2021 CP-FW-1 kernel: [fw4_1];CLUS-220011-1: freeze state on remote member 2 has changed from OFF to ON
May 16 11:18:14 2021 CP-FW-1 kernel: [fw4_1];FW-1: [cul_policy_freeze][CUL - Member] fwha_cul_policy_freeze_state_change: set Policy Freeze [ON], FREEZING state machine at ACTIVE (time=24084, caller=fwioctl: FWHA_CUL_POLICY_STATE_FREEZE, freeze_timeout=300, freeze_event_timeout=150)
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 4027287719/1286882944)
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120008-1: Cluster policy installation state freeze ON (Time=24091, Caller=fwha_check_policy_state, Type=0 State=ACTIVE)
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_0];CLUS-120126-1: Failed to decrypt CCP from member 1 on ifn 13, policy installation required.
May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120126-1: Failed to decrypt CCP from member 1 on ifn 2, policy installation required.
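
For anyone triaging a similar log, here is a minimal Python sketch that pulls out the CCP decryption failures and groups the affected interface numbers by cluster member. The regular expression is written only against the CLUS-120126 line format shown above; the function name `ccp_decrypt_failures` is made up for this example, and other versions may word the message differently.

```python
import re
from collections import defaultdict

# Matches the CLUS-120126 message format seen in the log excerpt above.
CCP_FAIL = re.compile(
    r"CLUS-120126-\d+: Failed to decrypt CCP from member (\d+) on ifn (\d+)"
)

def ccp_decrypt_failures(lines):
    """Return {member_id: sorted interface numbers} for CLUS-120126 events."""
    hits = defaultdict(set)
    for line in lines:
        match = CCP_FAIL.search(line)
        if match:
            member, ifn = int(match.group(1)), int(match.group(2))
            hits[member].add(ifn)
    return {member: sorted(ifns) for member, ifns in hits.items()}

sample = [
    "May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 4027287719/1286882944)",
    "May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_0];CLUS-120126-1: Failed to decrypt CCP from member 1 on ifn 13, policy installation required.",
    "May 16 11:18:15 2021 CP-FW-1 kernel: [fw4_1];CLUS-120126-1: Failed to decrypt CCP from member 1 on ifn 2, policy installation required.",
]

print(ccp_decrypt_failures(sample))  # {1: [2, 13]}
```

Run against the full /var/log/messages content, this gives a quick view of which interfaces lost CCP during the failed installation.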

14 Replies
the_rock
Leader

Can't say I have seen this yet, as I have not worked with any customers running R81 gateways (only R81 mgmt server), but one thing I would check in your case is whether, with HTTPS inspection enabled, there are any odd wstlsd logs or whether that process consumes a lot of memory/CPU. Also, there is a legacy HTTPS dashboard you can open to confirm that all settings there are up to date.

TAC may have an updated debug procedure for HTTPS inspection.

MladenAntesevic
Collaborator

We noticed CPU spikes caused by wstlsd and fw_full at the time we enabled HTTPS Inspection:

May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 3, top consumer: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 99, average cpu usage: 99, perf taken: 0
May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 21203, thread name: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 97, average cpu usage: 97, perf taken: 0
May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: fw_full, start time: 16/05/21 10:31:44, spike duration (sec): 40, initial cpu usage: 85, average cpu usage: 88, perf taken: 0
May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 22875, thread name: fw_full, start time: 16/05/21 10:31:50, spike duration (sec): 34, initial cpu usage: 92, average cpu usage: 95, perf taken: 0
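
To summarize entries like these across a longer log, a small Python sketch along the following lines could help. It assumes the `spike info:` field layout exactly as it appears in the four lines above (comma-separated `key: value` pairs); the helper names `parse_spike` and `longest_spike_per_process` are invented for this example.

```python
def parse_spike(line):
    """Split a spike_detective line into a dict of its 'key: value' fields."""
    marker = "spike info: "
    idx = line.find(marker)
    if idx == -1:
        return None
    fields = {}
    for part in line[idx + len(marker):].split(", "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields

def longest_spike_per_process(lines):
    """Return {process name: longest spike duration in seconds}."""
    worst = {}
    for line in lines:
        fields = parse_spike(line)
        if not fields:
            continue
        # cpu-type entries name the process in "top consumer",
        # thread-type entries in "thread name".
        name = fields.get("top consumer") or fields.get("thread name")
        duration = int(fields["spike duration (sec)"])
        worst[name] = max(worst.get(name, 0), duration)
    return worst

sample = [
    "May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 3, top consumer: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 99, average cpu usage: 99, perf taken: 0",
    "May 16 10:31:45 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 21203, thread name: wstlsd, start time: 16/05/21 10:31:38, spike duration (sec): 6, initial cpu usage: 97, average cpu usage: 97, perf taken: 0",
    "May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: fw_full, start time: 16/05/21 10:31:44, spike duration (sec): 40, initial cpu usage: 85, average cpu usage: 88, perf taken: 0",
    "May 16 10:32:25 2021 CP-FW-1 spike_detective: spike info: type: thread, thread id: 22875, thread name: fw_full, start time: 16/05/21 10:31:50, spike duration (sec): 34, initial cpu usage: 92, average cpu usage: 95, perf taken: 0",
]

print(longest_spike_per_process(sample))  # {'wstlsd': 6, 'fw_full': 40}
```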

the_rock
Leader

I would definitely recommend you have TAC investigate that...

_Val_
Admin

It seems your active member is running very high CPU after HTTPSi is enabled.

Please make sure you have configured your testing HTTPSi policy according to best practices:

  1. Destination is Internet on all Inspect rules
  2. Source is your test IPs/networks
  3. Services are set to web services only, not ANY
  4. ANY-ANY-ANY-Bypass is the cleanup rule in the HTTPS inspection policy
MladenAntesevic
Collaborator

Thanks Val,

It seems my cleanup bypass rule is wrong; this is my current HTTPSi policy. I will change it to ANY-ANY-ANY-Bypass and schedule a maintenance window to test HTTPSi again.

the_rock
Leader

Just my personal opinion... I would not change the predefined bypass rule if I were you. I have seen people do this and it always leads to problems. Not saying it would not work for you, but I agree with the first 3 points Val made, maybe just not the 4th one : )

MladenAntesevic
Collaborator

Hi @the_rock, what would be the correct bypass rule in your opinion?

_Val_
Admin

@MladenAntesevic What I said 🙂

You can also look here: https://community.checkpoint.com/t5/Security-Gateways/HTTPS-Inspection-Best-Practices-TechTalk-Video...

My recommendation is mentioned there as well, by our world-renowned expert Peter Elmer.

_Val_
Admin

BTW, your bypass rule is almost okay. But as I said above, it is important to exclude any internal SSL traffic that your GW might try to decrypt, and also any non-web TLS services. 

Up to you, of course.

the_rock
Leader

In my opinion, you should leave it as it is by default.

_Val_
Admin

@the_rock there is only one small problem: there is no built-in cleanup rule in the HTTPSi policy.

This is how it looks when you create it:

[Screenshot 2021-05-18 at 18.17.02.png]

If you leave it like that, I can guarantee you very high CPU utilization 🙂

the_rock
Leader

Well, not really sure about that, as even the TAC escalations team always recommends leaving the default rule as it is and then, if a bypass is needed, creating it in a similar fashion.

_Val_
Admin

Gimme an SR, I will check. Sounds more than weird.

_Val_
Admin

If your cleanup rule in HTTPSi says Any-Any-Any-Inspect, or if you do not have one, HTTPSi will try decrypting all SSL traffic. That is the main error leading to CPU saturation. 

You want your HTTPSi policy to be economical and to decrypt only the traffic that needs to be inspected. For outbound, that is the internal client scope to the Internet, on TLS web services only, and nothing else.

People frequently miss the fact that HTTPSi is active before your network security rulebase. That means that even if traffic later hits a drop rule, it will be decrypted first. A huge waste of effort if the policy is too liberal.
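
To make the cleanup-rule point concrete, here is a toy Python model of a first-match HTTPSi rulebase. It is only a sketch of the reasoning in this thread, not how the gateway actually matches traffic: the `Rule` class, the `decide` function, and the group names (`test_clients`, `internet`, and so on) are all hypothetical, and the "inspect when nothing matches" default is taken from the description above rather than from product documentation.

```python
from dataclasses import dataclass

ANY = "any"

@dataclass(frozen=True)
class Rule:
    source: str
    destination: str
    service: str
    action: str  # "inspect" or "bypass"

    def matches(self, conn):
        """conn is a (source, destination, service) tuple of group names."""
        wanted = (self.source, self.destination, self.service)
        return all(want == ANY or want == got for want, got in zip(wanted, conn))

def decide(rules, conn, default="inspect"):
    """First matching rule wins; the default models the no-cleanup-rule case."""
    for rule in rules:
        if rule.matches(conn):
            return rule.action
    return default

# Narrow inspect rule for the test group, then an explicit bypass cleanup rule.
policy = [
    Rule("test_clients", "internet", "https", "inspect"),
    Rule(ANY, ANY, ANY, "bypass"),
]

web_test = ("test_clients", "internet", "https")
internal_tls = ("servers", "internal", "ldaps")

print(decide(policy, web_test))          # inspect
print(decide(policy, internal_tls))      # bypass
print(decide(policy[:1], internal_tls))  # inspect (no cleanup rule present)
```

The last call shows the situation described above: with the cleanup rule removed, internal non-web TLS traffic falls through to inspection and costs CPU for no benefit.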