<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster member freeze issue in Firewall and Security Management</title>
    <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49995#M3723</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Does the console on the standby still respond during the "freeze"?&amp;nbsp; Or do you have to pull the power cord to recover?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;The console on the standby doesn't respond during the freeze unless we boot it into online debug mode (kdb mode). We hard reboot when the freeze occurs.&lt;/P&gt;</description>
    <pubDate>Sat, 06 Apr 2019 14:10:48 GMT</pubDate>
    <dc:creator>amith_rao</dc:creator>
    <dc:date>2019-04-06T14:10:48Z</dc:date>
    <item>
      <title>Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49975#M3716</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;We are facing a peculiar issue with our R80.20 cluster.&lt;/P&gt;&lt;P&gt;Hardware: 5900 appliance&lt;/P&gt;&lt;P&gt;OS/Version: Gaia R80.20&lt;/P&gt;&lt;P&gt;Blades Enabled: Firewall, IPS and Anti-Bot.&lt;/P&gt;&lt;P&gt;At least once a week one of the cluster members freezes. It is always the standby member, and it only comes back up after a reboot.&lt;/P&gt;&lt;P&gt;When we check health in CPView history for the time of the issue (CPU, RAM, connections, Hmem, Smem and Kmem failed allocations), everything looks fine. In fact, CPU is barely 10% utilized, RAM is at 10%, and there are fewer than 10,000 connections.&lt;/P&gt;&lt;P&gt;R&amp;amp;D is currently involved and working on this. Based on their analysis we have disabled priority queues and drop optimization, but with no luck.&lt;/P&gt;&lt;P&gt;It would be helpful if you could bring in your expertise to narrow down the issue while R&amp;amp;D continues its investigation.&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 08:12:41 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49975#M3716</guid>
      <dc:creator>amith_rao</dc:creator>
      <dc:date>2019-04-06T08:12:41Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49987#M3717</link>
      <description>&lt;P&gt;- Is jumbo hotfix 47 installed?&lt;/P&gt;
&lt;P&gt;- Any errors in /var/log/messages?&lt;/P&gt;
&lt;P&gt;- If you have only 10,000 connections, disable SecureXL and check again.&lt;/P&gt;
</description>
      <pubDate>Sat, 06 Apr 2019 13:11:36 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49987#M3717</guid>
      <dc:creator>HeikoAnkenbrand</dc:creator>
      <dc:date>2019-04-06T13:11:36Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49988#M3718</link>
      <description>&lt;P&gt;FYI:&lt;BR /&gt;SecureXL has been significantly revised in R80.20. It now works in user space. This has also led to some changes in "fw monitor". The SecureXL driver takes a certain amount of kernel memory per core, and that was adding up to more kernel memory than Intel/Linux was allowing.&lt;/P&gt;
&lt;P&gt;More information here:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://community.checkpoint.com/t5/General-Topics/R80-x-Security-Gateway-Architecture-Logical-Packet-Flow/td-p/41747" target="_self"&gt;R80.x Security Gateway Architecture (Logical Packet Flow)&lt;/A&gt;&lt;/P&gt;
</description>
      <pubDate>Sat, 06 Apr 2019 13:18:57 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49988#M3718</guid>
      <dc:creator>HeikoAnkenbrand</dc:creator>
      <dc:date>2019-04-06T13:18:57Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49989#M3719</link>
      <description>&lt;P&gt;I think R&amp;amp;D is the right way to go.&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 13:19:33 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49989#M3719</guid>
      <dc:creator>HeikoAnkenbrand</dc:creator>
      <dc:date>2019-04-06T13:19:33Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49992#M3720</link>
      <description>&lt;P&gt;Please find the answers/comments inline.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;- Is jumbo hotfix 47 installed?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Currently Jumbo take 33 is installed, and take 47 lists no resolved issues related to freezing.&lt;/P&gt;&lt;P&gt;- Any error in /var/log/messages&lt;/P&gt;&lt;P&gt;From the time of the freeze until the reboot, no relevant information is found in /var/log/messages. In fact, nothing at all is logged during the freeze.&lt;/P&gt;&lt;P&gt;- If you have only 10000 connections disable SecureXL and check it again.&lt;/P&gt;&lt;P&gt;The freeze is most often observed early in the morning. During the daytime the traffic exceeds 56,000 connections, so SecureXL cannot be disabled.&lt;/P&gt;&lt;P&gt;But if SecureXL were causing the issue, it should show up on the active member. Why is the standby member, which is not handling any traffic, the one being affected?&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 13:30:22 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49992#M3720</guid>
      <dc:creator>amith_rao</dc:creator>
      <dc:date>2019-04-06T13:30:22Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49993#M3721</link>
      <description>&lt;P&gt;Does the console on the standby still respond during the "freeze"?&amp;nbsp; Or do you have to pull the power cord to recover?&lt;/P&gt;
&lt;P&gt;When you say "always standby member" do you mean that the issue always occurs on whatever member happens to be standby, and it has happened on both pieces of hardware?&amp;nbsp; Or does it always happen on the same piece of hardware that is standby?&amp;nbsp; If the latter, check the hardware sensor data; I believe you can look at historical sensor data right from cpview in R80.20 and later.&lt;/P&gt;
&lt;P&gt;What do the commands &lt;STRONG&gt;cphaprob stat&lt;/STRONG&gt;, &lt;STRONG&gt;cphaprob -a if&lt;/STRONG&gt; and &lt;STRONG&gt;cphaprob -l list &lt;/STRONG&gt;display while the standby member is in its afflicted state?&amp;nbsp; Does ClusterXL still report everything is OK or does it report a failure?&amp;nbsp; What I would try to do in this case is determine if it is ClusterXL itself misbehaving, or the underlying firewall infrastructure that is experiencing a problem and ClusterXL is just reporting it.&amp;nbsp; Based on the troubleshooting steps so far it sounds like TAC suspects something in the underlying firewall code.&amp;nbsp; I assume TAC has already looked in /var/log/messages* for any smoking guns?&lt;/P&gt;
&lt;P&gt;Is the standby member experiencing issues with the Sync interface specifically?&amp;nbsp; If so see these threads:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://community.checkpoint.com/t5/General-Topics/Issue-on-the-sync-interface/m-p/30640" target="_blank" rel="noopener"&gt;https://community.checkpoint.com/t5/General-Topics/Issue-on-the-sync-interface/m-p/30640&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.cpug.org/forums/showthread.php/22679-HA-Failover-appears-to-be-caused-by-sync-interface" target="_blank" rel="noopener"&gt;https://www.cpug.org/forums/showthread.php/22679-HA-Failover-appears-to-be-caused-by-sync-interface&lt;/A&gt;&lt;/P&gt;
</description>
      <pubDate>Sat, 06 Apr 2019 13:41:28 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49993#M3721</guid>
      <dc:creator>Timothy_Hall</dc:creator>
      <dc:date>2019-04-06T13:41:28Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49994#M3722</link>
      <description>&lt;P&gt;Hi, Timothy&lt;/P&gt;&lt;P&gt;We have faced this issue on both pieces of hardware, i.e. whichever cluster member happens to be in standby mode freezes.&lt;/P&gt;&lt;P&gt;In CPView history we are unable to see hardware sensor readings such as CPU temperature, fan speed, etc.&lt;/P&gt;&lt;P&gt;The ClusterXL commands report an issue.&lt;/P&gt;&lt;P&gt;Command outputs&lt;/P&gt;&lt;P&gt;#cphaprob stat&lt;/P&gt;&lt;P&gt;Member 1 - Active Attention&lt;BR /&gt;Member 2 - Lost.&lt;/P&gt;&lt;P&gt;#cphaprob -a if&lt;/P&gt;&lt;P&gt;Out of the 15 interfaces, we see 3 interfaces in the down state, including the Sync interface. The same 3 interfaces show as down during every freeze incident.&lt;/P&gt;&lt;P&gt;#cphaprob -l list&lt;/P&gt;&lt;P&gt;All OK on Member 1&lt;BR /&gt;Member 2 not accessible.&lt;/P&gt;&lt;P&gt;We see nothing written to /var/log/messages from the time of the freeze until the box is rebooted.&lt;/P&gt;&lt;P&gt;Is the standby member experiencing issues with the Sync interface specifically?&lt;/P&gt;&lt;P&gt;Nothing related to a Sync issue, but we could see some RX buffer overruns on the Sync interface. Since the Sync link between the cluster members was connected back to back, we changed this by connecting them through the switch and hardcoding the interface to full duplex at both the firewall and switch ends. We did not get any buffer overrun readings thereafter.&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 13:59:21 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49994#M3722</guid>
      <dc:creator>amith_rao</dc:creator>
      <dc:date>2019-04-06T13:59:21Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49995#M3723</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Does the console on the standby still respond during the "freeze"?&amp;nbsp; Or do you have to pull the power cord to recover?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;The console on the standby doesn't respond during the freeze unless we boot it into online debug mode (kdb mode). We hard reboot when the freeze occurs.&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 14:10:48 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49995#M3723</guid>
      <dc:creator>amith_rao</dc:creator>
      <dc:date>2019-04-06T14:10:48Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49996#M3724</link>
      <description>&lt;P&gt;So it sounds like you are experiencing a hard hang on the standby.&amp;nbsp; In cpview history mode leading up to the incident does free memory slowly decrease?&amp;nbsp; Just wondering if the kernel has somehow managed to exhaust all free memory which would cause all user-space processes to hang/die (including getty for the console).&lt;/P&gt;
&lt;P&gt;In hang situations such as these, making an attempt to determine whether the hang is occurring in a Gaia/Linux driver or in Check Point's custom kernel code can be very helpful.&amp;nbsp; Let's start with Gaia/Linux:&lt;/P&gt;
&lt;P&gt;Are you using the new 3.10 kernel?&amp;nbsp; (&lt;STRONG&gt;uname -a&lt;/STRONG&gt; from expert mode)&amp;nbsp; My guess is yes and there are significantly newer NIC drivers in use by that new kernel.&lt;/P&gt;
&lt;P&gt;Another hang cause can be getting stuck inside a hardware interrupt, which can be caused by hardware or a driver.&amp;nbsp; Since handling NIC traffic is by far the most common hardware interrupt operation on a firewall, it is logical to look there.&amp;nbsp; I'd suggest trying to simplify what the NICs and their Gaia/Linux drivers are trying to do on both firewalls and see if it impacts the problem by making these changes:&lt;/P&gt;
&lt;P&gt;1) Disable Hyperthreading (adjust back to 6 instances for a 2/6 split via cpconfig)&lt;/P&gt;
&lt;P&gt;2) Disable Multi-Queue if enabled&lt;/P&gt;
&lt;P&gt;3) If they have been modified, set interface ring buffer sizes back to their default&lt;/P&gt;
&lt;P&gt;If the hang is occurring in Check Point code, it will be a lot tougher to find.&amp;nbsp; Might be interesting to run &lt;STRONG&gt;ips off&lt;/STRONG&gt; and &lt;STRONG&gt;fw amw unload&lt;/STRONG&gt; on just the standby and see if the problem stops happening (you'll need to run these again if you reinstall policy to the cluster).&amp;nbsp; Obviously if a regular failover to the standby occurs the IPS and AntiBot blades will not be protecting your traffic there, so take that into consideration.&amp;nbsp; Also try the following simplifications from the Check Point code side:&lt;/P&gt;
&lt;P&gt;1) Disable the Monitoring &amp;amp; QoS blades on the gateway if enabled; these features load extra kernel drivers on the gateway&lt;/P&gt;
&lt;P&gt;2) Disable SecureXL - Note that SecureXL cannot really be permanently disabled in R80.20 and later&lt;/P&gt;
&lt;P&gt;3) Look at the output of the &lt;STRONG&gt;enabled_blades&lt;/STRONG&gt; command; is there anything else you can disable?&lt;/P&gt;
</description>
      <pubDate>Sat, 06 Apr 2019 14:45:44 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/49996#M3724</guid>
      <dc:creator>Timothy_Hall</dc:creator>
      <dc:date>2019-04-06T14:45:44Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/50001#M3727</link>
      <description>&lt;P&gt;Hi guys,&lt;/P&gt;
&lt;P&gt;I think if a 5900 with 10,000 connections freezes, then something is seriously wrong.&lt;/P&gt;
&lt;P&gt;We have several customers who use a 5900 appliance with R80.20 JHF47. This error does not occur there.&lt;/P&gt;
&lt;P&gt;R&amp;amp;D should take a closer look at the appliance here.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards&lt;/P&gt;
&lt;P&gt;Heiko&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 16:21:03 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/50001#M3727</guid>
      <dc:creator>HeikoAnkenbrand</dc:creator>
      <dc:date>2019-04-06T16:21:03Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member freeze issue</title>
      <link>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/50002#M3728</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.checkpoint.com/t5/user/viewprofilepage/user-id/597"&gt;@Timothy_Hall&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The 5900 appliance should use a 2.6 kernel, so the 3.10 kernel and driver problem is not relevant here.&amp;nbsp;But I agree with you, open servers with the 3.10 kernel have some problems with SecureXL enabled and with network drivers.&amp;nbsp;We've also opened some cases for this :-(&lt;/P&gt;
&lt;P&gt;Regards&lt;/P&gt;
&lt;P&gt;Heiko&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2019 16:37:00 GMT</pubDate>
      <guid>https://community.checkpoint.com/t5/Firewall-and-Security-Management/Cluster-member-freeze-issue/m-p/50002#M3728</guid>
      <dc:creator>HeikoAnkenbrand</dc:creator>
      <dc:date>2019-04-06T16:37:00Z</dc:date>
    </item>
  </channel>
</rss>

