Re: Cluster member freeze issue

amith_rao · ‎2019-04-06

Hi all,

We are facing a peculiar issue with our R80.20 cluster.

Hardware: 5900 appliance

OS/Version: GAIA R80.20

Blades Enabled: Firewall, IPS and Anti-bot.

At least once a week one of the cluster members freezes. It is always the standby member, and it only comes back up after a reboot.

When we check the health in CPView history for the time of the issue (CPU, RAM, connections, hmem, smem, kmem failed allocations), everything looks fine. In fact, CPU is barely 10% utilized, RAM is at 10%, and connections are below 10,000.

R&D is currently involved and working on this. Based on their analysis we have disabled priority queues and drop optimization, but with no luck.

It would be helpful if you could bring in your expertise to narrow down the issue while R&D continues its investigation.

HeikoAnkenbrand · ‎2019-04-06

- Is jumbo hotfix 47 installed?

- Any errors in /var/log/messages?

- If you have only 10,000 connections, disable SecureXL and check again.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

HeikoAnkenbrand · ‎2019-04-06

FYI:
SecureXL has been significantly revised in R80.20. It now works in user space. This has also led to some changes in "fw monitor". The SecureXL driver takes a certain amount of kernel memory per core, and that was adding up to more kernel memory than Intel/Linux was allowing.

More info here:

R80.x Security Gateway Architecture (Logical Packet Flow)


HeikoAnkenbrand · ‎2019-04-06

I think involving R&D is the right way to go.


amith_rao · ‎2019-04-06

Please find my answers inline.

- Is jumbo hotfix 47 installed?

Jumbo take 33 is currently installed, and take 47 lists no resolved issues related to freezing.

- Any error in /var/log/messages

From the time of the freeze until the reboot, no relevant information is found in /var/log/messages. In fact, nothing at all is logged during the freeze.

- If you have only 10000 connections disable SecureXL and check it again.

The freeze is most often observed early in the morning. During the daytime, traffic exceeds 56,000 connections, so SecureXL cannot be disabled.

But if SecureXL were causing the issue, it should show up on the active member. Why is the standby member, which is not handling any traffic, the one being affected?

Timothy_Hall · ‎2019-04-06

Does the console on the standby still respond during the "freeze"? Or do you have to pull the power cord to recover?

When you say "always standby member" do you mean that the issue always occurs on whatever member happens to be standby, and it has happened on both pieces of hardware? Or does it always happen on the same piece of hardware that is standby? If the latter, check the hardware sensor data; I believe you can look at historical sensor data right from cpview in R80.20 and later.

What do the commands cphaprob stat, cphaprob -a if and cphaprob -l list display while the standby member is in its afflicted state? Does ClusterXL still report everything is OK or does it report a failure? What I would try to do in this case is determine if it is ClusterXL itself misbehaving, or the underlying firewall infrastructure that is experiencing a problem and ClusterXL is just reporting it. Based on the troubleshooting steps so far it sounds like TAC suspects something in the underlying firewall code. I assume TAC has already looked in /var/log/messages* for any smoking guns?

Is the standby member experiencing issues with the Sync interface specifically? If so see these threads:

https://community.checkpoint.com/t5/General-Topics/Issue-on-the-sync-interface/m-p/30640

https://www.cpug.org/forums/showthread.php/22679-HA-Failover-appears-to-be-caused-by-sync-interface

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

amith_rao · ‎2019-04-06

Hi, Timothy

We have faced this issue on both pieces of hardware, i.e. whichever cluster member happens to be in standby mode freezes.

In CPView history we are unable to see hardware sensor readings such as CPU temperature, fan speed, etc.

The ClusterXL commands do report an issue.

Command outputs

#cphaprob stat

Member1 - Active Attention
Member 2 - Lost.

#cphaprob -a if

Out of the 15 interfaces, 3 are in the down state, including the Sync interface. The same 3 interfaces show as down during every freeze incident.

#cphaprob -l list

All ok on Member 1
Member 2 not accessible.

We see nothing written to /var/log/messages from the time of the freeze until the box is rebooted.
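To pin down when logging actually stopped, the silent window can be located from the timestamps themselves. The following is a minimal Python sketch, assuming the classic syslog prefix (`Apr  6 03:12:01 host ...`) with no year in it; the function name, the 10-minute threshold, and the sample hostnames are illustrative assumptions, not anything from TAC or Check Point.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs for silent periods longer than min_gap.

    Assumes each entry starts with a 15-character syslog timestamp such as
    'Apr  6 03:12:01'. The year is not in the log, so the caller supplies it.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line (continuation, blank, etc.)
        if previous is not None and stamp - previous > min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


if __name__ == "__main__":
    # Hypothetical sample; on a real gateway, read /var/log/messages* instead.
    sample = [
        "Apr  6 03:10:00 fw2 kernel: example entry",
        "Apr  6 03:12:00 fw2 kernel: last entry before the hang",
        "Apr  6 07:45:30 fw2 syslogd: first entry after reboot",
    ]
    for last_seen, next_seen in find_log_gaps(sample, 2019):
        print(f"silent from {last_seen} to {next_seen}")
```

The last timestamp before each gap is a reasonable starting point for lining up CPView history and switch logs with the moment of the hang.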

Is the standby member experiencing issues with the Sync interface specifically?

Nothing points to a Sync problem as such, but we did see some RX buffer overruns on the Sync interface. The Sync link between the cluster members was originally connected back to back. We changed this to go through a switch and hardcoded the interface to full duplex on both the firewall and the switch side. We have not seen any buffer overrun readings since.
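For tracking overruns over time rather than spotting them by eye, the per-interface counters can be read from `/proc/net/dev`. This is a minimal Python sketch, assuming the standard Linux layout of that file, where the fifth receive column is the FIFO (overrun) counter; whether that is the same counter we were reading above is an assumption, and the interface names in the sample are made up.

```python
def rx_fifo_errors(proc_net_dev_text):
    """Map interface name -> receive FIFO error count from /proc/net/dev text.

    Receive columns are: bytes packets errs drop fifo frame compressed multicast.
    """
    counters = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # first two lines are headers
        name, sep, stats = line.partition(":")
        fields = stats.split()
        if sep and len(fields) >= 5:
            counters[name.strip()] = int(fields[4])
    return counters


if __name__ == "__main__":
    # On a live system: text = open("/proc/net/dev").read()
    text = (
        "Inter-|   Receive                                                |  Transmit\n"
        " face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed\n"
        "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n"
        "  Sync: 987654 4321 0 2 17 0 0 0 123456 789 0 0 0 0 0 0\n"
    )
    print(rx_fifo_errors(text))
```

Sampling this at intervals and comparing successive values shows whether the counter is still increasing after a cabling or duplex change.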

amith_rao · ‎2019-04-06

Does the console on the standby still respond during the "freeze"? Or do you have to pull the power cord to recover?

The console on the standby doesn't respond during the freeze unless we boot it into online debug mode (kdb mode). We hard reboot when the freeze occurs.

Timothy_Hall · ‎2019-04-06

So it sounds like you are experiencing a hard hang on the standby. In cpview history mode, does free memory slowly decrease in the period leading up to the incident? Just wondering if the kernel has somehow managed to exhaust all free memory, which would cause all user-space processes to hang/die (including getty for the console).
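If free-memory readings for the hours before a hang can be written down (from cpview history or any other source), a simple trend fit makes a slow leak easier to see than eyeballing the numbers. A minimal Python sketch, assuming the samples are collected by hand as (seconds, free MB) pairs; the function name and the sample values are hypothetical, and this is an ordinary least-squares slope, not anything cpview provides.

```python
def free_memory_slope(samples):
    """Least-squares slope of free memory over time, in MB per second.

    samples: iterable of (seconds_since_start, free_mb) pairs.
    A clearly negative slope suggests free memory is steadily draining.
    """
    samples = list(samples)
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical hourly readings taken before a freeze.
    readings = [(0, 1000.0), (3600, 900.0), (7200, 800.0)]
    print(f"{free_memory_slope(readings) * 3600:.1f} MB per hour")
```

A slope near zero would argue against memory exhaustion and point back toward the interrupt or driver theories.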

In hang situations such as these, trying to determine whether the hang is occurring in a Gaia/Linux driver or in Check Point's custom kernel code can be very helpful. Let's start with Gaia/Linux:

Are you using the new 3.10 kernel? (uname -a from expert mode) My guess is yes and there are significantly newer NIC drivers in use by that new kernel.

Another cause of a hang can be getting stuck inside a hardware interrupt, which can be triggered by hardware or a driver. Since handling NIC traffic is by far the most common hardware interrupt operation on a firewall, it is logical to look there. I'd suggest simplifying what the NICs and their Gaia/Linux drivers are doing on both firewalls and seeing if it impacts the problem:

1) Disable Hyperthreading (adjust back to 6 instances for a 2/6 split via cpconfig)

2) Disable Multi-Queue if enabled

3) If they have been modified, set interface ring buffer sizes back to their default

If the hang is occurring in Check Point code, it will be a lot tougher to find. Might be interesting to run ips off and fw amw unload on just the standby and see if the problem stops happening (you'll need to run these again if you reinstall policy to the cluster). Obviously if a regular failover to the standby occurs the IPS and AntiBot blades will not be protecting your traffic there, so take that into consideration. Also try the following simplifications from the Check Point code side:

1) Disable the Monitoring & QoS blades on the gateway if enabled; these features load extra kernel drivers on the gateway

2) Disable SecureXL - Note that SecureXL cannot really be permanently disabled in R80.20 and later

3) Look at the output of the enabled_blades command, anything else you can disable?


HeikoAnkenbrand · ‎2019-04-06

Hi guys,

I think if a 5900 with 10,000 connections freezes, then something is seriously wrong.

We have several customers who use a 5900 appliance with R80.20 JHF47. This error does not occur there.

R&D should take a closer look at the appliance here.

Regards

Heiko


HeikoAnkenbrand · ‎2019-04-06

Hi @Timothy_Hall

The 5900 appliance should use a 2.6 kernel, so the 3.10 kernel and driver problem is not relevant here. But I agree with you: open servers with the 3.10 kernel have some problems with SecureXL enabled and with network drivers. We've also opened some cases here :-(

Regards

Heiko

