israelfds95
Contributor

HTOP ALL CPU SND STUCK 100% ZABBIX MONITOR CPU STRANGE BEHAVIOR

After extensive investigations and opening a case with TAC, this is the information regarding this bizarre situation, which currently has no definitive solution.

Note: I'm noticing this more on the new 9000 models, along with some other strange behavior that I'm still investigating. For now, I hope this helps the community.

Linux commands report a high CPU load average when SecureXL operates in the User Space (UPPAK) mode
https://support.checkpoint.com/results/sk/sk180299

SK180299 explains the issue and indicates that the solution is to rely (trust) on CPview information and update the OID for monitoring.

sk-sobre-snd-100%-.jpeg

At this moment, TAC has not provided me with a definitive solution. In my opinion, if Check Point no longer intends to use Linux monitoring commands, they should either remove these options from Gaia or resolve the issue as quickly as possible, as they cause significant confusion during troubleshooting. I lost four days dealing with this problem. The customer is very dissatisfied with the situation, which is causing a lack of trust in these new versions. It is difficult to explain why this is happening now and why they should ignore something they have relied on for many years while using Check Point Security Gateway (Cluster).

Below are more screenshots to illustrate the issue.

When you look at htop, you see all SND cores at 100% all the time.
cpu-snd-100%-htop.jpeg

In CPview, everything looks OK.

 

hcpview-snd-view.jpg

htop-view.jpeg
I found nothing in User Center about *usim_x86, but I now understand that it is a problem.

Below is a PDF with additional screenshots documenting this situation. While investigating a problem on a 9700, I observed this behavior on all four 9700s that I’m implementing. My colleagues have reported it as well, and I’ve also seen it on a 9200 cluster.


Accepted Solutions
Timothy_Hall
Legend

What you are seeing as far as CPU behavior is because the 9000/19000/29000 series appliances run SecureXL in UPPAK (usim) mode by default, as opposed to the older approach with the SecureXL sim driver located in the kernel (KPPAK). Just like how the Firewall Worker Instances were transitioned out of the kernel (USFW), the same is slowly happening for SecureXL. At some point UPPAK will probably be the default on all appliances, but it does require special hooks into the NIC drivers (more on that later), so for now UPPAK only happens on the 9000/19000/29000 series and Lightspeed appliances. UPPAK cannot be enabled for testing on any other type of appliance at this time, and not in VMware. Hopefully threads such as this will help spread the word about UPPAK, as it is a pretty serious change to the gateway code.

The SND instances always executing at 100% as shown in the traditional Linux-based monitoring tools when UPPAK is enabled is expected behavior.  This is due to the fact that UPPAK utilizes something called "poll mode" as opposed to the traditional interrupt-driven processing to acquire packets that KPPAK uses.  I would assume this change to poll mode was made due to the exponential growth in rates of interrupts and the associated overhead to process each interrupt.  Not such a big deal in the old days, but that overhead really adds up as traffic levels keep increasing.
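To make the poll-mode point concrete, here is a minimal Python sketch. It is not Check Point code; `busy_poll`, `blocking_wait`, and the always-empty `rx_queue` are illustrative names I made up. It compares the CPU time charged to a loop that keeps polling an empty queue with the CPU time charged to a thread that sleeps until it is woken (standing in for interrupt-driven processing). The operating system bills the polling loop for close to a full core even though no packets are processed, which is roughly what htop is showing for the SND cores.

```python
import threading
import time


def busy_poll(duration_s: float) -> float:
    """Spin in a tight loop checking a queue; return the CPU seconds consumed."""
    rx_queue = []  # hypothetical receive queue; nothing ever arrives in this demo
    start_cpu = time.process_time()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if rx_queue:  # the poll check: finds nothing, loops again immediately
            rx_queue.pop()
    return time.process_time() - start_cpu


def blocking_wait(duration_s: float) -> float:
    """Sleep until woken (or timeout); return the CPU seconds consumed."""
    packet_arrived = threading.Event()  # stands in for an interrupt; never set here
    start_cpu = time.process_time()
    packet_arrived.wait(timeout=duration_s)
    return time.process_time() - start_cpu


if __name__ == "__main__":
    wall = 0.5
    print(f"busy-poll CPU time over {wall}s wall clock: {busy_poll(wall):.3f}s")
    print(f"blocking  CPU time over {wall}s wall clock: {blocking_wait(wall):.3f}s")
```

Both functions handle exactly zero packets, yet only the polling one looks "busy" to the OS. That is why the Linux tools alone cannot tell an idle poll-mode SND from a saturated one.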

UPPAK does have some limitations as compared to KPPAK mode, please see these relevant pages from my Gateway Performance Optimization R81.20 Course with my current understanding of UPPAK:

uppak1.png
uppak2.png

 

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Lesley
Leader

First of all, great way to start a topic, loads of info.

What would you like help with? The highly loaded SNDs are 'cosmetic' and should not cause issues. I can confirm this because I have a 9000 with the same behavior running in production.

Do you need help with access to the web interface?

Two things stand out to me. First, you rolled back the Jumbo (why?); this should not be needed.

Second, are you building a cluster with two different hardware units? And was this new unit prepared in the production network rather than in a lab environment?

-------
If you like this post please give a thumbs up(kudo)! 🙂
israelfds95
Contributor

.

Lesley
Leader

Hardware Requirements for Cluster Members

ClusterXL operation completely relies on internal timers and calculation of internal timeouts, which are based on hardware clock ticks.

Therefore, in order to avoid unexpected behavior, ClusterXL is supported only between machines with identical CPU characteristics.

 

23500 has: 2x CPUs, 20x physical cores, 40x virtual cores (total)

9700 has: 16 physical cores, total of 32 logical cores

Synchronized Cluster Restrictions

These restrictions apply when you synchronize Cluster Members:

  • All Cluster Members must run on identically configured hardware platforms.

Software Requirements for Cluster Members

ClusterXL is supported only between identical operating systems (all Cluster Members must be installed on the same operating system).

ClusterXL is supported only between identical Check Point software versions - all Cluster Members must be installed with identical Check Point software, including OS build and hotfixes.
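As a rough way to think about the requirements quoted above, here is a small Python sketch that compares two members attribute by attribute. The `Member` class, its field names, and the placeholder strings are my own assumptions for illustration; this is not a Check Point tool, and only the logical core counts come from this thread.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Member:
    """Hypothetical description of a cluster member (field names are assumptions)."""
    name: str
    logical_cores: int
    os_build: str
    software_version: str
    hotfix: str


def mismatches(a: Member, b: Member) -> List[str]:
    """Return the attribute names (other than 'name') that differ between a and b."""
    return [
        f.name
        for f in fields(Member)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


if __name__ == "__main__":
    # Core counts are from the post above; the other values are placeholders.
    old = Member("23500", 40, "same-os", "same-version", "same-jumbo")
    new = Member("9700", 32, "same-os", "same-version", "same-jumbo")
    print(mismatches(old, new))
```

Even with the OS, version, and Jumbo aligned, the 23500/9700 pair still differs on core count, so it would not meet the "identical CPU characteristics" requirement.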

I assume the 23500 has a Jumbo installed and that it does not match what is on the 9700.

Putting a firewall without a Jumbo in a cluster with a member that has a Jumbo is a big red flag.

------

Regarding memory load, this is normal behaviour for a Linux system, even when idle. The most important thing to check in 'top' is swap usage.

-------

For ARP: if you swap out hardware and move cables, the network needs to find the new device, and you will run into stale ARP cache entries, especially if you are going to use the same IPs. The old firewall's MAC address will still be in the cache. You can either wait, reboot, or consider using VMAC (which also helps with ClusterXL failover issues if the switches cannot handle the firewall's MAC change).

  • SecureXL status on all Cluster Members must be the same (either enabled, or disabled)

  • Number of CoreXL Firewall instances on all Cluster Members must be the same

    Lesley_2-1723301822138.png

     

    Notes:

    • A Cluster Member with a greater number of CoreXL Firewall instances changes its state to DOWN

    • Fail-over from a Cluster Member to a peer Cluster Member with a greater number of CoreXL Firewall instances keeps all connections.

    • Fail-over from a Cluster Member to a peer Cluster Member with a smaller number of CoreXL Firewall instances interrupts some connections. The connections that are interrupted are those that pass through CoreXL Firewall instances that do not exist on the peer Cluster Member.
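The fail-over notes above can be sketched as a small Python function. This is only an illustration under my own assumptions: `failover_impact` is a made-up name, instances are assumed to be numbered from 0, and the peer is assumed to have the lowest-numbered instances. It does not reflect how ClusterXL is implemented internally.

```python
from typing import List


def failover_impact(active_instances: int, peer_instances: int) -> List[int]:
    """Return the CoreXL Firewall instance IDs whose connections would be
    interrupted when failing over from the active member to the peer.

    Assumption: instances are numbered 0..n-1 on both members.
    """
    if peer_instances >= active_instances:
        # Peer has the same or a greater number of instances: all connections kept.
        return []
    # Instances that exist on the active member but not on the peer.
    return list(range(peer_instances, active_instances))


if __name__ == "__main__":
    print(failover_impact(8, 6))  # peer has fewer instances
    print(failover_impact(6, 8))  # peer has more instances
```

With hypothetical counts of 8 and 6, failing over to the smaller member interrupts the connections handled by instances 6 and 7, while failing over in the other direction interrupts none.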

-------
If you like this post please give a thumbs up(kudo)! 🙂
israelfds95
Contributor

.

_Val_
Admin

You cannot, I repeat, YOU CANNOT use MVC to replace HW with a different module. You need to install a new cluster, cut the network, re-cable, push the policy and check network traffic. 

You have to expect downtime during this operation. 

israelfds95
Contributor

.

_Val_
Admin

What you are describing is not MVC procedure. 

Did you enable MVC on the old appliance?

israelfds95
Contributor

Hi,

My point in bringing up this topic was simply to inform others who might encounter the same situation during a firewall migration, where all SNDs appear to be 100% stuck on htop or top. There is a specific SK available right now that describes issues with uptime, ps, top, and htop not synchronizing information properly, leading to confusion during troubleshooting. I'm not asking for any help.

And I don't understand why you are complicating things here.

I call this MVC because we have a multi-version setup with R81.10 and R81.20, with MVC enabled not just on the gateway with the higher version. Since this is getting complicated, here is the scenario I mentioned: if you put the two different appliances in the 23500 cluster configuration, replacing the standby with the 9700 with MVC active, it is better for the 9700 to be down and receiving the cluster policy than for all the equipment to be turned off at once. The cluster is established, but the 9700 member is down and has a lower CPU count, which makes it suitable to take over the cluster. In fact, when I turn off the active 23500 to install the new 9700 and finish the activity, the first 9700 I added to the network takes over the cluster as Active.

If this were a normal scenario, upgrading a single cluster with the same hardware appliances, we would enable MVC only on the gateway with the higher version, as described in the Installation and Upgrade Guide > Multi-Version Cluster Upgrade Procedure - Gateway Mode. I don't understand why this situation is being complicated. As I mentioned, I won't create a cluster with a 23500 and a 9700 and keep it that way. The problem I described starts before establishing SIC; the new firewall hasn't become operational yet.

Lesley
Leader

I would recommend reading the appliance homepage (including the important notes and known limitations):

https://support.checkpoint.com/results/sk/sk181698

From there you can find 2 links that refer to this SND SK:

LightSpeed 10/25/40/100G QSFP28 Ports Administration Guide.

But since you don't need any help, I think there is no point in following this topic any further.

-------
If you like this post please give a thumbs up(kudo)! 🙂
israelfds95
Contributor

For this activity, the expected behavior is: the 9700 establishes SIC, sends the Install Policy with the cluster policy, and cphaprob stat shows the 9700 as down and the 23500 as active. However, since the 9700 is capable of assuming the cluster, I either manually shut down the 23500 or execute cpstop on it, allowing the 9700 to take over the cluster as Active. Then, I proceed to install the other 9700 and remove the remaining 23500. I have done this multiple times, including with this client who has 2 clusters of 23500s. I replaced one, and the other exhibited this strange behavior, with both 9700s not even communicating with the SMS to establish SIC. When reverting to the 23500, everything operates normally.

Lesley
Leader

It is unclear because the statements you write here are different from the document, if I can quote it:

"During the test of connecting the client's network cables to the firewall, the image below shows the Gaia web management access page via Mgmt as inaccessible.

Below is a TCP dump on eth1-02.2555, the interface that the client designated for the management network. It shows ARP requests, and you can also see CCP UDP 8116 traffic between the new 9700 firewall and the active 23500 member."

I think the migration backfired by using unsupported methods, ending up with an unresponsive cluster. I would place back the new cluster without interfaces and update the Jumbo again on both members. Then try to get ssh / web interface reachable again, from there you can start again. (Get console working and then mgmt for web interface)

-------
If you like this post please give a thumbs up(kudo)! 🙂
israelfds95
Contributor
 

This is a difficult case because the two cluster members are geographically separated, and the client is a government institution that requires very high availability. That's why I'm following this approach of keeping one member active while adding the new 9700 (even if it stays in a down state). Then I can shut down the other side, and this new 9700 will take over the cluster as active, allowing me to run get interfaces again and push a new policy.

However, what intrigues me in this case is that the firewall loses management access even when a notebook is connected directly to the Mgmt port, without having established SIC with the management server. I also observed this behavior repeating in the test lab with no cables connected to the GBIC module interfaces, and only the serial and Mgmt interfaces were in use.

Chris_Atkinson
Employee

Is something unclear in the SK, or is it just that it doesn't show that it applies to 9000 series devices, or something else?

CCSM R77/R80/ELITE
israelfds95
Contributor

I understand the solution from the SK, but the issue is that htop, ps, uptime, and top have always been functional on Gaia, including in the recent versions R81.10 and R81.20. However, they are no longer operating correctly, which caused confusion until I realized that they now present inconsistencies. Therefore, if these tools will no longer be used on Gaia and only CPview will be used, Check Point needs to notify the community that is accustomed to using these tools for monitoring and troubleshooting. Alternatively, the options for these commands should be removed from Expert mode. We are also noticing this more on the 9000 models.

israelfds95
Contributor

Hi Timothy_Hall, that was very insightful. Could you tell me whether Check Point has any plans to make htop, top, ps, and uptime work normally on new devices that come with UPPAK by default? Also, in these cases, is manually switching from UPPAK to KPPAK on these firewalls a good option, or could it cause unwanted impacts and behaviors on the new devices that come with UPPAK?

Timothy_Hall
Legend

The Linux-based monitoring tools are working normally and reporting CPU usage from the perspective of Gaia/Linux.  From the Check Point software perspective, cpview shows how much of the CPU time consumed in poll mode by the SNDs was actually spent processing real traffic, vs. a poll check that found nothing to process. 
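As a rough illustration of the two perspectives, here is a simplified Python sketch. It is not how cpview actually computes its figures; `snd_views` is a made-up name, and it assumes every poll costs the same amount of CPU time whether or not it finds a packet, which is a deliberate simplification.

```python
from typing import List, Tuple


def snd_views(poll_results: List[bool]) -> Tuple[float, float]:
    """Return (os_view_pct, app_view_pct) for a polling thread.

    poll_results[i] is True if poll i found traffic to process and False if
    the poll check found nothing. Assumption: every poll costs equal time.
    """
    if not poll_results:
        return (0.0, 0.0)
    # The OS only sees a thread that never sleeps, so it reports a full core.
    os_view = 100.0
    # The application can separate productive polls from empty ones.
    app_view = 100.0 * sum(poll_results) / len(poll_results)
    return (os_view, app_view)


if __name__ == "__main__":
    # One productive poll out of four.
    print(snd_views([True, False, False, False]))
```

In this toy case the OS-level tools would report the core as fully busy, while an application-level counter would show that only a quarter of the polls did real work.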

The old interrupt-driven paradigm of "high CPU load=bad" is well ingrained in many a system or network administrator, and will definitely take some getting used to.  Hopefully threads such as these can spread the word as it can certainly be concerning the first time you run into it.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
israelfds95
Contributor

I understand better now: from here on it is more a matter of checking CPview. It would also be useful if Check Point added a message in Expert mode alerting that CPview should now be used. When the administrator tries to use commands like htop or top, they would see the message and understand that these commands are no longer supported and that only Check Point’s software should be used. This would prevent the misconception that it’s a problem with the firewall, saving time and avoiding extensive troubleshooting until the person finds the SK or this post (or another) in the community.

Hugo_vd_Kooij
Advisor

It's a bit hard to determine when to put up notes like this.

You may warn when it's not relevant and then cause even more confusion.

So choosing to let it feel awkward for the people doing the most awkward installations is probably the sane choice.

<< We make miracles happen while you wait. The impossible jobs take just a wee bit longer. >>
PhoneBoy
Admin

This is the SK that describes this behavior: https://support.checkpoint.com/results/sk/sk180299 
It's possible this will be addressed in a future release, but I cannot say for certain.
