Re: Check Point R81.20 stability issues

sorinstf · ‎2024-01-17

Hello,

I am writing this post to ask for advice. If Check Point TAC wasn't so bad, I would not had to ask for another opinion.

In November we upgraded some of our Check Point gateways from R80.40 T180 to R81.20 Take 38. We started with the MDS, and everything was OK. The FWs are HW appliances (3600/5000/6000) running only the IPS and IA blades. Only static routing is used; these FWs are not internet-facing and have a standard FW ruleset.

In less than a month we started encountering stability issues:

1) On a 6000 series FW cluster the PDP broker daemon crashed. The HA cluster was up, and it took us some time to troubleshoot the issue and fail over. The FW has a low load; it was sized for 10 times the load.

2) On a 5600 series FW cluster, utilization on each CPU core climbed from 7-8% up to 95-98% over about 14 hours from last Sunday afternoon. You can imagine what happened on Monday morning ... nothing was working anymore!

This incident has repeated 3 times in 2 months; each time we did an HA failover and restart. No rules have changed in the past 2 months, and there were no configuration updates.

TAC was not able to pinpoint the root cause and only made some performance improvement recommendations (HCP and cpinfo logs were collected as well).

Now we have configured alarms on CPU usage with thresholds to receive notifications when a fw goes crazy.
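A minimal sketch of this kind of alarm logic, for anyone who wants to script it themselves. It is illustrative Python only: the function names, the thresholds and the sampling interval are assumptions, and it expects CPU samples (in percent) that you already collect from your own monitoring. It checks two things: a sustained breach of a fixed threshold, and a slow upward ramp like the 7-8% to 95-98% climb described above.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """True if the last `min_consecutive` samples are all at or above threshold."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(s >= threshold for s in samples[-min_consecutive:])


def slope_per_hour(samples: Sequence[float], interval_minutes: float) -> float:
    """Least-squares slope of CPU % over time, in percentage points per hour."""
    n = len(samples)
    if n < 2:
        return 0.0
    # Timestamps in hours, assuming evenly spaced samples (an assumption).
    xs = [i * interval_minutes / 60.0 for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def should_alert(samples: Sequence[float], interval_minutes: float,
                 threshold: float = 80.0, min_consecutive: int = 3,
                 max_slope: float = 5.0) -> bool:
    """Alert on a sustained breach OR on a steady ramp (hypothetical defaults)."""
    return (sustained_breach(samples, threshold, min_consecutive)
            or slope_per_hour(samples, interval_minutes) >= max_slope)
```

The slope check is the part that would catch a slow ramp hours before a plain threshold fires; the default values here are placeholders to tune against your own baseline.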

Is there anyone else encountering stability issues on R81.20 T38?

Thanks!

P.S. We took the decision to upgrade to R81.20 when R80.40 was still announced as EOS in Jan 2024.

_Val_ · ‎2024-01-22

Do you have SRs for any of those problems? If yes, which ones are you struggling with?

sorinstf · ‎2024-01-22

Hello Val,

The biggest issues are with the 5000 series. Yesterday we had to fail over a 5200 FW to the standby member, then reboot the firewall. I have an SR for a 5600 series.

CPU usage started increasing for no apparent reason from 7-8% to 80% (at which point we failed over so as not to impact the availability of the services).

This resolution is from the TAC case opened on 15/11:

"the issue stemmed from the overutilization of SNDs and out-of-order packets. Upon further investigation within SND, the identified thread function causing the issue is:

# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ...............................................
# 90.95%    snd_c    [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
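For anyone comparing similar output across gateways, here is a small illustrative Python sketch that pulls the overhead, command and symbol out of a row like the one above. The column layout is assumed from the pasted text (perf-report style), the leading "#" is treated as optional, and the function name is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class PerfRow(NamedTuple):
    overhead_pct: float
    command: str
    shared_object: str
    symbol: str


# Assumed row layout: "<pct>%  <command>  <shared object>  [<type>] <symbol>"
_ROW_RE = re.compile(
    r"^#?\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[.\]\s+(\S+)\s*$"
)


def parse_perf_row(line: str) -> Optional[PerfRow]:
    """Parse one report row; returns None for headers, separators, blanks."""
    m = _ROW_RE.match(line)
    if m is None:
        return None
    return PerfRow(float(m.group(1)), m.group(2), m.group(3), m.group(4))


def is_spinlock_hotspot(row: PerfRow, min_pct: float = 50.0) -> bool:
    """Flag rows dominated by kernel spin-lock contention (threshold is arbitrary)."""
    return "spin_lock" in row.symbol and row.overhead_pct >= min_pct
```

Feeding it the row quoted in the TAC resolution gives a `PerfRow` whose symbol is `native_queued_spin_lock_slowpath`, which `is_spinlock_hotspot` flags.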

TAC engineer recommended a couple of config changes and tweaks:

1) Enable the drop template on the cluster

2) Enable Dynamic Balancing on both members. - not applicable on 5200/5400 as they both have 2 cores. For the 5600 series we are going to configure an additional core for SND (2 for SND and 2 for FW).

3) Prioritize the placement of the most frequently used rules at the top of the policy list. - I don't think this is an issue, as this FW's CPU load is usually between 7% and 15%.

4) If feasible, expedite port 1024 in the fast path. - this is where the TAC engineer was wrong (I believe), as cpview showed 24% CPU utilization for TCP:1024 and he thought that meant high traffic on this port. In fact, my understanding is that TCP:1024 means usage on ports above TCP 1024.

Basically, we still don't know what triggers this strange behavior on 5000 series on 81.20. It was rock stable on 80.40.

"root cause of the issue as SND and out-of-order packets" Why? What is causing this behavior on R81.20? Are the appliances too old for R81.20? I know they are EOS Dec 2025.

Thanks!

@the_rock - that's good to know!!

_Val_ · ‎2024-01-23

Here are some notes.

First, it always helps to be on the latest recommended HFA level.

Second, correlation does not mean causation. You are blaming R81.20, and that may or may not be true. I remember a case when we were chasing an elusive cluster instability trigger. At the same time, the actual problem was related to a targeted port scan from the internet, overwhelming the box in a couple of seconds and causing CLX to misbehave.

You also claim TAC was not good at all while listing several recommendations which make a lot of sense to me personally. Did you follow them yet?

I would encourage you to continue working with TAC on that. If you think your assigned engineer was incorrect with his/her analysis, you can always argue your point in the case, or by phone, or request an escalation.

R81.20 is a recommended version now. That means we have good statistics about its stability and adoption.

Hope you figure out the root cause sooner rather than later.

sorinstf · ‎2024-01-23

Hello Val,

Thanks for the follow up. Definitely TAC recommendations are the way to go. TAC engineer was very responsive, and he replied in due time to our updates. Our frustration is with R81.20 issues, not with the engineer.

The only change left to implement is on the 5600 series, where we can redistribute the cores: 2 for SND and 2 for CoreXL. Then we'll wait another 1-2 weeks to see if the issue reoccurs. For the 5200/5400 series, there's no option for core redistribution.

I thought about updating to JHA T41, but I don't see anything in the release notes related to our issue.

_Val_ · ‎2024-01-23

Hmm, could I misunderstand this sentence: "If Check Point TAC wasn't so bad, I would not had to ask for another opinion"?

sorinstf · ‎2024-01-23

Let's say we all have the right to a second opinion and, as @the_rock mentioned above, this issue might be encountered by other users as well! Spreading the good news to every1:))

the_rock · ‎2024-01-23

Thats right M8 😉

I mean mate haha

But in all seriousness @sorinstf, yes, of course we all have different opinions, but in my mind, as long as we are respectful to others and it's a healthy discussion, everyone benefits.

Best,

Andy

_Val_ · ‎2024-01-23

You definitely have a right to ask any question on CheckMates, and also a right to complain about the quality of support, if you have any issues with it. I am just trying to understand the situation here.

Albin · ‎2024-04-09

We had high CPU and slowness on the SND cores, and spike_detective also showed native_queued_spin_lock_slowpath as the issue. There was no more traffic than usual when this occurred. This also caused some interface issues with LACP. The CoreXL instances showed normal usage. We are on R81.20 Take 41.

Does anyone have more information on this?

We have a TAC case open at the moment to get a better understanding as well.

Albin · ‎2024-04-29

For anyone having the same issue, the SK is sk181996 and, as Hugo wrote below, the PR is PRHF32357.

There is a workaround available in the SK. We have not had the issue since implementing it, and we are going to apply the hotfix this week on the cluster where the issue occurred. We have put the workaround on all our devices just in case, until the fix is included in the Jumbo.

the_rock · ‎2024-01-22

I can tell you I actually had similar issues in the lab myself, and when I installed jumbo 41, it all went away. Never seen it with any customers, but that's for sure because that jumbo was never installed in any production environment.

Best,

Andy

sorinstf · ‎2024-03-05

Update: TAC has provided PRHF-32357.

For the last 2-3 I haven't seen random CPU spikes happening. We are still monitoring the gateways where we have installed the hotfix.

Hugo_vd_Kooij · ‎2024-04-29

I can't see PRHF32357 anywhere in the list for the latest Jumbo Hotfix (Take 54) yet. So it is not considered a generic fix at the moment.

It seems we have a customer with a similar thing.

<< We make miracles happen while you wait. The impossible jobs take just a wee bit longer. >>

Hugo_vd_Kooij · ‎2024-04-30

The issue is listed in https://support.checkpoint.com/results/sk/sk181996

Just performed the workaround with the customer and it resolved the issue for them right away.

sorinstf · ‎2024-04-30

In our case the hotfix solved the issue. It is currently applied on top of JHA T38 and JHA T41. We

cpinfo returns: HOTFIX_R81_20_JHF_T41_111_MAIN
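If you want to check which Jumbo take a hotfix string like this refers to across many gateways, a rough Python sketch follows. The naming pattern is inferred only from the single string above (an assumption, not a documented format), the function name is hypothetical, and the meaning of the trailing "111" is not known to me, so it is ignored.

```python
import re
from typing import Dict, Optional

# Pattern inferred from "HOTFIX_R81_20_JHF_T41_111_MAIN" only (assumption).
_HOTFIX_RE = re.compile(r"HOTFIX_(R\d+)_(\d+)_JHF_T(\d+)_")


def parse_jumbo_take(line: str) -> Optional[Dict[str, object]]:
    """Extract the version and Jumbo take from a hotfix identifier, if present."""
    m = _HOTFIX_RE.search(line)
    if m is None:
        return None
    return {"version": f"{m.group(1)}.{m.group(2)}", "take": int(m.group(3))}


if __name__ == "__main__":
    print(parse_jumbo_take("HOTFIX_R81_20_JHF_T41_111_MAIN"))
```

Strings that do not follow this exact shape return None rather than a guess.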
