Nickel

r80.30 take 196 - fw unable to accept new connections until fwpol is reapplied from mgmt server


Has anyone else experienced issues with the fw not passing traffic after updating to take 196?


Previously we did not have to push policy when updating between jumbo takes (e.g. 155 to 196).

However, we ran into issues last night: after we updated to take 196, new connections were not accepted until we pushed policy post-patch and reboot.

Our assumption was that, since the fwpol is cached on the fw and loaded on reboot, a policy push shouldn't be mandatory post-patch.
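A quick way to sanity-check what the gateway actually loaded after the reboot (a sketch; the local policy path is assumed from a standard Gaia layout, so treat it as an assumption):

```shell
# Show which policy the gateway is currently enforcing and when it was installed
fw stat

# The locally cached policy that the gateway loads at boot should live under
# $FWDIR/state/local/FW1/ (path assumed from a typical Gaia install)
ls -l $FWDIR/state/local/FW1/
```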

 

v/r,

Jon

(We have opened a TAC case for our RFO.)


Accepted Solutions

We saw this on our recent VSX upgrades to take 196 (three separate clusters had the issue). Our upgrades were from 155 to 196.

Also resolved with a policy install. Oddly, it did not impact all VSs, only some.

Fortunately I found a community post at the time, as I was scratching my head over what had happened.

Didn't bother with a TAC case as I had completed all of the required upgrades, but it certainly seems to be an issue for multiple Check Point customers.


15 Replies
It's strange. Were you able to run fw ctl zdebug drop to see the drop reason for new connections?
____________
https://www.linkedin.com/in/federicomeiners/
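For anyone who can afford a short debug window next time this happens, the drop reason can usually be captured with something like this (a sketch; the IP filter is a placeholder, and zdebug adds noticeable load on a busy gateway):

```shell
# Stream kernel drop notifications; Ctrl-C to stop.
# Pipe through grep to narrow output to a test host (192.0.2.55 is a placeholder).
fw ctl zdebug + drop | grep 192.0.2.55
```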
Nickel
No, we were more concerned about uptime than debugging; we have also had some serious impact recently when running captures.
Nickel
Hi,
We have experienced exactly the same issue.
In three or four patch applications, we have seen the issue on the firewall with the higher uptime.

Hi!

We had the same issue on a VSX cluster. We did not have a total outage, but most allowed traffic was blocked by the final cleanup rule. A manual policy installation did not help. We uninstalled take 196 again and opened a TAC case.

Martin


Yesterday we deployed the same take (196) on our R80.30 kernel 2.6 VSX HA cluster of two 23500 appliances.
After installing the JHF on the last member, sync was broken and completely corrupted on that member (Active/Down). So far we haven't managed to uninstall the JHF, since we ran out of time in the maintenance window.

Case opened: 6-0002039948

____________
https://www.linkedin.com/in/federicomeiners/

Hi!

 

Did you get an update from the TAC team? What have they found so far?

 

Best regards

 

Martin

Silver

Is it on 2.6 or 3.10? I've deployed Take 196 in large VSX environments running R80.30 3.10 without apparent issues, in case that makes a difference.
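For anyone unsure which kernel their R80.30 gateway runs, a quick check (a sketch; both commands are standard on Gaia to the best of my knowledge):

```shell
# Kernel version directly from the OS
uname -r                             # e.g. a 2.6.18-* vs 3.10.0-* string

# Gaia clish equivalent
clish -c "show version os kernel"
```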

Nickel

k3.10

firewall-1 ~ # cpinfo -yall

This is Check Point CPinfo Build 914000202 for GAIA
[IDA]
No hotfixes..

[MGMT]
No hotfixes..

[CPFC]
HOTFIX_R80_30_GOGO_JHF_MAIN Take: 155

[FW1]
HOTFIX_MAAS_TUNNEL_AUTOUPDATE
HOTFIX_R80_30_GOGO_JHF_MAIN Take: 155

FW1 build number:
This is Check Point's software version R80.30 - Build 001
kernel: R80.30 - Build 159

[SecurePlatform]
HOTFIX_R80_30_GOGO_JHF_MAIN Take: 155

[PPACK]
HOTFIX_R80_30_GOGO_JHF_MAIN Take: 155

[CPinfo]
No hotfixes..

[CPUpdates]
BUNDLE_MAAS_TUNNEL_AUTOUPDATE Take: 25
BUNDLE_CPINFO Take: 50
BUNDLE_INFRA_AUTOUPDATE Take: 32
BUNDLE_DEP_INSTALLER_AUTOUPDATE Take: 13
BUNDLE_R80_30_JUMBO_HF_MAIN_3_10_GW Take: 155

[AutoUpdater]
No hotfixes..

[DIAG]
No hotfixes..

[CVPN]
No hotfixes..

[CPDepInst]
No hotfixes..

Employee++

Hi All.

My name is Yifat Chen, and I manage R80.30 Jumbo releases at Check Point.

Thanks for all the details you shared here. We have a ticket associated with this issue and will update here ASAP with our findings.

Release Management Group

 


Hi,

are there any updates regarding this issue?

We are planning to update our GW to Jumbo 196.

Thanks and best regards

Tobias

Nickel

TAC:
"When upgrading the jumbo hotfix on a gateway pushing policy is not a required step. The gateway should load the last successfully pushed policy post-reboot.

However, if encountering traffic issues post hot-fix installation one of the first steps recommended would be pushing policy. "
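If you want to make that precautionary policy push part of a scripted upgrade, it can be driven from the R80.x management server with mgmt_cli, roughly like this (a sketch; the policy package and target names are placeholders):

```shell
# Run on the management server; -r true logs in as root without stored credentials
mgmt_cli -r true install-policy policy-package "Standard" \
  access true \
  targets.1 "gw-member-a" targets.2 "gw-member-b"
```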


Hi!

As described above, in our case this did not help.

We pushed the policy after upgrading to jumbo take 196 + reboot, but the policy was not enforced properly.

 

Best regards Martin


I just want to share that we had the same problem on all our non-VSX HA clusters with T191.

So it seems the problem was first introduced in T191 and is not VSX-related.

Details if needed:


Source Version:

R80.30 Gaia 2.6 JHF 155

Target Version:

R80.30 Gaia 2.6 JHF T191

Symptoms exactly as described here:

  1. Update passive cluster member. Reboot it.
  2. Checked that the passive member had loaded the policy, that load had stabilized at about zero, and that sync was working fine. Also checked fw ctl multik stat for an updated connections table and cphaprob syncstat for sync updates.
  3. Switch traffic to updated member by clusterXL_admin down on non-updated member.
  4. Complete outage, because updated member drops all traffic.
  5. Switched traffic back to non-updated member by clusterXL_admin up.
  6. Traffic is working again.
  7. Doing a policy install.
  8. Switch traffic to updated member again by clusterXL_admin down on non-updated member.
  9. This time, everything is working.
  10. The same problem occurred on the second member: it also needed a policy installation after the first boot with T191 before it would accept traffic.

After having the same experience with our first two clusters, we changed the workflow for all following ones to avoid the problem:

  1. Update passive cluster member. Reboot it.
  2. Checked that the passive member had loaded the policy, that load had stabilized at about zero, and that sync was working fine. Also checked fw ctl multik stat for an updated connections table and cphaprob syncstat for sync updates.
  3. Doing a policy install.
  4. Switch traffic to updated member by clusterXL_admin down on non-updated member.
  5. Traffic is working fine.
  6. Update second cluster member. Reboot it.
  7. Checked that the passive member had loaded the policy, that load had stabilized at about zero, and that sync was working fine. Also checked fw ctl multik stat for an updated connections table and cphaprob syncstat for sync updates.
  8. Doing a policy install.
  9. Switch traffic to updated member by clusterXL_admin up on last updated member.

We later tried a reboot to reproduce the problem, but it was not reproducible. So it only occurred on the first boot after the minor version update.

 

 

Nickel
So far it has happened on our 2.6 appliance and our 3.10 open-server FW.
Pretty certain that 196 will drop all new connections until the mgmt server pushes policy.
We tried running a fw fetch, and the gw said it downloaded and applied the policy, but the new-connection issue didn't go away until we installed policy from the mgmt server.
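For anyone comparing notes, the fetch variants discussed here look roughly like this (a sketch; the management IP is a placeholder):

```shell
fw fetch 192.0.2.10     # pull the current policy from the mgmt server
fw fetch localhost      # reload the locally cached policy instead
```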