R80.40 (Serious!) Stability Issues on Open Server

biskit · ‎2020-09-08

I'm hoping someone relevant in Check Point gets to see this.

On Sunday I upgraded a customer on Open Server from R80.30 to R80.40.

First of all, CPUSE in the WebUI gave me the recommended R80.40 clean install and upgrade package. Verification said that clean install was allowed, but upgrade was not supported. Very odd??? But fine....

The R80.40 with JHFA T77 Blink image said the upgrade was allowed, so I did that instead. All appeared to work.

Testing showed that pretty much everything worked, but VPNs were very unstable. With site-to-site, while the tunnel appeared to remain up (logs showed traffic, and no constant key exchanges), the end user experience was that traffic would briefly work (get a response from the other side of the tunnel), then not work, then work, then not work.... Timings seemed random, but it happened so frequently that it was unusable.

Remote Access VPN - the customer uses a mix of Capsule VPN and Check Point Mobile. Capsule was rock solid throughout ✔️. Check Point Mobile was horrendously unstable - again completely unusable.

Sadly I don't have any log files etc. from the devices any longer (so pointless raising an SR). With other jobs on the list prior to this I'd been on it for 15 hours - it was 3am and I was tired and hungry so wasn't thinking far enough ahead to collect data. But I noticed that the vpnd.elg file had errors matching sk164878. Solution 1 didn't work, and solution 2 was a clean install.

I did a clean install of R80.40 from the ISO, configured all the interfaces etc. again, put Take 77 on (because we already know R80.40 on Open Server is not supported without JHFA), installed the policy, then tested again. Exactly the same thing happened. Everything non-VPN related worked fine, but VPN was highly unstable.

There were also some odd things being reported by SNMP, but I wasn't really concerned with that.

There were also some other weird occurrences such as the LAN not being able to get out to the Internet. Reboot the gateway and it would work again for a while, then randomly stop again.

At this point I had 2 hours left until 600 users started logging in. I had to do something drastic. So I installed R80.30 from the ISO. Thankfully (and not surprisingly as they were previously running R80.30 with no issues) everything worked perfectly. All SNMP alerts cleared up too, and the LAN worked flawlessly. I've left them on R80.30 and everything is fine.

I had two weeks of absolute hell with R80.40 on Open Server when it was first released (before there were any JHFAs). Same problems - random VPN instability, and randomly the LAN would stop passing traffic. Very shortly afterwards Check Point pulled support for R80.40 on Open Server while issues were being addressed. After JHFA T25 was released it went 'back on the market' - with T25 supposedly fixing the issues. Well, I can confirm beyond doubt that R80.40 Gateway on Open Server with T77 is still unstable and unusable (R80.40 Management on Open Server seems rock-solid - it's just the gateway which is flaky). Maybe it's only affecting certain hardware? Both times I've had these issues have been on HP DL360/DL380 hardware. The most recent was DL380 Gen10.

We have R80.40 out there on many flavours of CP appliance and it's rock solid. It's only Open Server gateways that seem to have the problem.

So there's nothing anyone can do, and as I've rolled back to R80.30 and have no log files from the flaky R80.40 there's nothing TAC could do for me now either. I didn't have the time to wait for TAC to fix it - I was really up against the clock, and R80.30 is stable. I just wanted to get this info out there so that Check Point is aware there are still issues, and so that anyone else thinking of going R80.40 Gateway on Open Server at least lab tests it thoroughly on their hardware before putting it into production. It might save you a world of pain.
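For anyone who ends up in the same spot, here is a minimal sketch of grabbing the relevant logs before rolling back. It is an illustration only: `collect_logs` is a hypothetical helper, and the `$FWDIR/log` location and the destination directory in the usage line are assumptions to check on your own gateway. It only packs matching files into a tarball, so it is no substitute for whatever TAC asks you to collect.

```shell
#!/bin/sh
# collect_logs <source_dir> <dest_dir> <glob>...   (hypothetical helper)
# Packs files in <source_dir> matching the globs into one timestamped
# tarball under <dest_dir> and prints the tarball path.
# Assumes file names without spaces (true for names like vpnd.elg.0).
collect_logs() {
    src=$1
    dest=$(cd "$2" && pwd) || return 1      # absolute path, survives the cd below
    shift 2
    archive="$dest/pre_rollback_$(uname -n)_$(date +%Y%m%d_%H%M%S).tgz"

    # Expand the globs inside the source directory, keep regular files only.
    files=$(
        cd "$src" || exit 1
        for pat in "$@"; do
            for f in $pat; do
                if [ -f "$f" ]; then printf '%s\n' "$f"; fi
            done
        done
    ) || return 1

    if [ -z "$files" ]; then
        echo "collect_logs: nothing matched in $src" >&2
        return 1
    fi

    (cd "$src" && tar -czf "$archive" $files) || return 1
    printf '%s\n' "$archive"
}
```

Something like `collect_logs "$FWDIR/log" /some/dir/with/space 'vpnd.elg*'` would then leave a single archive to copy off the box before the reinstall wipes it.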

G_W_Albrecht · ‎2020-09-08

When doing the upgrade (on StandAlone? Or did you just leave out that this is SMS + GW Cluster?), you can have TAC supporting you in RAS, either watching the install or on standby in case something goes amiss. After the first try failed completely, this is the best way to go in my eyes...

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

biskit · ‎2020-09-08

It was a cluster (separate Mgmt). The problem I had was the clock was ticking, and there was just me. I was testing stuff, trying to look in the support site for clues/fixes, and I tried TAC Chat but that didn't answer. Often when I've called TAC it takes them a while to find someone to call me back, and at 3am on Sunday night (Monday morning) and assuming most people are still working from home(?) I figured there wouldn't be a ready supply of available engineers waiting to help immediately. So being so short on time I went with the R80.30 option just to get back working again. Not ideal I know 😞

shlomip · ‎2020-09-08

@biskit ,

I am sorry to hear about your experience.

I can say that we are not aware of any specific VPN issues, certainly none that cause the behavior you described, and certainly none that are specific to Open Servers.

Rest assured that if we were aware of any, we would fix them quickly.

As you have already gone back to R80.30, it will be very hard to troubleshoot, but we can try to check whether we can see it in-house.

If you happen to upgrade to R80.40 again and face the same issue, please open a TAC issue and share its # so we can try and help.

I also suggest checking this thread for VPN troubleshooting if needed.

Thanks!

Ilya_Yusupov · ‎2020-09-08

Hi @biskit ,

I see that you upgraded from R80.30. Was it with Linux 3.10? If yes, can you please share with me offline the 'vpn tu tlist' output from R80.30?

Also, can you mention whether you are using ISPR?

Thanks,

Ilya

biskit · ‎2020-09-08

I put the kernel 2.6 version of R80.30 back on as this is what it had previously, and with the clock ticking I just wanted to put back what I knew worked.

(For those that are wondering, there wasn't enough room on the /var/log partition to export the snapshot, so the only option I had was a clean build and paste the Gaia config back on).

Curious question about ISPR. I presume you mean ISP Redundancy? In which case - yes - it does have ISPR. But even while the VPN was down, 'cpstat fw' showed ISP1 as primary, so I don't think it was switching between ISP lines. Also, maybe irrelevant, but the other gateway/customer I mentioned who had the same issues (with no JHFA's on) did NOT have ISPR.

I don't have the output any more, but I can also confirm that at the time the VPN was "down", 'vpn tu tlist' showed the affected peer in the list, so I presume that means the tunnel was still up? And the reason it wasn't passing traffic (as seen in vpnd.elg) was something else?

Maarten_Sjouw · ‎2020-09-08

We have been having a lot of problems with R80.40 on a CloudGuard implementation (private cloud) on OpenStack, where we found the new User Mode FW to be the problem. When we turned it back to Kernel Mode FW it started to behave normally.

Regards, Maarten

Ilya_Yusupov · ‎2020-09-08

Hi @biskit ,

Yes, I was referring to ISP Redundancy. We have a known issue related to routing where, in some cases with ISPR or when a VPN tunnel is not accelerated, packets go out through the wrong interface/route.

This is why I asked about ISPR and the 'vpn tu tlist' output: to check whether it's the same issue or not.

The fix is ready and we will try to release it in the next JHF. I will update you once that JHF is ready.

Thanks,

Ilya

biskit · ‎2020-09-09

@Ilya_Yusupov That's interesting. On the other install I mentioned (with no JHFAs at the time, and no ISPR) there was definitely some weird routing happening at the times the problem randomly occurred. We saw traffic between two internal interfaces being routed via the External interface, even though it was not in the routing table. We saw this in tcpdump. After reverting to R80.30 that weird routing did not happen and everything worked correctly again. So it's good to see that problem is still being investigated and fixed, but based on my previous environment I would say it's not necessarily to do with ISPR?

On the latest problem on Sunday, I tried with 'fwaccel on' and 'off', and also 'vpn accel on' and 'off'. Nothing made a difference to the VPN stability.

Ilya_Yusupov · ‎2020-09-09

Hi @biskit ,

You confirmed that this is the same issue. As I mentioned, the issue may happen on 2 flows:

1. When you have ISPR - there is a specific flow that may cause a wrong routing decision.

2. When you have a VPN tunnel that is not accelerated - in that case we may also get a wrong routing decision.

I hope the above is clear and answers your questions.

Thanks,

Ilya

biskit · ‎2020-09-09

@Ilya_Yusupov
I was trying to point out that I've seen the same happen on a system without ISPR, and also there was no difference whether acceleration was on or off.

Ilya_Yusupov · ‎2020-09-09

@biskit ,

The issue may happen without ISPR on flows where the VPN tunnel is not accelerated.

Once you encounter the issue, turning vpn accel off or on will not matter, as the issue is related to a corrupted routing cache table.

Tobias_Moritz · ‎2020-09-08

Matt, just a little side note: if you ever again face the problem of not having enough space on the gateway's file system to export a snapshot, you can mount a CIFS or NFS share using expert shell and then let the export feature write directly to this share. This is only possible, of course, if such a share with enough space is available and reachable from the gateway.

[Expert@firewallABC:0]# clish -c "show snapshots"
Restore points:
---------------
preR8030JHFAT140

Creation of an additional restore point will need 19.104G
Amount of space available for restore points is 86.8G

[Expert@firewallABC:0]# mkdir /mnt/smbshare
[Expert@firewallABC:0]# mount -t cifs -o user=domain/userXXX //servername/fileshare /mnt/smbshare
Password:

[Expert@firewallABC:0]# clish -c "set snapshot export preR8030JHFAT140 path /mnt/smbshare/userXXX/directory/ name preR8030JHFAT140_firewallABC"
Exporting snapshot. You can continue working normally.
You can use the command 'show snapshots' to monitor exporting progress.

[Expert@firewallABC:0]# clish -c "show snapshots"
Restore points:
---------------
preR8030JHFAT140

Export image now under creation:
---------------------------------
preR8030JHFAT140 (3%)

Creation of an additional restore point will need 19.104G
Amount of space available for restore points is 86.8G

[Expert@firewallABC:0]# clish -c "show snapshots"
Restore points:
---------------
preR8030JHFAT140

Export image now under creation:
---------------------------------
preR8030JHFAT140 (100%)

Creation of an additional restore point will need 19.104G
Amount of space available for restore points is 86.8G

[Expert@firewallABC:0]# clish -c "show snapshots"
Restore points:
---------------
preR8030JHFAT140

Creation of an additional restore point will need 19.104G
Amount of space available for restore points is 86.8G

[Expert@firewallABC:0]# ls -lh /mnt/smbshare/userXXX/directory/preR8030JHFAT140_firewallABC.tar
-rw-r--r-- 1 admin root 8.1G Mar  5 11:54 /mnt/smbshare/userXXX/directory/preR8030JHFAT140_firewallABC.tar

[Expert@firewallABC:0]# umount /mnt/smbshare
[Expert@firewallABC:0]# rmdir /mnt/smbshare
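
If you want a script to read the two size lines that 'show snapshots' prints in the transcript above, here is a minimal sketch. `snapshot_fits` is a hypothetical helper name, and it assumes both figures carry the "G" suffix seen above; anything else is reported as "could not tell" (exit status 2). It only tells you whether Gaia reports room for another local restore point, and says nothing about the space the export itself needs.

```shell
#!/bin/sh
# snapshot_fits: reads 'show snapshots' output on stdin.
# Exit status: 0 = reported available space >= space needed,
#              1 = not enough space,
#              2 = the two figures could not be read.
snapshot_fits() {
    awk '
        /restore point will need/            { need  = $NF }
        /space available for restore points/ { avail = $NF }
        END {
            # Only the "G" suffix seen in the thread is understood here.
            if (need !~ /^[0-9.]+G$/ || avail !~ /^[0-9.]+G$/) exit 2
            sub(/G$/, "", need)
            sub(/G$/, "", avail)
            exit (avail + 0 >= need + 0) ? 0 : 1
        }
    '
}
```

Usage would be along the lines of `clish -c "show snapshots" | snapshot_fits || echo "no room locally, use a share"`.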

biskit · ‎2020-09-09

@Tobias_Moritz
Good idea, thanks!

Thomas_Eichelbu · ‎2020-09-28

Hello all,

Well, I am pretty sure this has nothing to do with Open Server-specific hardware!
We have also seen a lot of instabilities, crashes and so on with HFA77.
Machines with TE enabled were especially affected by this issue!

TAC said:

We indeed identified an issue with jumbo take 69 and above. It’s related to Spike Detector feature we added in this Jumbo. A permanent fix for this will be added in the next ongoing take. For now there is a workaround we can do by disabling this feature. Please see below the procedure to disable it. This is of course if you want to install this take once again and not wait for the next GA take with an integrated fix for this issue.

Disable Spike Detector –

[Expert@Firewall]# cpwd_admin stop -name "SPIKE_DETECTIVE"

[Expert@Firewall]# $CPDIR/bin/cpprod_util CPPROD_SetValue fw1 SpikedetectiveOff 4 1 1

Re-enable Spike Detector –

[Expert@Firewall]# $CPDIR/bin/cpprod_util CPPROD_SetValue fw1 SpikedetectiveOff 4 0 1

[Expert@Firewall]# cpwd_admin start -name "SPIKE_DETECTIVE" -path "$FWDIR/bin/spike_detective" -command "spike_detective"

So takes 69 and above are faulty, and should not be used with a running Spike Detector!

This helped us get R80.40 running on many different appliances and also Open Servers.
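
To see what the watchdog thinks after running either procedure, here is a minimal sketch (mine, not from TAC). It assumes that `cpwd_admin list` prints one row per monitored process with the process name in the first column; `spike_detective_row` is a hypothetical helper name.

```shell
#!/bin/sh
# spike_detective_row: reads 'cpwd_admin list' output on stdin and prints
# the row for SPIKE_DETECTIVE, if any.
# Exit status: 0 = a row was found, 1 = no such row.
# Assumption: the process name is the first whitespace-separated column.
spike_detective_row() {
    awk '
        $1 == "SPIKE_DETECTIVE" { print; found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

For example: `cpwd_admin list | spike_detective_row || echo "SPIKE_DETECTIVE is not listed by the watchdog"`. How a stopped process is shown in that list is not something I have verified, so read the printed row rather than relying on the exit status alone.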

best regards
Thomas
