Oliver_Fink
Nickel

Installation & Upgrade problems from R77.x to R80.x

We did several upgrade projects from different versions of R77.x to R80.x. Some went well, some went badly. I want to share our experiences and ask about yours and how you solved your problems. Maybe we can learn from each other.

Customer 1: R77.30 to R80.20.M1 & R80.10

Environment:

  • 2 security gateway clusters (HA)
    • 21800
    • 13800
  • 2 log servers
    • Smart-1 3500
  • 1 security management server
    • Open Server in VMware

Problems:

The upgrade worked without problems. We had problems running special scripts for the customer. These are not related to the upgrade but to the poor documentation of SmartEvent internals. It seems that parts of this documentation have not been updated for years.

Customer 2: R77.30 to R80.20

Environment:

  • 1 security gateway
    • Open Server
  • 2 remote office VPN gateways
    • 1120
    • 1140
  • 1 security management server
    • Open Server

Problems:

The upgrade itself went without problems, but the customer discovered problems with one VPN connection: mail could no longer be sent, while other communication (RDP) still worked. This site was connected via a DSL router with the 1120 behind it. The problem was the MSS clamping in use, which had worked seamlessly before. The following SKs were involved:

We finally got the VPN running again.

Customer 3: R77.30 to R80.20

Environment:

  • 1 security gateway cluster
    • 4600
  • 1 security management server
    • Smart-1 504, migrated to Open Server in VMware

Problems:

We did the upgrade because the customer changed his ISP. The new provider implemented an internet connection with backup via LTE, which is transparent to the firewall. As a result, the external IPs of the gateway cluster changed to private address space, something from the 10.0.0.0/8 range. The official IPs reside at the ISP's data center and are NATed there. Thus, we needed NAT traversal on incoming and outgoing VPN connections.

We changed the configuration and did the upgrade. With the R80.20 security management server and the R77.30 gateway cluster we did not experience problems. But outgoing VPN connections with NAT-T were still not possible, because you need R80.x for that. So we also upgraded the gateways. We tried a CPUSE upgrade. During this process the gateways lost their SIC names and certificates. We had no confidence that nothing else had gone wrong. So we decided to do a fresh install from a USB stick prepared with ISOmorphic.

After activating the new R80.20 gateways we had two serious problems:

  1. Massive problems with SMTP connections from the customer's MX at a service provider to the internal Trend Micro gateway. We saw a massive number of "First packet isn't SYN" log entries. They became fewer after we fixed problem #2 but did not vanish completely. We will have a closer look at this in 2019.
  2. Site-to-site and Remote Access VPNs did not work anymore. We thought we had done something wrong in the configuration. But that made no sense, because incoming VPNs that had worked with the R77.30 gateways did not work anymore either. Still, we were not sure that we had tested enough. Finally we involved TAC. The supporter gathered some information and said that he was going to analyze it. I saw some occurrences of "dropped by fwpslglue_chain Reason: PSL Reject: internal - reject enabled" when he executed "fw ctl zdebug drop". I should have issued the command myself, so shame on me. I put the debug message into Google search and the first entry led me to sk109777: Traffic is dropped with log: "PSL Reject: internal - reject enabled" after migration to a .... The solution was sk33328: How to clear $FWDIR/state/ directory to resolve policy corruption issues. This meant that we had to execute a cpstop on the security management server and all gateway cluster nodes at the same time, which means a total outage of the firewall cluster. No fun for the customer, but this fixed all our VPN problems at once.
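When a capture of "fw ctl zdebug drop" gets long, it can help to see at a glance which drop reasons dominate. The following is only a minimal sketch: it assumes the output was saved to a file and that each relevant line contains "dropped by <chain> Reason: <text>", as in the message quoted above. The function name and the file name in the usage note are hypothetical.

```shell
#!/bin/bash
# Tally drop reasons from a saved capture of "fw ctl zdebug drop" output.
# Assumption: relevant lines contain "dropped by <chain> Reason: <text>";
# everything before that part of the line is ignored.
count_drop_reasons() {
    local capture_file="$1"
    # Keep only the "dropped by ... Reason: ..." part of each line,
    # strip a trailing ";" if present, then count identical messages,
    # most frequent first.
    grep -o 'dropped by [^ ]* Reason: .*' "$capture_file" \
        | sed 's/;[[:space:]]*$//' \
        | sort | uniq -c | sort -rn
}
```

For example, after saving the debug output with "fw ctl zdebug drop > drops.txt", calling count_drop_reasons drops.txt prints one line per distinct message with its count in front.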

Customer 4: R77.20 to R80.20

Environment:

  • 2 security gateway clusters
    • Open Server
  • 1 security management server
    • Open Server

Problems:

Due to service window and other time restrictions of the customer we did the upgrade for the management server some weeks before the first gateway cluster. Everything went fine and the R77.20 gateways worked very well and without any problem with the security management server.

After that we upgraded the first cluster. At first we experienced no serious problems. Then the customer noticed that he could not reach every interface from his monitoring host. While searching for the reason I also stumbled upon "dropped by fwpslglue_chain Reason: PSL Reject: internal - reject enabled" when doing "fw ctl zdebug drop" (Please honor this: I am able to learn. ;-). Finally, together with TAC, we discovered that the reason was a difference in anti-spoofing behaviour.

We still see the debug messages from "fw ctl zdebug drop", but the customer has not identified any issue so far. That does not mean that he will not identify one tomorrow. I am not comfortable with the gateway cluster silently dropping packets while sk109777: Traffic is dropped with log: "PSL Reject: internal - reject enabled" after migration to a ... explicitly says:

Cause

A corruption to the policy files happened during the migration.

I identified that the connections dropped are high port to high port or high port to port 135. This leads to the conclusion that Windows RPC is affected. For that a service request is open at TAC.

I want to mention that TAC and sk109777 are saying different things. The sk109777 states that policy files are corrupted, TAC says that this is one possibility but there may be other reasons. Besides: sk33328: How to clear $FWDIR/state/ directory to resolve policy corruption issues was executed twice and did not change anything.

Customer 5: R77.30 VSX to R80.20

Environment:

  • 2 node VSX VSLS cluster
    • Open Server
  • security management server
    • Open Server

Problems:

The upgrade of the management server to R80.20 went without problems. When we upgraded the VSX cluster nodes, the first node also showed "dropped by fwpslglue_chain Reason: PSL Reject: internal - reject enabled" when doing "fw ctl zdebug drop". My colleague said that the customer experienced massive traffic problems. According to him, the problem is that you cannot switch the VSX cluster object back to R77.30 once you have set it to R80.20. Always think of taking a snapshot before doing an upgrade. :-)

At the moment we have a working VSX cluster node with R77.30 and a dysfunctional one with R80.20 for debugging purposes with the TAC where a service request is open. The customer is not very happy about running a cluster with only one working node when paying for 2 for good reasons. I think he is right.

Customer 6: R77.30 to R80.20

Environment:

  • 2 node security cluster
    • 13500 HPP
  • 1 SandBlast Appliance
    • TE250X
  • 2 security management servers
    • Smart-1 210
    • Open Server on VSX

Problems:

We even have problems importing the files from "migrate export" on R77.30 into a freshly installed management server with R80.20. A service request is open at TAC. We had to delete one object from the policy and got a script that did 2 SQLite deletions. We are still checking whether this fixes the problem.

In the pre-upgrade verifier we also see warnings for blades we never had activated in the customer's environment. We suspect that they originate from an export from an MDM of the customer's former service partner.

Customer 7: R80.10 fresh installation

Environment:

  • 1 security gateway cluster
    • 4200
  • 1 security management server
    • Open Server VMware

Problems:

Rules were ordered by traffic flow direction, with layers that implement the detailed rules for those directions. Randomly, rules in a sub-layer do not work and do not log. The workaround is to put them into the top layer instead. But this cannot be the permanent solution. A service request has not been opened yet.

During the lab installation we wanted to disable the policy on the gateway because a connection to a specific interface was not possible. So we did an "fw unloadlocal". After that a connection was still not possible. We discovered that the anti-spoofing settings were still applied to this interface. That seems weird to me.

Customer 8: SmartMove from Cisco ASA to R80.20

Environment:

  • security gateway cluster
    • 5800
  • security management server
    • Open Server

Problems:

No problems at all. Even though we used an older version of the ASA software than supported, we were able to import a policy; only NAT caused some trouble. We decided not to use the ASA policy structure and copy-pasted rules from one policy package to another, created new NAT rules and used the imported objects with the weird names to modify rules and groups. Given the poor potential of ASA for meaningful naming, SmartMove did a really great job helping us to migrate the customer. Big praise to Check Point for that.

Customer 9: R77.30 VSX with VSLS & MDM to R80.20

Environment:

  • 4 node VSX VSLS cluster
    • Open Server
  • many GAiA and embedded GAiA firewalls
    • not covered by this text
  • 2 multi-domain security management servers
    • Open Server

Problems:

We are at the beginning of the project. My colleague encounters funny messages after "migrate import" in the lab environment. After importing SmartConsole complains about objects with leading or trailing spaces. We are wondering why this is not checked with the pre-upgrade verifier.

Due to the complexity of this installation and the problems we experienced and still experience, I suggested to my colleague that we postpone this migration until we (and even Check Point!) better understand which problems arise from the migration. I am not quite sure whether management and sales share my view on it. ;-)

Some more words

At one customer I forgot to add the new CPM port (19009) to the policy before upgrading the management server. Stupid me! I accessed it through the firewall cluster, which had been no problem at all in my test lab. In the past I was able to get through a firewall with SSH forwarding to the firewall gateway and connecting to localhost. It seems that this is no longer possible, in order to prevent man-in-the-middle attacks. Not good for me in this case but a smart move anyway. (We had to patch a cable to the management network. The customer accepted this with a smile. Luckily.)

The policy verifier is a lot more efficient than before. Depending on the design and accuracy of the existing policy, it might take some time after upgrading until you get overlapping rules fixed and can push policy again. I appreciate the better policy verifier. But if you forgot to add port 19009 to your policy (see above) and need to push the policy soon, this can be unnerving.

Conclusions

I am not a friend of beating up on the supporters at TAC. They do a difficult job and have often covered our asses in the past. But I have to state that my impression is that they are struggling heavily with the problems arising from upgrades to R80.20. In one service request I felt they were playing games with me to gain time. I do not know how high the load at TAC is due to upgrade problems. But I heard of a partner meeting in Tel Aviv where other partners also complained massively about problems when upgrading to R80.20.

Our problem is the massive lack of information. Check Point must have learned from the problems by now. I cannot believe that no list of Dos, Don'ts and Caveats exists in Tel Aviv. But partners and customers do not know about it. And it seems that even TAC is not informed well enough to be able to help quickly and effectively. This would be a good time for Check Point to publish a Best Practices for Upgrades to R80.20 primer. Maybe there is one. Then I would like to learn about it.

I would suggest that Check Point modify the pre-upgrade verifier to recognize illegal spaces in object names and offer an option to eliminate them automatically during "migrate export". This is not rocket science. The spaces should never have made it into the policy in the past, and nobody should have to remove them by hand today.
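To illustrate how small such a check is, here is a minimal sketch in plain shell. It is not a Check Point tool: it assumes you already have the object names in a text file, one name per line (how you obtain that list is up to you), and both function names are made up for this example.

```shell
#!/bin/bash
# Report object names that start or end with whitespace.
# Assumption: names_file holds one object name per line.
find_padded_names() {
    local names_file="$1"
    # grep -n prefixes each hit with its line number; sed then wraps the
    # name in brackets so the stray spaces become visible in the output.
    grep -nE '^[[:space:]]|[[:space:]]$' "$names_file" \
        | sed -E 's/^([0-9]+):(.*)$/\1: [\2]/'
}

# Strip leading and trailing whitespace from names read on stdin.
trim_name() {
    sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//'
}
```

Running find_padded_names on such a list prints the line number and the bracketed name of every offender; piping the list through trim_name shows what the cleaned names would look like.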

What I have learned from all these upgrades is that you gain nothing by upgrading the security management server to R80.20 first and testing it with R77.x gateways. We did not experience any serious problem in this combination. All our problems started when it came to R80.x gateways. That said, I even suspect that in many cases the problem resides within the security management server itself but only takes effect with the new gateways.

At the moment, I have no confidence in the upgrade procedures for VSX and MDM, and even less in the combination of both! We had so much trouble with simple environments that I do not dare to move on to complex ones.

I would like to know about experiences from other upgrades, problems and solutions. As long as Check Point seems to put a non-disclosure policy on upgrade problems and does not deliver help and solutions, CheckMates is the best place to help each other. I am neither angry nor frustrated. Do not get me wrong. I work with Check Point every day. This is my job and I do it with love. But I think it is time for Check Point to speak with us about problems with and solutions for R80.20 upgrades. I would appreciate that very much – to deliver a better performance to Check Point's and our customers.

7 Replies

Re: Installation & Upgrade problems from R77.x to R80.x

Oliver Fink‌, thanks a lot for this elaborate collection of cases. We have asked relevant departments to look into all mentioned issues.

Re: Installation & Upgrade problems from R77.x to R80.x

Oliver Fink‌, thanks once more for this info. You should be hearing from several CP specialists shortly, so please do not be surprised. We want to do as much as possible to analyze each of your cases and SRs to make sure we address them properly on multiple levels of QA, development and TAC.

Oliver_Fink
Nickel

Re: Installation & Upgrade problems from R77.x to R80.x

Hi, Valeri.

That sounds good and I appreciate this very much. But I am in my last working hours of this year and out of the office at the moment, making a customer happy.

It would be fine if we can fix some things in 2019 together with Check Point.


Re: Installation & Upgrade problems from R77.x to R80.x

Whenever you are ready :-)

Oliver_Fink
Nickel

Re: Installation & Upgrade problems from R77.x to R80.x

We had partial success with customer 6. We got a shell script from TAC that deletes two entries from SQLite. After that and some pre-upgrade verifier cleanup we were able to import into an R80.20 security management server in the lab. The fun will continue in 2019 with a real-life upgrade for that customer. It is not a good idea to start this just before the Christmas holidays.

Re: Installation & Upgrade problems from R77.x to R80.x

Hey Oliver

We have the same or a similar kind of issue as customers 6 and 9.

Pretty frustrating really.

I already have a TAC case open, but it felt like they are buying time, as you said.

I've gone through the cpm_for_cpdb_xxxxx log and found a couple of things: a need to deserialize an object (which doesn't exist) and an object that is "null", as the "mds import" script of R80.20 or R80 is supposed to fix this kind of issue.

I have also uploaded the exported DB to TAC and requested that a tier 3 engineer investigate and fix this issue.

I wonder whether TAC gave you a custom script to run to fix the object(s).

If you could share what TAC told you about how to fix a non-existent object, that would be great!

But if it is a custom script, I wonder how many custom scripts they are writing or going to write for each failure :\

Alex

Oliver_Fink
Nickel

Re: Installation & Upgrade problems from R77.x to R80.x

TAC sent us a small shell script that is customized for specific object identifiers:

#!/bin/bash

#
# If MDSDIR is set, we are on a Multi-Domain server.
#
if [ "$MDSDIR" != "" ]; then
    #
    # On MDS the script must run under the mdsenv of a CMA,
    # in which case FWDIR differs from MDSDIR.
    #
    if [ "$MDSDIR" = "$FWDIR" ]; then
        # Send the error message to stderr.
        >&2 echo -e "This script must be run from a CMA environment. Run \n\tmdsenv yourCmaName\nand then rerun this command."
        exit 1
    fi
fi

# Delete the two entries for this object identifier and remember
# a failure of either deletion, not only of the last one.
exitCode=0
sqlite3 "$FWDIR/conf/new_security_rb.sqlite" "delete from anti_malware_rulebase_sections where UUID is '{B2CFDA5A-93D2-FE4D-AB90-C68199D52E91}';" || exitCode=$?
sqlite3 "$FWDIR/conf/new_security_rb.sqlite" "delete from rulebase_entity_local_instance_table where EntityUid is '{B2CFDA5A-93D2-FE4D-AB90-C68199D52E91}';" || exitCode=$?

if [ "$exitCode" -ne 0 ]; then
    >&2 echo "Operation did not succeed. Please contact support."
else
    echo "Done"
fi

exit $exitCode

Thus, I assume it is an individualized script based on a template they use for different customers. TAC still has to analyze your data to find out your object identifiers.
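To make the template idea concrete, here is a hypothetical sketch of how such a script could be parameterized. This is my own guess, not TAC's actual template: the function name is made up, and the function only prints the two statements for a given identifier instead of touching the database. The table and column names are the ones from the script we received; whether they apply to your case is something only TAC can tell you.

```shell
#!/bin/bash
# Dry run: print the two DELETE statements for one object identifier.
# Nothing is executed against any database here.
print_cleanup_sql() {
    local uuid="$1"
    # Accept only a braced UUID such as {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.
    local uuid_re='^\{[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}$'
    if [[ ! "$uuid" =~ $uuid_re ]]; then
        >&2 echo "Not a braced UUID: $uuid"
        return 1
    fi
    echo "delete from anti_malware_rulebase_sections where UUID is '$uuid';"
    echo "delete from rulebase_entity_local_instance_table where EntityUid is '$uuid';"
}
```

Called with the identifier from our case, it prints exactly the two statements that the script above passes to sqlite3.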

Sorry for the late answer. Your comment somehow slipped out of my focus.
