Our customer has an MDS server managing 15 CMAs. Each CMA has its own SG and SG cluster. There is a meshed global VPN community between the managed SGs/SG clusters, so there is an S2S VPN between those Check Point Security Gateway peers. Along with the MDS, a Multi-Domain Log Server (MLM) is installed. The MDS, the MLM and the SGs had all been running R80.40 with R80.40 JHF Take_158 until last Saturday, when the MDS server (including the MLM) was upgraded to R81.10 and R81.10 JHF Take_66 was installed on top of it. The SGs are still on R80.40 with R80.40 JHF Take_158 applied. All global policy assignments and policy installs were successful after the MDS upgrade.
However, after performing the global policy assignment (GPA) and policy install (PI) on the SG cluster and the separate MAB gateway (running on a different SG than the main SG cluster) belonging to the CMA that represents our customer's HQ, all C2S IPSec VPN connections started failing, with the VPN client showing "Site not responding". The MAB portal on the MAB gateway also started showing 'invalid user'.
We ran debugs on vpnd and cvpnd. In the vpnd output, the most interesting part is "I have no certificate to use for IKE". We do not understand how a successful upgrade could break this, if it is the root cause at all. On the cvpnd side, the one and only line we got during the debug was "Exception: Failed to initialize Encryption Key Pair of ICA gateway certificate - CVPND aborting". As far as we understand, this also points to a problem with the certificate.
Because the outage had a major business impact on our customer, we decided to revert to R80.40 after some debugging and evidence collection. After reverting the MDS and MLM to that version and performing a GPA and PI on the affected SG cluster and MAB gateway, the previously broken C2S connections started working again. It is important to note that no other changes were made to the deployment: simply upgrading the management server and then doing a GPA and PI on the SGs was enough to break the system.
Have you ever seen similar behavior before? Any comments would be appreciated.