Raphael Cote

VSX cluster problem - are we alone?

Discussion created by Raphael Cote on Dec 20, 2018
Latest reply on Jan 11, 2019 by Chris Atkinson


Hey Check Point community, I need to know if we are alone in the world in having so much difficulty implementing Check Point in VSX cluster mode.

 

Here's our setup: two 15600 appliances in VSX Load Sharing mode, 6 virtual systems (VS) and about 5000 users. We are using the FW, Anti-Bot, Anti-Virus, URL Filtering, SSL Inspection, and VPN blades. Pretty simple. Version R80.10, JHF 112.

 

The first time, we did the installation ourselves, but as we had many problems, Check Point sent their Professional Services (PS) here to do the installation because they thought we were the problem. A week after the PS engineer left, the exact same problems came back. They sent a second PS engineer for another week, without any results. It's been 15 months since we started the installation of Check Point, and for the last 8 months I have spoken almost daily with level 3 engineers to solve all the problems. After all this time, we still have many bugs. Here's a list:

 

- DNS problem (Firewall - Domain resolving error. Check DNS configuration on the gateway). Still a problem a year after opening a ticket.
- Management console problem: the logs were not displaying. Had to reinstall the management from scratch.
- Update problem: corruption in the registry.
- We see the VSX internal IP on our network, which according to the documentation we are not supposed to. The problem is still there and no one has been able to explain it to me yet.
- Identity Collector intermittently stops collecting data from the DC for 5 minutes. Never completely resolved; we found a parameter to reduce the outage to 1 minute instead of 5. As we have 3 collectors, it's okay for us and isn't causing incidents, but...
- When we push our Internet security policy, we cause an outage to our TPV (point-of-sale terminal) transactions. This was a crazy one. An allow rule was actually dropping traffic, but only when we pushed the policy. Had to add the blocked destination to the rule to solve the problem!
- Identity Awareness problem. This is by far our worst one. Random users lost their Internet access because the Pepd process was choking, so important information about the user was missing. It took 8 months and countless hours to find a solution: a hotfix.
- Unable to update Anti-Bot, Anti-Virus or URL Filtering. There was a problem with epoch time.
- Many problems with process crashes. We had core dumps for Fw_full, dnsd, pepd, fw_vsnumber. Some hotfixes were created to solve the problems.
- The MUH agent on our server kept disconnecting. Had to change a key in the registry.
- Had to change many parameters in fwkern.conf because the gateways were choking. This is not a bug as such, but the problem is that it's not documented anywhere how to fine-tune the box for 5000 users; even the PS didn't know that.
- UserCheck page problem: it wasn't displaying. It wasn't configured for 5000 users either; there were too many requests, so we had to change parameters in the httpd.conf file.
- The SNMP traps we receive were incomplete. Had to wait 4 months for a fix.
- RAD problem: the service stops responding (URL Filtering - Rad Service not available). The problem is still there; Check Point is supposed to upgrade their cloud during the holiday break.
- On the main page of the management console, we see a red X saying Identity Awareness serious error, for no reason.
- On the main page of the management console, we see a red X saying Anti-Bot db update fail.
- As of now, our SSL inspection is not working well (Internal system error in HTTPS Inspection (Couldn't start inspection)). Our Internet access is slow as ....
- As of now, NTP synchronization has stopped working on our gateway. The configuration is there, but there's just nothing happening. It was working before but stopped all of a sudden.
- As of now, if I run cpinfo -y all on my gateway, I can't see all the hotfixes that are installed on it. Problem with the build.

 

I'd like to tell you that this is exaggerated, but in fact I have probably forgotten some of the bugs we had; this list is the strict minimum.

 

Is there anyone who has pretty much the same setup and has it working well?
