Re: VSX cluster problem - are we alone?

Raphael_Cote · ‎2018-12-20

Hey Check Point community, I need to know if we are alone in the world in having so much difficulty implementing Check Point in VSX cluster mode.

Here's our setup: two 15600 appliances in VSX Load Sharing mode, 6 VSs and about 5000 users. We are using the FW, Anti-Bot, Anti-Virus, URL Filtering, SSL Inspection, and VPN blades. Pretty simple. Version 80.10 JHF 112.

The first time, we did the installation ourselves, but as we had many problems, Check Point sent their Professional Services here to do the installation because they thought we were the problem. A week after he left, the exact same problems came back. They sent a second PS engineer for another week without any results. It's been 15 months since we started the installation of Check Point, and for the last 8 months I have spoken almost daily with level 3 engineers to solve all the problems. After all this time, we still have many bugs. Here's a list:

- DNS problem (Firewall - Domain resolving error. Check DNS configuration on the gateway) - still a problem a year after opening a ticket
- Management console problem, the logs were not displaying. Had to reinstall the management from scratch.
- Update problem, corruption in the registry
- We see the VSX internal IP on our network, which according to the documentation we are not supposed to. Problem is still there and no one has been able to explain it to me yet.
- Identity Collector stops collecting data from the DC for 5 minutes intermittently. Never completely resolved; found a parameter to drop the outage to 1 min instead of 5. As we have 3 collectors, it's okay for us and isn't causing incidents, but...
- When we push our Internet security policy, we cause an outage to our TPV transactions. This was a crazy one. An allow rule was actually dropping traffic, but only when we pushed the policy. Had to add the blocked destination to the rule to solve the problem!
- Identity Awareness problem. This is by far our worst one. Random users lost their Internet access because the pepd process was choking, so important information about the user was missing. It took 8 months and countless hours to find a solution, a hotfix.
- Unable to update Anti-Bot, Anti-Virus or URL Filtering. There was a problem with Epoch time.
- Many problems with process crashes. We had core dumps for Fw_full, dnsd, pepd, fw_vsnumber. Some hotfixes were created to solve the problem.
- MUH agent on our server was disconnecting. Had to change a key in the registry.
- Had to change many parameters in fwkern.conf because the gateways were choking. This is not a bug as such, but the problem is that it's not documented anywhere how to fine-tune the box for 5000 users; even the PS didn't know that.
- UserCheck page problem, it wasn't displaying. It wasn't configured for 5000 users either; too many requests, so we had to change parameters in the httpd.conf file.
- SNMP traps we received were incomplete. Had to wait 4 months for a fix.
- RAD problem, the service stops responding (URL Filtering - Rad Service not available). The problem is still there; Check Point is supposed to upgrade their cloud during the holiday break.
- In the main page of the management, we see a red X saying Identity Awareness serious error for no reason
- In the main page of the management, we see a red X saying Anti-Bot db update fail
- As of now, our SSL inspection is not working well (Internal system error in HTTPS Inspection (Couldn't start inspection)). Our Internet access is slow as ....
- As of now, NTP synchronization has stopped working on our gateway. The configuration is there, but there is just nothing happening. It was working before but stopped all of a sudden.
- As of now, if I do a cpinfo -y all on my gateway, I can't see all the hotfixes that are installed on it. Problem with the build.

I'd like to tell you that it's exaggerated, but in fact I probably forgot some bugs that we had; this list is the strict minimum.

Is there someone who has pretty much this setup with it working well?

cstueckrath · ‎2018-12-20

We are running a similar setup:

2x 15400 VSX VSLS, 80.10 JHF T154

- DNS problem (Firewall - Domain resolving error. Check DNS configuration on the gateway) - still a problem a year after opening a ticket

Seeing this sometimes

- Management console problem, the logs were not displaying. Had to reinstall the management from scratch.

- Update problem, corruption in the registry

Did not have those

- We see the VSX internal IP on our network, which according to the documentation we are not supposed to. Problem is still there and no one has been able to explain it to me yet.

Same here. Annoying

- Identity Collector stops collecting data from the DC for 5 minutes intermittently. Never completely resolved; found a parameter to drop the outage to 1 min instead of 5. As we have 3 collectors, it's okay for us and isn't causing incidents, but...

No problems so far

- When we push our Internet security policy, we cause an outage to our TPV transactions. This was a crazy one. An allow rule was actually dropping traffic, but only when we pushed the policy. Had to add the blocked destination to the rule to solve the problem!
- Identity Awareness problem. This is by far our worst one. Random users lost their Internet access because the pepd process was choking, so important information about the user was missing. It took 8 months and countless hours to find a solution, a hotfix.
- Unable to update Anti-Bot, Anti-Virus or URL Filtering. There was a problem with Epoch time.
- Many problems with process crashes. We had core dumps for Fw_full, dnsd, pepd, fw_vsnumber. Some hotfixes were created to solve the problem.
- MUH agent on our server was disconnecting. Had to change a key in the registry.

Did not have those, either

- Had to change many parameters in fwkern.conf because the gateways were choking. This is not a bug as such, but the problem is that it's not documented anywhere how to fine-tune the box for 5000 users; even the PS didn't know that.

same here

- UserCheck page problem, it wasn't displaying. It wasn't configured for 5000 users either; too many requests, so we had to change parameters in the httpd.conf file.

What did you change? We see similar things (200 users...)

- SNMP traps we received were incomplete. Had to wait 4 months for a fix.
- RAD problem, the service stops responding (URL Filtering - Rad Service not available). The problem is still there; Check Point is supposed to upgrade their cloud during the holiday break.
- In the main page of the management, we see a red X saying Identity Awareness serious error for no reason
- In the main page of the management, we see a red X saying Anti-Bot db update fail
- As of now, our SSL inspection is not working well (Internal system error in HTTPS Inspection (Couldn't start inspection)). Our Internet access is slow as ....
- As of now, NTP synchronization has stopped working on our gateway. The configuration is there, but there is just nothing happening. It was working before but stopped all of a sudden.
- As of now, if I do a cpinfo -y all on my gateway, I can't see all the hotfixes that are installed on it. Problem with the build.

Yeah, some of this sounds familiar...

Raphael_Cote · ‎2018-12-21

Thanks for the reply, Christian. Check sk85040 for your UserCheck page problem.

PhoneBoy · ‎2018-12-20

Any TAC case(s) on the above issues?

Raphael_Cote · ‎2018-12-21

Yes, most of them have a TAC case. Do you want more detail?

PhoneBoy · ‎2018-12-21

Please send the TAC cases in PM.

Louis_Poulin · ‎2018-12-21

Thanks for the help Dameon!

And it'd still be interesting to know if you are aware of other deployments of this kind in the world. Like Raphael said, we are under the impression that we are pretty much alone… and most of the errors encountered seem to require new hotfixes to be resolved. So it's easy to feel like lonely guinea pigs.

PhoneBoy · ‎2018-12-21

Every setup is a little different and thus the issues may either be non-existent or different.

The issue with VSX "funny IPs" has come up a couple of times on CheckMates threads; the other ones I'm personally less familiar with.

Louis_Poulin · ‎2018-12-21

I totally understand that every setup is a little different and I agree.

Maybe I can ask my question in a different way. In organizations with more than 5000 users browsing the web, do you have an idea of how common a VSX cluster running R80.10 with all the aforementioned blades active is?

Are organizations still on R77.30? Or are they not using VSX? What is the most common setup these days?

JozkoMrkvicka · ‎2018-12-22

Did all of the mentioned issues start after upgrading to R80.10, or were they also present on R77.30?

What about R80.20?

Kind regards,
Jozko Mrkvicka

Raphael_Cote · ‎2019-01-03

As it's a new deployment, we started at version 80.10

Martin_Valenta · ‎2018-12-27

Which fwkern.conf parameters were suggested to be modified and why?

Raphael_Cote · ‎2019-01-03

Those parameters were changed because of general performance problems. I can't be more specific; I did it with the TAC and I don't have much more detail:

fwha_enable_state_machine_by_vs=1

fwha_freeze_state_machine_timeout=200

fwha_add_vsid_to_ccp_mac=1

fwha_forw_packet_to_not_active=1

fwmultik_input_queue_len=4096
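
For anyone comparing notes, here is a minimal sketch (not something from the TAC) of how the values above could be checked against the contents of a kernel-parameter file. It assumes the file is plain `name=value` lines; the function names and the sample text are hypothetical, and the values are simply the ones listed above for this site, not general recommendations.

```python
def parse_kernel_params(text):
    """Parse 'name=value' lines into a dict; blank or malformed lines are skipped."""
    params = {}
    for raw_line in text.splitlines():
        name, sep, value = raw_line.strip().partition("=")
        if sep and name.strip():
            params[name.strip()] = value.strip()
    return params


def diff_params(current, wanted):
    """Return {name: (current_value_or_None, wanted_value)} for every mismatch."""
    return {
        name: (current.get(name), value)
        for name, value in wanted.items()
        if current.get(name) != value
    }


# Values quoted in this thread; site-specific, not general recommendations.
WANTED = {
    "fwha_enable_state_machine_by_vs": "1",
    "fwha_freeze_state_machine_timeout": "200",
    "fwha_add_vsid_to_ccp_mac": "1",
    "fwha_forw_packet_to_not_active": "1",
    "fwmultik_input_queue_len": "4096",
}

# Hypothetical file contents, pasted in for illustration only.
SAMPLE = """
fwha_enable_state_machine_by_vs=1
fwmultik_input_queue_len=2048
"""

# Report every parameter that is missing or differs from the wanted value.
for name, (have, want) in diff_params(parse_kernel_params(SAMPLE), WANTED).items():
    print(f"{name}: have {have}, want {want}")
```

The comparison is on strings on purpose, so nothing is assumed about which parameters are numeric. It only reads text you paste in; it does not touch the gateway.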

Roy_Smith · ‎2018-12-31

Thanks for posting this. I have had similar issues with a 2 * 23500 VSX cluster with 11 * VS and 5000 users. We are also using all the blades you mention plus Application Control. We have had many of the issues that you mention. Many of the VS'es are small but we have 1 VS that is our Internet gateway and there are definite performance issues with it when there is an increase in the number of users. Case in point is the past week. During the holidays, we have less than half our users in offices and there have been no issues accessing Internet sites.

This was installed originally by a CP partner, who I do have complete faith in. However, even they were unaware of some fundamental "gotchas". For example, it took several weeks to discover that the VSs were all in 32-bit mode by default (Why???), and this is not documented very well. Switching to 64-bit obviously improved things considerably, but we still have various issues.

I have had various TAC calls open and almost always get pointed to an SK article to make a CLI tweak, in particular to fwkern.conf. I have also installed a newer JHF 3 times in the past 6 months at their recommendation. There seem to have been improvements, but next week will be the real test.

It is nice to know that I'm not alone with this.

Thanks

Roy

PhoneBoy · ‎2018-12-31

Just to be clear, not ALL of these issues are necessarily related to VSX.

The 32-bit VS issue is definitely VSX-specific and goes away in R80.20 since it is no longer possible to run VSes in 32-bit mode

Raphael_Cote · ‎2019-01-03

Let's keep in touch, Roy, over the next week to see how it goes. All I can say is that even with a bigger appliance model than ours you have the same problem, so the problem is more likely due to software than hardware. BTW we are pretty much like you: one big Internet VS with all the problems, and all the other small ones don't have them. During the holidays everything is fine, and when the load gets bigger next week the problems should reappear.

PhoneBoy · ‎2019-01-03

Is your vs bits also set to 32?

64 bit VS is definitely needed if you have a large VS (As is support for more than 10 cores in a VS).

Raphael_Cote · ‎2019-01-03

No we are at 64 bits

Roy_Smith · ‎2019-01-08

Hi Raphael

So far this week, everything is actually running great. I am now seeing the level of users, connections and traffic that I would expect with everyone back to work. Performance of the VS is fine and accessing internet sites is as snappy as I would expect. I'm hesitant to say the issues are resolved so will continue monitoring the situation.

One thing I did do, the week before the holidays, was to install JHF Take 154. It may be that there are some performance enhancements in the hotfix, which have helped things. I guess it's just a waiting game for the rest of the week

Roy

Raphael_Cote · ‎2019-01-08

Thanks for the follow-up, Roy! Unfortunately, on my side there is no improvement for my SSL problem: more than 100K errors today. I'll try to install a fix this evening; hopefully it will help. I can't install the latest JHF either, as we have 5-6 custom hotfixes to solve other problems that we had, so we are stuck. I'd also like to migrate to 80.20, but everybody is scared of it, even the TAC.

Chris_Atkinson · ‎2019-01-11

Raphael & Roy,

Sorry to hear you've experienced some challenges with your deployments.

Out of interest, in both of your environments, has RAD been tuned as follows?

* Increase RAD cache for more than 1,000 users (sk90422)

* Enable RAD connection reuse (sk103422)

CCSM R77/R80/ELITE

Louis_Poulin · ‎2019-01-22

So I guess that's it then; only a handful (3 or 4) of Check Point customers are running a setup similar to this one?

Hopefully, R80.20 will improve stability and performance on this particular setup.

Chris_Atkinson · ‎2019-01-22

Suffice it to say that probably not all of the relevant customers participate actively here, but I know of many.

The first rule of fight club... tends to apply a lot in security!

CCSM R77/R80/ELITE

Josh_Wilson · ‎2019-02-05

We have a very, very similar experience. Our environment is roughly 2400 users, VSX R80.10 on 5900 series appliances. I could copy/paste almost every one of your issues with only a slight variance. And just like others, the primary issues affect our web gateway VS running multiple blades. But the most noticeable impact to our users is due to HTTPS inspection failures/errors. We have had multiple cases open with CP and our current one is due to the loss of logging on the management server (suddenly cannot get the logs from the log server). And unfortunately, the problems continue to stack up as we receive errors trying to apply the latest JHFA. I'm very tempted to install 80.20....

Raphael_Cote · ‎2019-02-06

Just a follow-up. We installed version 80.20 approximately ten days ago with the latest JHF. I can't say it was a piece of cake, even though we had the support of the TAC doing it. At first we did an upgrade on the management and the gateways. Everything went fine for the management, but we had trouble with both gateways: VS 0 had a problem even though the upgrade finished successfully. Also, after the migration, we were unable to have both members of our cluster up at the same time, so we had to shut one down. Internet access kept going up and down while both members were up. Still not sure what the exact problem is... And we had a major VPN issue with our VSEC: the BGP route exchange wasn't working anymore, so we were unable to reach our cloud servers and services. It took 3 days and a major headache in my organization to finally find the problem; we had to add a specific route that wasn't necessary in 80.10. (And BTW, all the BGP configuration disappears when you do vsx_util reconfigure, important to know...)

We finally did a fresh install on both gateways, as we thought it would solve our problem. The fresh install worked fine and solved the VS 0 problem, but we still can't have both members up at the same time.

I'm not sure yet, but we have had 2 episodes of the Identity Awareness problem since the installation. The fix we had in 80.10 is supposed to be included, but I can't say for sure that it's working fine; I have doubts.

The RAD problem is also still there. We are working with Check Point on a hotfix to solve that, but it's not easy; it will be the fourth that we try!!

But the Internet speed has improved and we have fewer SSL errors, which was the main reason why we did the upgrade. It's tough to say, because we are not stable yet since the migration, but even though we had to go through a rough time, I think the new version will be helpful!

_Val_ · ‎2019-02-06

Concerning BGP settings, it is expected that those have to be re-introduced after vsx_util reconfigure. These are OS level changes, they are not captured on the management side

Roy_Smith · ‎2019-02-08

Since we installed the JHF, performance improved and after over a month, performance is still fine and internet access is much more stable and reliable. We still have odd issues, but these appear to be down to specific sites, which we would probably get anyway, so we are working through these.

I am planning to upgrade to R80.20 in a few months, once a few more JHF updates have been released. I do feel the pain regarding applied hotfixes causing issues with upgrades. We ended up in this state with our R77 environment, although it made sense to install R80.10 from scratch anyway. I am now loath to apply custom hotfixes. Perhaps the best option is to install the JHF updates on a more regular basis.

Christian_Riede · ‎2019-02-07

R77.30 ClusterXL, R80.10 VSX. Firewall, Application Control, IPS, SSL Inspection, Anti Malware, Threat Emulation, Identity Awareness (Captive Portal, Identity Agent, Kerberos). 23800 and 21800, thousands of users. Similar situation here. Lots of private hotfixes for critical problems that do not make it into the Jumbo. Lots of fine-tuning in all of the mentioned areas. Support tickets on those issues take a long time and only get solved after massive escalation. Check Point PS was very helpful.

Louis_Poulin · ‎2019-02-07

I'd be curious to hear feedback of people running the Check Point solution (R80.x) on physical gateways instead of virtual ones. Is it more stable?

From what I'm hearing here and there, it seems like the VSX mode is not as robust when you use all the blades and when you have more than 1000 users.

Philip_W · ‎2019-02-08

Our customer has been running R77.30 VSX/VSLS Openserver on HP servers without major issues (as far as I know - I haven't been working for them that long yet). I estimate they have 1000+ users, but only some blades are active.

We are planning to upgrade to R80.20 soon, as well as revise part of the overcomplicated current design. I'm going to suggest activating more blades too.
