Jan_Kleinhans
Advisor

User Poll: Experience with VSX machines and R81.20

Hello everyone,

We are running 2 VSX clusters under R81.20 on Check Point appliances (originally 15600/23800, now 19200). One of them has almost all blades enabled, the other one only FW, IPS, IA and VPN. We get new problems with practically every single JHF. Sometimes VPN no longer works properly, sometimes there are problems with HTTPS Inspection, sometimes clustering no longer works correctly, sometimes packets are lost without a log entry, and so on. We have had around 25 different cases in the last 2 years. Nearly all of them were problems that needed a hotfix.

 


To the VSX users: What is your experience with VSX under R81.20? Are we the only ones who get almost every bug?

Jan

11 Replies
Chris_Atkinson
Employee

That's certainly not the rate of VSX specific issues I'm accustomed to hearing from my customers. 

For context, how early did you adopt R81.20, and from which JHF?

Has the environment been reviewed by Check Point Professional Services at all?

CCSM R77/R80/ELITE
Jan_Kleinhans
Advisor

We started with T41

Yes, PS has checked the environment several times, most recently 3 weeks ago. Everything is fine.

As I mentioned, all problems have been fixed by a hotfix after a while. As anybody can see, every JHF contains a ton of bug fixes, and in nearly every JHF we seem to hit one of the problems that only gets fixed later.
For example:

PRHF-31092
sk182494

At the moment we cannot install policy, because every time we do, MS Teams communication gets distorted.
Because of that, and because none of the debugs helped, we updated one member to T89. In T89 we have a new problem: one VS no longer checks all VLAN interfaces and reports that it has fewer cluster interfaces. So we cannot check whether T89 fixes the Teams problem.
The funny thing is that the Teams policy install problem occurs on 2 VSX clusters with different configurations. One of them only has FW/IPS enabled.
But these are just the problems we have. I started this thread to find out whether other customers or partners see a similar number of problems.

Regards,

Jan

 

Chris_Atkinson
Employee

Thanks for your insights. The examples you give, sk182494 and PRHF-31092, are not specific to VSX.

As a point of interest, how is connection persistence configured on both systems?

In some rare scenarios sk182653 might be relevant.

For the cluster interface issue, are there differences in fwkern.conf parameters (sk92826) between members?
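
A minimal sketch of how such a comparison could be done off-box, assuming the fwkern.conf from each member has been copied to a workstation and that the file holds simple name=value lines. Treating lines that start with "#" as comments is an assumption of this sketch, and the parameter names below are placeholders, not real kernel parameters.

```python
def parse_kernel_params(text):
    """Parse name=value lines, skipping blanks and '#' lines (assumed comment syntax)."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def diff_kernel_params(text_a, text_b):
    """Return {name: (value_on_a, value_on_b)} for every parameter that differs.

    A parameter that is missing on one side is reported with None on that side.
    """
    a = parse_kernel_params(text_a)
    b = parse_kernel_params(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Placeholder file contents for illustration, not taken from a real gateway.
member_a = "example_param_one=1\nexample_param_two=0x10\n"
member_b = "example_param_one=1\nexample_param_two=0x20\nexample_param_three=1\n"

for name, (val_a, val_b) in diff_kernel_params(member_a, member_b).items():
    print(f"{name}: member A = {val_a}, member B = {val_b}")
```

An empty result means the two files set the same parameters to the same values. It says nothing about values changed at runtime without being written to the file.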

CCSM R77/R80/ELITE
Jan_Kleinhans
Advisor

It does not have to be VSX-related, but we "feel" that VSX often makes problems more complicated. Maybe we would have the same issues if all machines were native.

It is set to "Keep all connections". I did not know sk182653. I will keep an eye on it. The funny thing is that the packet loss in the MS Teams stream starts after SmartConsole already reports that policy installation has finished.


There are no differences in fwkern.conf. Both members worked until T89. T85 included a fix that has something to do with interfaces, and maybe that fix causes our issue (PRHF-27989: After modifying a bond, the Monitored VLANs may disappear).
I have already passed this clue to the support engineer on our case.

 

Thanks for your support.

VSX_Bernie
Contributor

Hello Jan,

I cannot really attest to R81.20, but we are running 5 VSX clusters on R81.10.
I do not think we have had as many as 25 different cases in the almost three years they have been running, but it is somewhere up there.

For many of the issues we encountered, I remember the SKs listing the fix for both R81.10 and R81.20.

When you mention "sometimes VPN no longer works properly", I immediately think of sk182648, which affected us quite recently when we installed JHF 156 on one of the clusters. It would break all IKEv2 tunnels on every VS, because every phase 2 renegotiation would initially fail, causing downtime. I can see from the SK that R81 through R81.20 were affected by this.

I remember that we were about a year in on our first 3 clusters before we got to the first JHF that truly felt stable (I think it was Take 78 or maybe 95).

So it felt like R81.10 was maybe 1.5 years old (or somewhere around that) before it had matured.
Even so, we have had several cases since the first stable JHF where we upgraded to a later JHF that contained bugs that needed a hotfix.

Unfortunately it has become quite regular for us to have to uninstall a custom hotfix every time we deploy a new JHF, because we almost always have custom hotfixes installed.

Once TAC even had to create a custom hotfix that would integrate with another hotfix we already had installed, because a JHF introduced multiple bugs that were business-breaking for us.

I don't remember many exact cases where only VSX was affected though - most of the issues were for all Quantum Gateways.

I don't know though - to us it seems like people who run VSX are just more affected by bugs? In all fairness, it may just be that we sometimes service larger environments, because we run many VSs as opposed to single GWs.

We are actually looking to upgrade to R81.20, due to the EOS of R81.10 in the summer of 2025, so your insights on R81.20 are greatly appreciated.

RamGuy239
Advisor

From my experience, the industry as a whole struggles more with bugs and quality assurance than before. I don't think there is an easy answer to why it has become like this, but one thing to keep in mind is how fast everything is moving these days compared to just a few years ago. And it's not only the vendors' fault, as everything surrounding the firewall is also constantly moving and evolving. Suddenly Microsoft releases a Windows Server patch with some RADIUS hardening, forcing firewall vendors to release patches to ensure RADIUS traffic keeps working, etc.

Just take a look at the Palo Alto and Fortinet communities. People are advising others to stay far away from PanOS 11.x.x releases and to stay on 10.1, originally released back in 2021. Fortinet is the same: stay far away from FortiOS 7.6.0; if you are cutting-edge you might attempt 7.4.x, but otherwise stay with 7.0.x, also originally released back in 2021.

Fortinet is actively supporting three versions of FortiOS, Palo Alto is actively supporting five versions of PanOS (!), Check Point is actively supporting four versions of Gaia, soon to be three.

 

When it comes to VSX, things are changing quite a bit with R82 and VSnext.

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
Jan_Kleinhans
Advisor

We also had many issues with R81.10 before. I think R81.10 and R81.20 share a lot of code, so the issues happen on both versions.
Thank you for sharing your experience. It looks like we have similar problems. As our cases often take a long time to be solved, we sometimes think that we are the only ones with such problems.

Regards,

Jan

 

RamGuy239
Advisor

From my experience deploying a lot of Check Point installations, R80.40, R81, R81.10 and R81.20 all share much of the same code. The differences are the new features introduced with each new version, which of course will be specific to that version and every version coming after.

If you look at the changelogs for the various Jumbo Hotfix Accumulator releases, you will notice they share most of the same fixes. The same fixes show up in the JHF notes for R80.40, R81, R81.10 and R81.20, which attests to how they share similar code and receive the same fixes. It also means that if fix A introduces problem B, this will most likely happen across all versions, unless the fix and the introduced bug are isolated to a feature that exists only in a later version.
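
As a rough illustration of that cross-check, here is a small sketch that pulls PRHF identifiers out of release-notes text and intersects them across versions. It assumes the notes have already been saved as plain text. The helper names and the sample strings are hypothetical placeholders, not real release-notes content.

```python
import re

# Matches identifiers of the form PRHF-<digits>, as quoted earlier in this thread.
PRHF_ID = re.compile(r"\bPRHF-\d+\b")


def prhf_ids(release_notes_text):
    """Collect the distinct PRHF identifiers mentioned in a block of text."""
    return set(PRHF_ID.findall(release_notes_text))


def shared_fixes(notes_by_version):
    """Return the PRHF identifiers that appear in every version's notes."""
    id_sets = [prhf_ids(text) for text in notes_by_version.values()]
    return set.intersection(*id_sets) if id_sets else set()


# Placeholder text for illustration only; not copied from real release notes.
notes = {
    "version A": "PRHF-00001 placeholder fix\nPRHF-00002 placeholder fix",
    "version B": "PRHF-00002 placeholder fix\nPRHF-00003 placeholder fix",
}

print(sorted(shared_fixes(notes)))
```

An identifier that shows up for every version is a hint that the fix, and any regression it brings, touches code those versions have in common.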

R80.40, R81 and R81.10 all share the same kernel, and are all based on the same main Red Hat Enterprise version. There is a slight iteration with R81.20, but just a small one.

R82 is a new leap in kernel version, and is based on a new main Red Hat Enterprise version.

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
VSX_Bernie
Contributor

@RamGuy239 - I just want to say that I am fully with you there. I also do not think that it is entirely the vendors' fault. Many different circumstances contribute to this, but I also think a lot of it has to do with how fast things are moving. If the vendors' R&D divisions are forced to release updates more frequently than they would actually like, to fix different CVEs and implement important security features, then inevitably there will also be more bugs.

You also mentioned the RADIUS thing - just recently (I think it was the start of this month) Microsoft released a major update to Windows 11, which broke Endpoint VPN for many users:
https://support.checkpoint.com/results/sk/sk182749

So yeah - these sorts of things happen quite frequently, which must make it hard for firewall vendors to keep up.

I concur - I have tried looking through the different versions' Take notes, and it is plain to see that the same PRHFs are mentioned across the board.


@Jan_Kleinhans - You are most certainly welcome. We sometimes have the exact same feeling, that we are the only ones really affected by both bugs and long-drawn-out TAC cases. It is nice to know we are not alone in this.

Henrik_Noerr1
Advisor

Hey Jan,

We have the exact same experience - many times wondering why we are the first to get hit by this 'new issue'.

We have a very large environment based on many Lenovo Open Servers all running VSX.

Some issues we have seen - not all VSX specific:

- CoreXL dynamic balancing causing spontaneous reboots on appliances

- high load on large VSX clusters (sk181891)

- CPD using 100% CPU on gateways, destroying SIC, blocking any policy install (lasted for 3-4 months before a fix)

- changing the funny IP range to a /20 causing all VSs to lose their IP addresses - that was a fun night 🙂

- very, very long reboot times (better now in the newer jumbos)

- deleting an interface in SMC causing *another* interface to be deleted

- installing policy causing high load with packet loss on a VS (until another policy push is done)

- FEC causing interfaces not to come online

- VPNs no longer working if the traffic passes another VS with SecureXL enabled

- deleting (non-monitored) VLANs causing failovers

- hit counter not returning correct values

- running cpinfo causing reboots

 

Some of the issues mentioned were never seen again and some were folded into a JHF. For others we have private fixes that have been continuously portfixed to newer JHFs for I do not know how long. And lastly, there are some operations we no longer perform, i.e. changing the VSX private IP range: we would rather spin up a new VS or buy a new cluster than risk a full cluster down.

All of the above has in general eroded a lot of trust in the platform across the organisation.

/Henrik

 

 

 

VSX_Bernie
Contributor

@Henrik_Noerr1 - You probably already have something for the FEC issue, but just in case:

We had the same issue when we tried to update to Take 110 of R81.10.
We found that the only viable solution was to change to FEC108 on the switches connecting the GWs.

We have not had issues since.

