Jan_Kleinhans
Advisor

User Poll: Experience with VSX machines and R81.20

Hello everyone,

We are running 2 VSX clusters under R81.20 on Check Point appliances (originally 15600/23800, now 19200). One of them has almost all blades enabled; the other runs only FW, IPS, IA and VPN. We run into new problems with practically every single JHF. Sometimes VPN no longer works properly, sometimes there are problems with HTTPS inspection, sometimes clustering no longer works correctly, sometimes packets are lost (without any log entry), etc. There have now been around 25 different cases in the last 2 years, and nearly all of them were problems that required a hotfix.

 


To the VSX users: What is your experience with VSX under R81.20? Are we the only ones who get almost every bug?

Jan

34 Replies
Chris_Atkinson
Employee

That's certainly not the rate of VSX-specific issues I'm accustomed to hearing about from my customers. 

For context, how early did you adopt R81.20, and from which JHF?

Has the environment been reviewed by Check Point Professional Services at all?

CCSM R77/R80/ELITE
Jan_Kleinhans
Advisor

We started with T41

Yes, PS has checked the environment several times, most recently 3 weeks ago. Everything is fine. 

As I mentioned, all problems have eventually been fixed by a hotfix. As anybody can see, every JHF contains a ton of bugfixes, and with nearly every JHF we seem to catch one of the problems that will only be fixed later.
For example:

PRHF-31092
sk182494

At the moment we cannot install policy, because every time we do we get distortions in MS Teams communication. 
Because of that, and because none of the debugs helped, we updated one member to T89. With T89 we have a new problem: one VS no longer monitors all of its VLAN interfaces and reports fewer cluster interfaces than it should. So we cannot verify whether T89 fixes the Teams problem.
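For reference, the interface view of the affected VS can be checked with the standard cluster commands (a minimal sketch, run from expert mode on the member; the VS ID is just a placeholder):

vsenv 5            # switch to the affected VS context (5 is a placeholder VS ID)
cphaprob -a if     # list the cluster interfaces this VS is currently monitoring
cphaprob state     # cluster state as seen by this VS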
The funny thing is that the Teams policy-install problem occurs on 2 VSX clusters with different configurations; one of them only has FW/IPS enabled.
But those are our problems. I wanted to start this thread to figure out whether other customers or partners see a similar number of problems.

Regards,

Jan

 

Chris_Atkinson
Employee

Thanks for your insights. sk182494 & PRHF-31092, as examples, are not specific to VSX.

As a point of interest, how is connection persistence configured for both systems?

In some rare scenarios sk182653 might be relevant.

For the cluster interface issue, are there differences in fwkern.conf parameters (sk92826) between members?
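A quick way to compare them (a rough sketch, assuming bash is the login shell on both members; hostnames are placeholders) would be something like:

ssh member-a 'cat $FWDIR/boot/modules/fwkern.conf' > a_fwkern.conf
ssh member-b 'cat $FWDIR/boot/modules/fwkern.conf' > b_fwkern.conf
diff a_fwkern.conf b_fwkern.conf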

CCSM R77/R80/ELITE
Jan_Kleinhans
Advisor

It doesn't have to be VSX-related, but we "feel" that VSX often makes problems more complicated. Maybe we would have the same issues if all machines were native.

It's "keep all connections". We didn't know about sk182653; I will keep an eye on it. The funny thing is that the packet loss in the MS Teams stream starts after SmartConsole already reports the policy installation as finished. 


There are no differences in fwkern.conf. Both machines worked until T89. There was a fix in T85 which has something to do with interfaces; maybe this fix causes our issue (PRHF-27989: After modifying a bond, the Monitored VLANs may disappear).
I already gave this clue to the engineer handling our case.

 

Thanks for your support.

VSX_Bernie
Contributor

Hello Jan,

I cannot really attest to R81.20, but we are running 5 VSX clusters on R81.10.
I do not think we have had as many as 25 different cases in the almost three years they have been running, but it is somewhere up there.

Many of the issues we encountered, I remember seeing in the SKs that it was fixed for both R81.10 and R81.20.

When you mention "sometimes VPN no longer works properly", I immediately think of sk182648, which we were affected by quite recently when we installed JHF 156 on one of the clusters. It would break all IKEv2 tunnels on every VS, because every phase 2 renegotiation would initially fail, causing downtime. I can see from the SK that R81 through R81.20 were affected by this.
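If anyone needs to confirm whether they are hit, the per-VS SA state can be inspected with the standard tunnel utility (a minimal sketch, run from expert mode; the VS ID is a placeholder):

vsenv 3      # switch to the affected VS context (3 is a placeholder VS ID)
vpn tu       # interactive tunnel utility - list the IKE / IPsec SAs for the affected peer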

I remember that we were about a year in on our first 3 clusters, before we came onto the first JHF that truly felt stable (think it was Take 78 or maybe 95).

So it felt like R81.10 was maybe 1.5 years old (or somewhere around that) before it had matured.
Even so, we have had several cases since the first stable JHF where we upgraded to a later JHF that contained bugs that needed a hotfix.

Unfortunately it has become quite regular for us to have to uninstall a custom hotfix every time we deploy a new JHF, because we almost always have custom hotfixes installed.

Once TAC even had to create a custom hotfix that would integrate with another hotfix we already had installed, because a JHF introduced multiple bugs that were business-breaking for us.

I don't remember many exact cases where only VSX was affected though - most of the issues were for all Quantum Gateways.

I don't know though - to us it seems like people who run VSX are just more affected by bugs? In all fairness, it may just be that we sometimes service larger environments, because we run many VSs as opposed to single GWs.

We are actually looking to upgrade to R81.20, due to the EOS of R81.10 in the summer of 2025, so your insights on R81.20 are greatly appreciated.

RamGuy239
Advisor

From my experience, the industry as a whole struggles more with bugs and quality assurance than before. I don't think there is an easy answer to why it has become like this, but one thing to keep in mind is how fast everything is moving these days compared to just a few years ago. And it's not only the vendors' fault, as everything surrounding the firewall is also constantly moving and evolving. Suddenly Microsoft releases a Windows Server patch with some RADIUS hardening, forcing firewall vendors to release patches to ensure RADIUS traffic keeps working, etc.

Just take a look at the Palo Alto and Fortinet communities. People tell you to stay far away from PanOS 11.x.x releases and to stay on 10.1, originally released back in 2021. Fortinet is the same: stay far away from FortiOS 7.6.0; if you are cutting-edge you might attempt 7.4.x, but otherwise stay with 7.0.x, also originally released back in 2021.

Fortinet is actively supporting three versions of FortiOS, Palo Alto is actively supporting five versions of PanOS (!), Check Point is actively supporting four versions of Gaia, soon to be three.

 

When it comes to VSX, things are changing quite a bit with R82 and VSnext.

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
Jan_Kleinhans
Advisor

We also had many issues with R81.10 before. I think R81.10 and R81.20 share a lot of code, so the same issues occur on both versions.
Thank you for sharing your experience. I see that we seem to have similar problems. As our cases often take a long time to be resolved, we sometimes think that we are the only ones with such problems.

Regards,

Jan

 

RamGuy239
Advisor

From my experience deploying a lot of Check Point installations, R80.40, R81, R81.10 and R81.20 all share much of the same code. The difference is the new features introduced with each new version, which, of course, are specific to that version and every version coming after.

If you look at the changelogs for the various Jumbo Hotfix Accumulator releases, you will notice they all share most of the same fixes. The same fixes show up in the JHF notes for R80.40, R81, R81.10 and R81.20, attesting to how much code they share and that they receive the same fixes. This also means that if fix A introduces problem B, it will most likely happen across all versions, since they share such similar code - unless the fix and the introduced bug are isolated to a feature that exists only in a later version.

R80.40, R81 and R81.10 all share the same kernel, and are all based on the same main Red Hat Enterprise version. There is a slight iteration with R81.20, but just a small one.

R82 is a new leap in kernel version, and is based on a new main Red Hat Enterprise version.

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
VSX_Bernie
Contributor

@RamGuy239 - I just want to say that I am fully with you there. I also do not think that it is entirely the vendors' fault. There is a high number of different circumstances causing this - but I also think a lot of it has to do with how fast things are moving. If the vendors' R&D divisions are forced to release updates more frequently than they would actually like, to fix different CVEs and implement important security features, then inevitably there will also be more bugs.

Also, you mentioned the RADIUS thing - just recently (I think it was the start of this month) Microsoft released a major update to Windows 11, which broke Endpoint VPN for many users:
https://support.checkpoint.com/results/sk/sk182749

So yeah - these sort of things happen quite frequently, which must make it hard for firewall vendors to keep up.

I concur - I have tried looking through the different versions' Take notes, and it is plain to see that the same PRHFs are mentioned across the board.


@Jan_Kleinhans - You are most certainly welcome. We sometimes have the exact same feeling, that we are the only ones really affected by both bugs and long-drawn-out TAC cases. It is nice to know we are not alone in this.

Thomas_Eichelbu
Advisor

Well, I must also admit we have extreme issues with VSX on Maestro in VSLS ...
not on all, but on some installations.

I feel we have run into every bug you can imagine, and we have also had at least 20 or more cases so far.

+ permanent performance issues and small traffic outages that nobody can find or diagnose
+ changing bond members caused all IPv6 routes to disappear
+ adding an interface to a VS causes the whole security group to reboot
+ often we have boot loops of SGMs
+ an SGM needs many, many reboots to come alive
+ strange messages in fwk.elg, and everybody says they are cosmetic
+ cpview shows totally incorrect values for interfaces, sometimes values higher than the physical connections
+ ASG alerts send fantasy values
+ permanent issues with IPv6
+ Dynamic Split switches between on/off
+ the Skyline plugin on the SGM was corrupt and triggered reboot loops

Installing and uninstalling hotfixes takes hours and hours.

Many, many issues have already been fixed; Check Point helps a lot, I must admit, and some TAC engineers really show commitment.
We also have advocacy support on board ... 
but it is really a mystery to me why things are running so badly here.

Great respect to our customer, who endures all of this!

VSX_Bernie
Contributor

@Thomas_Eichelbu - Can you give an example or two of the cosmetic messages from fwk.elg?
I think we have had some in our time as well - just curious if you see the same.

Thomas_Eichelbu
Advisor

Hello, 

Well, mostly it's this:

 

Sep 9 12:30:58 2024 HOSTNAME kernel:[fw4_0];fwmultik_prio_handle_gconn_lookup: gconn lookup failed for connection 41.6.1.32(805307664) -> 0.0.0.0(1426063360) IPP 443 instance 3
Sep 9 12:30:58 2024 HOSTNAME kernel:[fw4_0];fwkdrv_enqueue_data_user_ex: error in gconn lock and lookup. cannot enqueue to priority queues. (instance 3, opcode:7)

Some said it's cosmetic, some said we need to investigate;
a lot of rumors have been built on these messages.
The IPs you see are bogus - somehow they get created randomly and never correspond to anything we see in SmartLog or anywhere else.
We always see this message during high load, and then we feel outages.
This gconn issue has haunted us since the beginning, on R81.10 with low HFAs ... it is still there in R81.20 HFA 89, not as much as before, but still present.
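For anyone wanting to check whether these messages line up with load spikes, a rough sketch that counts them per minute (the fwk.elg path and timestamp format may differ depending on the VS context):

grep 'gconn lookup failed' $FWDIR/log/fwk.elg | awk '{print $1, $2, substr($3,1,5)}' | uniq -c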

Recently I have been seeing this ... but no time yet to dig deeper.

[7 Nov 9:52:37][fw4_0];[vs_1];fw_xlate_cliside_server failed
[7 Nov 9:52:38][fw4_0];[vs_1];fw_xlate_cliside_server failed
[7 Nov 9:52:51][fw4_0];[vs_1];fw_xlate_cliside_server failed
[7 Nov 9:52:59][fw4_0];[vs_1];fw_xlate_cliside_server failed

And all my fwk.elg files (on many, many customers) are full of:

[7 Nov  9:52:25][fw6_3];[vs_1];[2001:629:2600:666:1:5aff:fee4:7f89:48099 -> 2001:7fd::1:53] [ERROR]: malware_res_rep_match_dns_response: check_dns_response_activate() failed

[7 Nov  9:52:25][fw6_3];[vs_1];[2001:629:2600:666:1:5aff:fee4:7f89:48099 -> 2001:7fd::1:53] [ERROR]: cmik_loader_fw_context_match_cb: match_cb for CMI APP 10 - RESOURCE_REPUTATION failed on context 201, executing context 366 and adding the app to apps in exception

Yes, I know it is PRHF-35347 // https://support.checkpoint.com/results/sk/sk182606
The fix we received for it didn't work; these messages still fill up fwk.elg.
The info from the knowledge base is really cool:


"A missing attribute in the DNS server response causes a data failure." 

So it means all DNS servers in the entire world are missing an attribute, and only Check Point is aware of it?


 

VSX_Bernie
Contributor

Hello Thomas,

Thank you for sharing.
I do not remember that we have ever seen any of the messages described.

I also tried prying into the fwk.elg of one of our VSs that has a rather high load - none of these messages.
I do remember that we have seen some weird "cosmetic" messages at times, but none like these.

We run VSX VSLS on OpenServer though - so perhaps many of these are Maestro-specific?

I have to admit though - your issues sound really quite serious.
Right about now I feel quite happy that we are not running Maestro.

Regarding the SK sk182606 - did you notice that it only says "Quantum Appliances" in the products?
I realize that it might just be an error in the article, but looking into other SKs I see there is a separate "Quantum Maestro" product.


Thank you for the remark about the DNS failure though - it made me laugh 🙂

Henrik_Noerr1
Advisor

Hey Jan,

We have the exact same experience - many times wondering why we are the first to get hit by this 'new issue'.

We have a very large environment based on many Lenovo Open Servers all running VSX.

Some issues we have seen - not all VSX-specific:

- CoreXL Dynamic Balancing causing spontaneous reboots on an appliance

- high load on large VSX clusters (sk181891)

- CPD using 100% CPU on gateways, destroying SIC, blocking any policy install (lasted for 3-4 months before a fix)

- changing the funny IP range to a /20 causing all VSs to lose IP addresses - that was a fun night 🙂

- very, very long reboot times (better now in the newer Jumbos)

- deleting an interface in SMC causes *another* interface to be deleted

- installing policy causing high load with packet loss on a VS (until another policy push is done)

- FEC causing interfaces not to come online

- VPNs stop working if the traffic passes through another VS with SecureXL enabled

- deleting (non-monitored) VLANs causing failovers

- hit counter not returning correct values

- running cpinfo causing reboots

 

Some of the issues mentioned were never seen again, some fixes were folded into a JHF, for others we have private fixes that have been continuously portfixed to newer JHFs for I do not know how long, and lastly there are some operations we simply no longer perform - e.g. changing the VSX private IP range; we would rather spin up a new VS or buy a new cluster than risk a full cluster outage.

All of the above has in general eroded a lot of trust in the platform across the organisation. 

/Henrik

 

 

 

VSX_Bernie
Contributor

@Henrik_Noerr1 - You probably already have something for the FEC - but just in case:

We had the same issue when we tried to update to Take 110 of R81.10.
We found that the only viable solution was to change to FEC108 on the switches connecting the GWs.

We have not had issues since.

AmitShmuel
Employee

Hi Henrik, Dynamic Balancing is not supported on open servers - how come it causes spontaneous reboots?

Henrik_Noerr1
Advisor

Hey,

This specific issue was on an appliance (6500)

/Henrik

AmitShmuel
Employee

I see. Is it still relevant? I am familiar with most past issues related to this feature and I've never heard of something like that. I'd be happy to assist if needed.

JozkoMrkvicka
Authority

VSX was selected as "the future" due to cost / flexibility / scalability. Upper management wanted to save some money, so the logical move was to migrate XY physical clusters into 1 VSX box with dozens of VSs. You don't need to maintain and pay support for XY physical clusters, just one VSX cluster. Great move, you say ...

Well, we regret that now. There were/are huge issues with VSX on R81.10 and R81.20. Some fixes were integrated into JHFs after months of troubleshooting, some issues are still being investigated, and for some we got private portfixes.

One of the most ridiculous issues related to VSX is that, according to the Release Notes for R81.10 and R81.20, the maximum number of supported interfaces on VSX is 4096. This has proven to be wrong; in reality, the maximum supported number of interfaces on VSX is 1023. Every interface (VLAN) on VSX with an index higher than 1023 gets only a funny IP and not a real cluster IP.

Kind regards,
Jozko Mrkvicka
VSX_Bernie
Contributor

Hello Jozko,

Regarding the number of interfaces - did you remember to change the size of the cluster private net after configuring the VSX cluster object?:
vsx_util change_private_net

The default is a /22, which limits the subnet to 1024 addresses (not excluding network and broadcast).
When configuring a new cluster, we always change this to a /20 to accommodate 4096 addresses.

If you have not changed this, I believe you are being limited by the size of the subnet - not by the number of interfaces allowed.
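As a rough sanity check (not from the SK, just basic subnet arithmetic), the number of funny-net addresses for a given prefix length is 2^(32 - prefix):

prefix=22; echo $(( 1 << (32 - prefix) ))   # 1024 addresses with the default /22
prefix=20; echo $(( 1 << (32 - prefix) ))   # 4096 addresses with a /20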

The statement that MAX is 4096 is correct - this is because it is a limit of the VLAN technology.
I do not think that VSX supports VXLAN yet - maybe it will in VSNext.


I have to say though - I never quite understood why the default would be /22, as this does not make sense to me.

JozkoMrkvicka
Authority

Yes, the IPv4 funny subnet was changed to 192.168.96.0/20 to allow configuring the maximum possible number of VLANs per VS (256 VLANs). More info in sk99121.

It doesn't matter whether the funny subnet is changed or not. If it is left at the default of 192.168.196.0/22, you can create only 64 VLANs per VS. Another VS can have its own maximum of 64 VLANs; funny IPs can be the same for 2 different VSs.

To configure 4096 VLANs with the IPv4 funny subnet left at its default (192.168.196.0/22), you would need to configure 62 VSs, each having a maximum of 64 VLANs.

To configure 4096 VLANs with the IPv4 funny subnet changed to 192.168.96.0/20, you would need to configure 15 VSs, each having a maximum of 256 VLANs.

Try to configure more than 1024 VLANs on VSX and you will see that VLANs with interface indexes higher than 1023 won't work (no cluster VIP).

Kind regards,
Jozko Mrkvicka
VSX_Bernie
Contributor

Hello Jozko,

Wow - this is a real eye-opener - thank you for sharing this.
The 1024-interface limit is really concerning though - especially if there is no real documentation on it.

I think that perhaps we have (thankfully) just never hit the 1024 limit - because I certainly did not know about it.

VSX_Bernie
Contributor

Hello Jozko,
 
I just thought of something.
How many GWs do you have in your VSX Cluster?
 
I just gave the provided SK a read-through.
I have shortened the following:
"
maximal number of interfaces supported by a VSX Gateway / VSX Cluster Member is limited to 4096 interfaces
"
 
In my experience - sometimes Check Point formulates their documentation erroneously.
It happens to the best of us.
 
So - I am just spit-balling here - what if the above is meant to say that an entire VSX cluster is limited to 4096?
Meaning that 2 members would result in 2048 interfaces, and 4 members would result in 1024 interfaces.
JozkoMrkvicka
Authority

Only 2 members are part of the VSX cluster.

You can configure more than 1024 VLANs on VSX, but without a custom private portfix you will face issues.

Kind regards,
Jozko Mrkvicka
VSX_Bernie
Contributor

Well received and understood.
Did you then receive a custom portfix that enabled you to create more than 1024 VLAN interfaces?

JozkoMrkvicka
Authority

Yes.

Kind regards,
Jozko Mrkvicka
VSX_Bernie
Contributor

Huh.
I am curious - what did TAC and R&D have to say about this?

I mean - as you said - the documentation clearly states 4096, and I can't really find any SK mentioning the 1024.
But giving you a portfix for this is (for me at least) the same as admitting a fault/problem in the product.

I fully understand if you are not able to disclose this, but I am very curious.

JozkoMrkvicka
Authority

They admit it is a known bug, but are not willing to integrate the fix into a JHF for the supported versions. It looks like it is not a big problem for them, as there is only a very small number of reports from the field. Maybe there are no customers using VSX with more than a thousand VLANs.

Kind regards,
Jozko Mrkvicka
VSX_Bernie
Contributor

Thank you for sharing, Jozko.
This is quite unsettling, however.

VSX was introduced - what, like 21 years ago?
This has probably been an issue for a long time.


This is a prime example, though, of everything all of us in this forum have been discussing.
Namely that it feels like VSX is down-prioritized.
