Re: VSX: Limitation of number of routes

Christian_Riede · ‎2020-11-18

Hello,

After following several paths without success, here is an open letter to Checkpoint.

I have a confusing situation about routes on a 23800 based VSX.

Currently, we have about 2400 routes on our various perimeter firewall systems. Basically that's the number of networks on the internal side of the firewall. Some of these routes probably could be aggregated, but as these routes are generated by automation scripts, this would bring extra complexity into the scripts that I want to avoid.
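To make the aggregation option concrete, here is a minimal Python sketch using the standard library's `ipaddress.collapse_addresses`. The prefixes and next hops are made-up sample data, and the grouping by next hop is an assumption: two prefixes can only be merged safely if they point to the same gateway.

```python
import ipaddress
from collections import defaultdict

def aggregate_by_next_hop(routes):
    """Collapse adjacent or overlapping prefixes, but only among routes
    that share the same next hop. `routes` is a list of (prefix, next_hop)."""
    by_hop = defaultdict(list)
    for prefix, next_hop in routes:
        by_hop[next_hop].append(ipaddress.ip_network(prefix))
    result = []
    for next_hop, nets in by_hop.items():
        for net in ipaddress.collapse_addresses(nets):
            result.append((str(net), next_hop))
    return sorted(result)

# Hypothetical sample data, not taken from any real routing table.
sample = [
    ("10.1.0.0/24", "192.0.2.1"),
    ("10.1.1.0/24", "192.0.2.1"),
    ("10.1.2.0/24", "192.0.2.1"),
    ("10.1.3.0/24", "192.0.2.1"),
    ("10.2.0.0/24", "192.0.2.1"),
    ("10.2.1.0/24", "192.0.2.2"),  # different next hop, must stay separate
]

for prefix, hop in aggregate_by_next_hop(sample):
    print(prefix, "via", hop)
```

Here the four contiguous /24s collapse into one /22, while the two 10.2.x.0/24 networks stay separate because their next hops differ. Even this small case shows the bookkeeping an automation script would have to take on.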

From my point of view, 2400 routes is nothing. A Linux operating system can handle that without any problems. In fact, I have operated Linux systems with 400,000-plus routes without any problems. In our experience, the Checkpoint VSX gateway itself can also run with our 2400 routes without any problem. On the various ClusterXL gateways that we also have, even those running on very small appliances, everything is fine with 2400 routes as well.

But there is a big difference between ClusterXL and VSX when it comes to changing the routing table. On ClusterXL, a change is finished within seconds, even when it touches a large number of routes. On VSX, the situation is completely different. Changing a single route on VSX takes 5 minutes on some of our VSX clusters and even longer on others. That time is per Virtual System, so if we deploy the same route to several Virtual Systems, it takes a multiple of 5 minutes. It also does not matter whether we make the change via SmartConsole or via the vsx_provisioning command line tool. In the end, adding a single network on the internal side takes hours of routing changes across the several VSX systems. That's a totally crazy situation: we buy the biggest non-modular appliances you can get from Checkpoint, and adding a single route takes hours.

The reason for this bad performance is not in the firewall, but in the SmartCenter. While the routes are being updated, we see 100% CPU utilization by the fwm process, which brings the SmartCenter almost to a halt for the users connected via SmartConsole. New logins also fail because Authentication (Job for fwm..) times out while fwm is constantly busy with the routing update.

There is a feature called "route propagation", where a VS can propagate routes to other VSes. As there are slight differences in the routing tables of our gateways, we cannot use this. Our automation script can also guarantee consistency, so we have no need for it either.

We had a ticket on that. Result from R&D: It is like it is, use fewer routes.

We had a chat with our pre-sales SE. Answer from R&D until now: It is like it is, use fewer routes.

There is sk167353. The symptoms match. According to the sk, you can get a patch that accelerates things. When I asked for the patch, I got asked:

The fix is relevant in a topology where multiple VSs are connected to a single Virtual Switch. Is this the case here?
When the fix is installed, route propagation to VSs that are connected to the same Virtual Switch does not take place when pushing configuration, thus saving processing time. Will this be acceptable to you?

My answer was yes to both questions. I am now waiting for the fix; let's see what it brings. But I was told by Diamond support that this fix will probably never make it into a Jumbo. Welcome back to private hotfix hell. Been there, done that, don't want it back.

Lacking official information from Checkpoint, let me guess what is going on here at a technical level: when implementing the route propagation feature, the programmer made an extremely inefficient choice of algorithm to compare and update the routing tables of the virtual systems connected to a virtual switch whenever routing is updated on one of the connected virtual systems. The complexity of the algorithm being used is probably at least quadratic in the number of routes already existing. Also, these calculations are done even if route propagation is not used. The patch probably disables the route propagation feature completely, and the performance impact goes away.

So now, what do I expect from Checkpoint:

A clear communication: What is the maximum number of routes supported on a virtual system on VSX? Imho, a six figure number should be the target here.
What implementation times for route changes are to be expected on decent hardware?
Is my guess about the root cause of the problem correct? Otherwise, what is the root cause?
Please urgently implement route propagation efficiently. This is probably possible with a complexity of n log n instead of n^2, which is a big difference. I have written scripts to sync our IP address management to the Checkpoint routing table. Believe me, I know what I am talking about.
If the route propagation feature is not used, these checks should be disabled automatically and completely. And that change needs to go into the Jumbo, not into a private hotfix that is not intended to go into the Jumbo.
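To illustrate the complexity argument (this is not Checkpoint's code, whose internals are unknown here), a minimal Python sketch of two ways to compute which routes to add and remove when comparing a current table with a desired one. The route tuples are made-up sample data. The naive version scans a list for every membership test; the set-based version uses hashing and only sorts the result for stable output.

```python
def diff_quadratic(current, desired):
    """Naive comparison: every `in` test scans a list, roughly O(n * m)."""
    to_add = [r for r in desired if r not in current]
    to_remove = [r for r in current if r not in desired]
    return to_add, to_remove

def diff_hashed(current, desired):
    """Set-based comparison: expected O(n + m) for the set operations,
    plus a sort of the (usually small) difference."""
    cur, des = set(current), set(desired)
    return sorted(des - cur), sorted(cur - des)

# Hypothetical sample: routes as (prefix, next_hop) tuples.
current = [("10.0.0.0/24", "192.0.2.1"), ("10.0.1.0/24", "192.0.2.1")]
desired = [("10.0.1.0/24", "192.0.2.1"), ("10.0.2.0/24", "192.0.2.1")]

print(diff_quadratic(current, desired))
print(diff_hashed(current, desired))
```

Both functions return the same additions and removals. With a few thousand routes the naive version already performs millions of comparisons per table pair, and that cost multiplies if the comparison is repeated for every virtual system attached to a virtual switch.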

Any comments on that from Checkpoint? How should we proceed? The current situation where a single route update blocks our SmartCenter for hours is not acceptable and we need a clear statement how to proceed.

Any Comments from other customers? Do you experience the same? How do you deal with that situation?

Regards, Christian Riede

_Val_ · ‎2020-11-18

@Christian_Riede thanks a lot for your elaborate and professional write-up.

There is a lot of "magic" happening on the management side when VSX is being managed. VSX objects, unlike regular clusters, are not managed as a single object in a single domain. Instead, each parameter of a certain VS is treated as a "VS slot" and is spread between both "main" and "target" domains on the management.

When making a single change, you instigate a cascade of changes, resulting in provisioning commands being sent to the VSX physical GW from the "main" domain to get those changes executed for a certain VS.

You are right, thousands of static routes in a regular Unix/Linux environment is not a problem. And even on the VSX GW, it is not an issue to maintain those routes spread through a bunch of VRFs (which are your VSs).

The issue here is making a change, if I understand correctly. Complexity grows with the number of VSs and the number of routes, exponentially in each case. With my VSX customers, I have had a couple of similar cases, with very complex VSX architectures resulting in a nightmare of change management.

Like in your case, they were using VSX with Virtual Routers. In one of these cases it took me years to convince them to move to a VSLS architecture and push routing complexity to the network side, where it can be managed much more effectively. I understand this may not be the answer you are looking for. Still, I believe pushing complexity to the networking side is the right approach, although it will require some effort.

On the TAC related side, we will ask R&D to look into this. I believe, this is a very extreme case you are facing, but let us check internally if something can be done.

Christian_Riede · ‎2020-11-18

Yes, making a change is the problem here.

We are on VSLS. We are not using virtual routers.

Anything else we can do to ease the situation?

I cannot see how 2400 routes can be extreme. If managing 2400 routes is a problem for VSX, then VSX is the wrong product for this purpose. Is that what you want to say? For me, this setup is very basic: a perimeter firewall with some DMZs and routes to all the internal networks on the internal side. Not having supernet routes (10/8, 192.168/16, 172.16/12) was a recommendation from Checkpoint PS, and from a networking and security perspective I understand and support that recommendation.

Would be interesting to learn how many routes people have on their firewalls. Checkpoint-Customers, please comment here: How many Routes do you have?

JanVC · ‎2020-11-18

There is sk167353. The symptoms match. According to the sk, you can get a patch that accelerates things. When I asked for the patch, I got asked:

The fix is relevant in a topology where multiple VSs are connected to a single Virtual Switch. Is this the case here?
When the fix is installed, route propagation to VSs that are connected to the same Virtual Switch does not take place when pushing configuration, thus saving processing time. Will this be acceptable to you?
My answer was yes to both questions. I am now waiting for the fix; let's see what it brings. But I was told by Diamond support that this fix will probably never make it into a Jumbo. Welcome back to private hotfix hell. Been there, done that, don't want it back.

we bumped into the same issue in our VSX/MDS environment, however it was more extreme
changing 1 route/interface for 1 VS could take anywhere between 30 to 45 minutes
strangely the issue only appeared after upgrading the MDS from R77.30 to R80.20 (even after upgrading to R80.30 the issues persisted)

a TAC case and one month later, the verdict was what is now described in sk167353: install a custom hotfix which breaks route propagation

luckily for us we don't use route propagation
but we are now in private hotfix hell, where we need to ask for a portfix every time we install a new jumbo on the MDS

I don't have exact figures on the total amount of routes we have, but my estimate is 1000+ scattered over 50+ VS

seeing that you only need 5 minutes to push the routes for one vs, i don't have high hopes that the custom hotfix will help you in any way

_Val_ · ‎2020-11-19

I have probably misread your post. My assumption was, you use route propagation, which only makes sense in VR mode.

So, those 2400 networks, they are not directly attached to your VSs, are they?

Christian_Riede · ‎2020-11-19

Hello Val,

To your question: 95% of the 2400 routes are networks on the internal side of the firewall, reachable via the big core switches in the datacenter. Some of them are in DMZs and are also not directly attached. Directly attached are, let's say, about 5-20 networks per firewall.

_Val_ · ‎2020-11-19

That is what I thought. Push routing to DC networking side then, can you?

Alex- · ‎2020-11-18

I was once faced with a similar case where a lot of VSs needed to know hundreds of subnets in an evolving network. On top of this, the subnets couldn't be easily summarized, so change management was an ongoing challenge with either the Console or the VSX provisioning tool.

As a result, the decision was made to offload routing to a dynamic protocol (BGP in this case) and it runs like clockwork.

Christian_Riede · ‎2020-11-18

Yes, I have also thought about using dynamic routing. The big disadvantage is that if the routing protocol breaks down for some reason, all routes are lost until the routing protocol can reestablish the context/session/whatever and redistribute all routes. From the stability side, I see a big advantage in using static routes here. The routing table here is not very dynamic and changes are seldom (we typically do weekly updates of 1-5 routes, and this job runs for hours). From our IP address management I have a very clear overview of the needed routes and can easily push them via automation scripts. Also, we do not always have a device capable of routing protocols directly connected to the firewall.

Still cannot see why 2400 routes on a handful of VSes should be a problem for the "magic" of VSX if implemented properly.

_Val_ · ‎2020-11-19

Or, even better, in case of static routes, just push those routing decisions to the adjacent network routers before and after the VSX cluster.

Christian_Riede · ‎2020-11-19

OK, now what happens when someone in the LAN does a network scan on nonexisting addresses? Then the internal router forwards it via the default route to the firewall and the firewall routes it back via the 10.0.0.0/8 route to the core switch. Bad idea. And no, Anti spoofing does not help here.

_Val_ · ‎2020-11-19

How is it different from the current case? What do you actually gain by making a complex routing decision on the VS instead of an adjacent router? Unless you have any-any-accept policies all around your FWs, most unwanted traffic should be filtered out anyway, regardless of your routing instance.

Also, it may seem a bad idea to you personally, but in this specific case I am talking about a working architecture, proven in production and implemented by multiple customers over more than ten years. Just give me a tiny bit of trust here, please.

Chris_Atkinson · ‎2020-11-19

Sinkholing routes can be done on a router. I have seen many environments where unused networks are routed to Null0 to avoid the scenario described.
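To make the scenario and the Null0 remedy concrete, here is a minimal Python sketch that simulates longest-prefix-match forwarding between a core router and a firewall. The node names, prefixes, and the tiny topology are hypothetical and only illustrate the idea; this is not vendor configuration.

```python
import ipaddress

def lookup(table, dst):
    """Longest-prefix match: next hop of the most specific matching route."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(p), hop) for p, hop in table.items()
               if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def trace(tables, start, dst, ttl=4):
    """Follow next hops until the TTL runs out or a node without a table is reached."""
    path, node = [start], start
    while ttl > 0 and node in tables:
        node = lookup(tables[node], dst)
        if node is None:
            path.append("no route")
            break
        path.append(node)
        ttl -= 1
    return path

# Firewall carries only a supernet route back to the core (hypothetical).
firewall = {"10.0.0.0/8": "core", "0.0.0.0/0": "internet"}

# Core router without a sinkhole: unknown 10.x addresses follow the default route.
core_plain = {"10.1.0.0/24": "lan", "0.0.0.0/0": "firewall"}
# Core router with a sinkhole: unused 10.x space is discarded locally.
core_sink = {"10.1.0.0/24": "lan", "10.0.0.0/8": "null0", "0.0.0.0/0": "firewall"}

unused = "10.99.99.99"  # an address in no existing network
print(trace({"core": core_plain, "firewall": firewall}, "core", unused))
print(trace({"core": core_sink, "firewall": firewall}, "core", unused))
```

In the first trace the packet bounces between core and firewall until the TTL is exhausted; in the second, the more specific Null0 route on the core wins over the default route and the packet is dropped at the first hop.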

CCSM R77/R80/ELITE

Christian_Riede · ‎2020-11-19

Also, I restate my first two questions:

What is the maximum number of routes supported on a virtual system on VSX?
What implementation times for route changes are to be expected on decent hardware?

_Val_ · ‎2020-11-19

@Christian_Riede I thought we had discussed that already. Not an official answer, but I will try again.

There is no hard-coded limit. You are definitely pushing the boundaries, though, and I thought the goal of the discussion was to help you out with the situation at hand. If that is not the case, apologies.
