Hello,
After following several paths without any success, here is an open letter to Checkpoint.
I have a confusing situation regarding routes on a 23800-based VSX.
Currently, we have about 2400 routes on our various perimeter firewall systems. Basically that's the number of networks on the internal side of the firewall. Some of these routes probably could be aggregated, but as these routes are generated by automation scripts, this would bring extra complexity into the scripts that I want to avoid.
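For readers wondering what the aggregation mentioned above would involve, here is a minimal sketch using Python's standard `ipaddress` module. The function name `aggregate_routes` and the `(network, next_hop)` tuple format are assumptions made for illustration, not part of our actual scripts. It also assumes IPv4-only input and no overlapping routes with different next hops, because collapsing could otherwise change longest-prefix-match behavior.

```python
import ipaddress
from collections import defaultdict

def aggregate_routes(routes):
    """Collapse adjacent or overlapping networks that share a next hop.

    routes: iterable of (network_cidr, next_hop) string pairs (hypothetical format).
    Assumes a single IP version and no overlapping routes with different next hops.
    """
    by_hop = defaultdict(list)
    for cidr, next_hop in routes:
        by_hop[next_hop].append(ipaddress.ip_network(cidr))

    result = []
    for next_hop, nets in by_hop.items():
        # collapse_addresses merges networks only where their union is exact
        for net in ipaddress.collapse_addresses(nets):
            result.append((str(net), next_hop))
    return sorted(result)

routes = [
    ("10.0.0.0/24", "192.0.2.1"),
    ("10.0.1.0/24", "192.0.2.1"),
    ("10.0.2.0/24", "192.0.2.2"),
]
aggregated = aggregate_routes(routes)
```

Even this small version shows the extra bookkeeping: the aggregated prefixes no longer map one-to-one to the networks in the source data, which is the complexity I want to keep out of the scripts.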
From my point of view, 2400 routes is nothing. A Linux operating system can handle that without any problems. In fact, I have operated Linux systems with more than 400,000 routes without any problems. In our experience, the Checkpoint VSX gateway itself also runs with our 2400 routes without any problem. On the various ClusterXL gateways that we also have, everything is fine with 2400 routes as well, even on very small appliances.
But there is a big difference between ClusterXL and VSX when it comes to changing the routing table. On ClusterXL, a change is finished within seconds, even when a large number of routes changes. On VSX, the situation is completely different. Changing a single route takes 5 minutes on some of our VSX clusters and even longer on others. That time is per virtual system, so if we deploy the same route to several virtual systems, it takes a multiple of 5 minutes. It also does not matter whether we make the change via SmartConsole or via the vsx_provisioning command line tool. In the end, adding a single network on the internal side takes hours of routing changes across the several VSX systems. That's a totally crazy situation: we buy the biggest non-modular appliances you can get from Checkpoint, and adding a single route takes hours.
The reason for this bad performance is not in the firewall, but in the SmartCenter. While the routes are being updated, we see 100% CPU utilization by the fwm process, which brings the SmartCenter almost to a halt for users connected via SmartConsole. New logins also fail because Authentication (Job for fwm..) times out while fwm is constantly busy with the routing update.
There is a feature called "route propagation", where a VSX can propagate routes to other VSes. As there are slight differences in the routing tables of our gateways, we cannot use it. Our automation script can also guarantee consistency, so we have no need for it either.
We had a ticket on that. Result from R&D: It is like it is, use fewer routes.
We had a chat with our pre-sales SE. Answer from R&D so far: It is like it is, use fewer routes.
There is sk167353. The symptoms match. According to the sk, you can get a patch that accelerates things. When I asked for the patch, I got asked:
- The fix is relevant in a topology where multiple VSs are connected to a single Virtual Switch. Is this the case here?
- When the fix is installed, route propagation to VSs that are connected to the same Virtual Switch does not take place when pushing configuration, thus saving processing time. Will this be acceptable to you?
My answer was yes to both questions. I am now waiting for the fix; let's see what it brings. But: I was told by Diamond support that this fix will probably never make it into a Jumbo. Welcome back to the private hotfix hell. Been there, done that, don't want it back.
Lacking official information from Checkpoint, let me guess what is going on here on a technical level: When implementing the route propagation feature, the programmer made an extremely inefficient choice of algorithm to compare and update the routing tables of the virtual systems connected to a virtual switch whenever routing is updated on one of the connected virtual systems. The complexity of the algorithm being used is probably at least quadratic in the absolute number of routes already existing. Also, these calculations are done even if route propagation is not used. The patch probably disables the route propagation feature completely, and the performance impact goes away.
So now, what do I expect from Checkpoint:
- A clear communication: What is the maximum number of routes supported on a virtual system on VSX? IMHO, a six-figure number should be the target here.
- What implementation times for route changes are to be expected on decent hardware?
- Is my guess about the root cause of the problem correct? Otherwise, what is the root cause?
- Please urgently implement route propagation efficiently. This is probably possible with a complexity of n log n instead of n^2, which is a big difference. I have written scripts to sync our IP address management to the Checkpoint routing table. Believe me, I know what I am talking about.
- If the route propagation feature is not used, these checks should be disabled automatically and completely. And that change needs to go into the Jumbo, not into a private hotfix that is not intended to go into the Jumbo.
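To make the n log n claim concrete, here is a minimal sketch of comparing two routing tables by sorting and merging. It is not Checkpoint's code and not our production script; the function name `diff_routes` and the `(network, next_hop)` tuple format are assumptions for illustration, and it assumes all routes are of one IP version. Sorting costs O(n log n), and the merge that follows is linear.

```python
import ipaddress

def _normalize(routes):
    # Deduplicate and sort; network objects compare by address, then netmask.
    return sorted({(ipaddress.ip_network(cidr), hop) for cidr, hop in routes})

def diff_routes(current, desired):
    """Return (to_add, to_remove) as lists of (network_cidr, next_hop) pairs."""
    cur, des = _normalize(current), _normalize(desired)
    to_add, to_remove = [], []
    i = j = 0
    # Single linear pass over both sorted lists.
    while i < len(cur) and j < len(des):
        if cur[i] == des[j]:
            i += 1
            j += 1
        elif cur[i] < des[j]:
            to_remove.append(cur[i])  # present now, no longer wanted
            i += 1
        else:
            to_add.append(des[j])     # wanted, not yet present
            j += 1
    to_remove.extend(cur[i:])
    to_add.extend(des[j:])

    def as_text(items):
        return [(str(net), hop) for net, hop in items]

    return as_text(to_add), as_text(to_remove)

current = [("10.0.0.0/24", "192.0.2.1"), ("10.0.1.0/24", "192.0.2.1")]
desired = [("10.0.1.0/24", "192.0.2.1"), ("10.0.2.0/24", "192.0.2.1")]
to_add, to_remove = diff_routes(current, desired)
```

A route whose next hop changes shows up as one removal plus one addition. A hash-based set difference would work just as well; the point is only that comparing two tables does not require checking every route against every other route.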
Any comments on that from Checkpoint? How should we proceed? The current situation where a single route update blocks our SmartCenter for hours is not acceptable and we need a clear statement how to proceed.
Any comments from other customers? Do you experience the same? How do you deal with that situation?
Regards, Christian Riede