Re: Questions about ElasticXL in production real customer experience

Tchangoloro

Hello guys, I'm currently working on a project where the customer runs Fortinet in HA Active/Active, with one firewall member deployed in each of two physically separate datacenters.

As part of a migration assessment to Check Point, I’d like to hear from those who are running ElasticXL in production environments:

How has stability and reliability been in real-world usage?

How is the day-to-day operation and troubleshooting compared to traditional ClusterXL Active/Active?

Any key caveats or limitations observed in customer environments?

Looking for practical, production-level feedback.

Bob_Zimmerman

Operations are basically the same as an ordinary ClusterXL cluster plus a cloning group. The only extra complication is that you always connect to the SMO, then connect from there to the specific member you want in order to dump tables or run debugs. Packet captures on the pivot member work, they just might look weird for traffic handled by other members. Debugs are a bit of a headache, since you can't really predict which member will handle a connection. You mostly end up running them on all members, collecting them, then throwing out what you don't need.
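The "run it everywhere, collect, throw out what you don't need" step can be sketched in a few lines of Python. This is a minimal sketch, assuming the debug output from each member has already been copied somewhere as plain text. The function name, the member names, and the sample lines are all made up for illustration; real debug output looks different.

```python
import re
from typing import Dict, List


def ip_pattern(ip: str) -> "re.Pattern[str]":
    # Match the address only as a whole token, so 10.0.0.5 does not match 10.0.0.50.
    return re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")


def filter_debug_lines(outputs: Dict[str, str], src: str, dst: str) -> List[str]:
    """Keep lines that mention both endpoints, tagged with the member they came from."""
    src_re, dst_re = ip_pattern(src), ip_pattern(dst)
    kept = []
    for member, text in sorted(outputs.items()):
        for line in text.splitlines():
            if src_re.search(line) and dst_re.search(line):
                kept.append(f"[{member}] {line}")
    return kept


# Hypothetical sample data, not real debug output.
outputs = {
    "member1": "drop 10.0.0.5 -> 192.0.2.10 reason=rule\ndrop 10.0.0.50 -> 192.0.2.10",
    "member2": "accept 10.0.0.7 -> 192.0.2.20",
}
for line in filter_debug_lines(outputs, "10.0.0.5", "192.0.2.10"):
    print(line)
```

Tagging each kept line with the member name matters here, since you can't tell in advance which member handled the connection.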

The internal connection to go from one member to another is SSH, authenticated with an unencrypted key in /home/admin/.ssh/id_rsa{,.pub}, with a unique key on each member added to all members' authorized_keys for the user named 'admin'. I haven't yet received an answer about how we can rotate the keys when an admin leaves the company. It's also a little weird to see RSA keys in use when ed25519 is right there.
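For anyone who wants to see which key types are actually trusted on a member, here is a minimal sketch that summarizes the contents of an authorized_keys file. It assumes the standard OpenSSH one-key-per-line format and that you have already read the file into a string. The sample keys and comments are made up, and this only reports what is there; it does not rotate anything.

```python
from typing import List, Tuple

# Public key type names used by OpenSSH in authorized_keys entries.
KEY_TYPES = {
    "ssh-rsa",
    "ssh-dss",
    "ssh-ed25519",
    "ecdsa-sha2-nistp256",
    "ecdsa-sha2-nistp384",
    "ecdsa-sha2-nistp521",
    "sk-ssh-ed25519@openssh.com",
    "sk-ecdsa-sha2-nistp256@openssh.com",
}


def summarize_authorized_keys(text: str) -> List[Tuple[str, str]]:
    """Return (key type, comment) for each key line, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok in KEY_TYPES:
                # Layout is: [options] keytype base64-key [comment]
                entries.append((tok, " ".join(tokens[i + 2:])))
                break
    return entries


# Hypothetical file contents with truncated, fake key material.
sample = """# cluster members
ssh-rsa AAAAB3Nza admin@member1
ssh-ed25519 AAAAC3Nza admin@member2
"""
for key_type, comment in summarize_authorized_keys(sample):
    print(key_type, comment)
```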

Updates are a headache. CPUSE runs in a special mode which copies files between members for you, but it also runs actions on all members at the same time. This means if you tell it to install a jumbo, you get a hard outage while the jumbo installs on all members at once. You can tell it to install only on member X, but that's not the default, and it makes updates more manual.
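The difference between "all members at once" and "one member at a time" is easiest to see as a loop. This is a minimal sketch of the rolling approach only. The install and is_healthy callables are hypothetical placeholders you would have to supply yourself; they are not CPUSE or CDT functions, and nothing here reflects how either tool works internally.

```python
from typing import Callable, Iterable, List


def rolling_update(
    members: Iterable[str],
    install: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Update one member at a time and stop at the first member that fails its check."""
    done: List[str] = []
    for member in members:
        install(member)  # only this member is out of service while it installs
        if not is_healthy(member):
            raise RuntimeError(
                f"{member} is unhealthy after the update; stopped with {done} updated"
            )
        done.append(member)
    return done


# Stand-in callables for illustration; real ones would talk to the members.
installed: List[str] = []
updated = rolling_update(
    ["member1", "member2", "member3"],
    install=installed.append,
    is_healthy=lambda member: True,
)
print(updated)
```

Stopping on the first unhealthy member is the point of the design: the remaining members keep passing traffic on the old version while you work out what went wrong.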

The cloning group and lightshot replication keep the members pretty strongly synchronized. It's still technically possible to do dumb things like adding a route on only one member, but it's not the default way of operating. clish warns you if you're working on only the local member instead of the whole cloning group.

Tchangoloro

Thanks for your reply. Based on the operational challenges you mentioned, would you choose ElasticXL over traditional ClusterXL Active/Active in production? In which scenarios does ElasticXL clearly make sense, and when would you avoid it?

Do the distributed debugging and update behaviors (unpredictable member handling, running debugs on all members, CPUSE updating multiple members at once) significantly impact MTTR or operational simplicity?

For environments that don’t require massive scale, would you generally recommend ElasticXL or ClusterXL, and why?

Bob_Zimmerman

First the good news: I'm confident the update thing can be worked around. I'm building a proper update script which I can run via CDT, but I don't have enough 3600 units to fully test it yet. Once I get a few more and can prove I handle more than two members correctly, updating a cluster without an outage should be a simple command on the management.

Then the bad news: debugging isn't likely to get much simpler, at least not for a while. I rarely need anything heavier than a wide-open drop debug to solve most issues, but collecting all the output is going to remain annoying. Not that much more annoying than on any other load-sharing cluster mode, I guess.

I think I would recommend people new to Check Point start with ElasticXL because I've dealt with a ridiculous number of outages attributable to the differences between the members' configs. Somebody adds a proxy ARP entry on one member but not the other (or with the wrong MAC on the other), adds a route only to one member, builds a new subinterface with a different netmask on each member ... ElasticXL avoids all of that. When stuff goes wrong, it's a little more annoying to get information, but there are fewer opportunities for things to go wrong.
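On a traditional cluster, the kind of drift described above can be caught by comparing the members' saved configurations line by line. This is a minimal sketch, assuming each member's configuration has already been exported as text. The function name and the sample lines are illustrative only.

```python
from typing import Dict, List


def config_drift(configs: Dict[str, str]) -> Dict[str, List[str]]:
    """Report, per member, the config lines that are not present on every member."""
    line_sets = {
        member: {line.strip() for line in text.splitlines() if line.strip()}
        for member, text in configs.items()
    }
    common = set.intersection(*line_sets.values()) if line_sets else set()
    return {
        member: sorted(lines - common)
        for member, lines in line_sets.items()
        if lines - common
    }


# Illustrative input: the second route exists on only one member.
configs = {
    "fw-a": "set static-route 10.1.0.0/16 nexthop gateway address 192.0.2.1 on\n"
            "set static-route 10.2.0.0/16 nexthop gateway address 192.0.2.1 on",
    "fw-b": "set static-route 10.1.0.0/16 nexthop gateway address 192.0.2.1 on",
}
for member, lines in config_drift(configs).items():
    print(member, lines)
```

Some lines legitimately differ between members (hostnames, member-specific addresses), so the output needs a human to read it rather than being treated as a list of errors.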

Just don't use the interface named Mgmt for anything after setup.

