How to do capacity/load testing of Active-Active azure firewalls

Kjohnt006 · ‎2024-10-17

Hi Team,

We have a pair of Check Point firewalls in Azure configured in an active-active setup. Our customer wants load/capacity testing to confirm that if either firewall goes down, the other will be able to process the traffic. I have a few concerns about this:

1> How do we perform this testing? Can we shut down one Azure instance and see how the other behaves?

2> Is there any risk involved? Is it recommended to perform this?

3> Are there any details we need to check before starting this task?

Any guidance would be appreciated.

PhoneBoy · ‎2024-10-17

Version/JHF?
What template did you deploy this from or what specific instructions did you follow?
Active/Active on cloud usually involves load balancers and auto-scaling.

Don_Paterson · ‎2024-10-17

You definitely need to get all of the details of the deployment before going ahead with anything along those lines, if you can.

You can check these two files for details of template/deployment:

/etc/cloud-version

/etc/cloud-version.json
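A minimal Python sketch for collecting the contents of those two files in one step, for example to attach to a change request. The helper name `read_deployment_info` is hypothetical, and nothing is assumed about the file contents: the `.json` file is parsed if it is valid JSON, otherwise both files are returned as plain text.

```python
import json
from pathlib import Path

DEFAULT_FILES = ("/etc/cloud-version", "/etc/cloud-version.json")


def read_deployment_info(paths=DEFAULT_FILES):
    """Return {path: contents} for each path; missing files map to None."""
    info = {}
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            info[str(raw)] = None
            continue
        text = path.read_text(errors="replace")
        if path.suffix == ".json":
            try:
                # Parse the JSON variant when possible.
                info[str(raw)] = json.loads(text)
                continue
            except json.JSONDecodeError:
                pass  # fall through and keep the raw text
        info[str(raw)] = text.strip()
    return info


if __name__ == "__main__":
    for name, contents in read_deployment_info().items():
        print(name, "->", contents)
```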

You need to know the deployment type, for example: vWAN, GWLB, HA or VMSS.

These are two references you can check.

https://www.checkpoint.com/downloads/products/cloudguard-gateway-performance-for-microsoft-azure-dat...

https://www.checkpoint.com/downloads/products/cloudguard-architecture-blueprint-diagrams.pdf

If it is an HA cluster, then only HA (Active-Standby) is supported, with no more than 2 instances.

If there is an ongoing heavy load for more than 5 minutes and it is a CloudGuard Scale Set with 2 instances (and the default scale-out configuration is applied), then a 3rd CloudGuard instance will be deployed to support the increased workload.

If the load continues to increase after the third gateway is deployed, then a 4th instance will begin to be deployed, and so on, until the maximum allowed instance count is reached or the load drops, which will start the scale-in events after 5 more minutes.
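To reason about how many instances a given load profile would produce, the scale-out and scale-in behaviour described above can be modelled roughly as follows. This is only a sketch: the 5-minute window comes from the description above, but the CPU thresholds (80% and 60%), the one-sample-per-minute granularity and the function name are assumptions for illustration, not the template's actual defaults. Check the autoscale settings on your own scale set.

```python
def simulate_autoscale(cpu_samples, start_instances=2, min_instances=2,
                       max_instances=4, scale_out_cpu=80.0, scale_in_cpu=60.0,
                       window=5):
    """Model sustained-load autoscaling.

    cpu_samples: average CPU percentage across the scale set, one per minute.
    Returns the instance count after each sample.
    """
    instances = start_instances
    high = low = 0  # consecutive minutes above / below the thresholds
    history = []
    for cpu in cpu_samples:
        high = high + 1 if cpu > scale_out_cpu else 0
        low = low + 1 if cpu < scale_in_cpu else 0
        if high >= window and instances < max_instances:
            instances += 1   # sustained heavy load: add one instance
            high = 0         # the window restarts after a scale event
        elif low >= window and instances > min_instances:
            instances -= 1   # sustained light load: remove one instance
            low = 0
        history.append(instances)
    return history


if __name__ == "__main__":
    # Ten minutes of heavy load followed by ten quiet minutes.
    print(simulate_autoscale([90] * 10 + [10] * 10))
```

The model ignores the time a new instance needs to boot and join the load balancer pool, so real capacity lags behind the counts it prints.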

If the instance maximum and default count are changed to 1, then the one remaining instance will continue to handle its connections and any new ones.

The connections of the other (deleted) instance or instances will likely be lost when they are directed to the remaining instance.

The VMSS instances do not sync (it is not a ClusterXL cluster), so when the network load balancer steers all connections to the remaining instance, that instance is likely to drop every connection that is not new and that was previously handled on the other CloudGuard gateway instance or instances.
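A toy Python simulation of why established connections are lost when one unsynced instance is removed. Everything here is a simplification for illustration: the `Gateway` class, the CRC32-modulo distribution (a stand-in for the load balancer's real hashing) and the connection strings are all hypothetical.

```python
import zlib


class Gateway:
    """A stateful gateway whose connection table is never synced to peers."""

    def __init__(self, name):
        self.name = name
        self.connections = set()

    def handle(self, conn_id, is_new):
        if is_new:
            self.connections.add(conn_id)
            return "accepted"
        # A mid-stream packet is only forwarded if this gateway saw the setup.
        return "forwarded" if conn_id in self.connections else "dropped"


def pick(pool, conn_id):
    """Deterministically map a connection to one gateway in the pool."""
    return pool[zlib.crc32(conn_id.encode()) % len(pool)]


def simulate_instance_loss(n_connections=1000):
    gw_a, gw_b = Gateway("gw-a"), Gateway("gw-b")
    conns = [f"10.0.0.{i % 250}:{10000 + i}->172.16.0.10:443"
             for i in range(n_connections)]
    for conn in conns:                      # establish across both gateways
        pick([gw_a, gw_b], conn).handle(conn, is_new=True)
    on_b = len(gw_b.connections)
    # gw-b is removed: every existing connection now lands on gw-a.
    results = [pick([gw_a], conn).handle(conn, is_new=False) for conn in conns]
    survived = results.count("forwarded")
    dropped = results.count("dropped")
    return survived, dropped, on_b


if __name__ == "__main__":
    survived, dropped, on_b = simulate_instance_loss()
    print(f"survived={survived} dropped={dropped} (were on removed gateway: {on_b})")
```

The dropped count always equals the number of connections that were on the removed gateway, which is the practical risk to flag before shutting an instance down in production.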

There is a procedure to simulate high CPU usage and trigger a scale out event.
https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_VMSS_for_Azure/Content/Topics-Azure-V...

In an ideal scenario these details are known and considered (along with region choices, known limitations and more) before the deployment.
In the real world, the cloud is a journey, for sure.

https://azure.microsoft.com/en-us/solutions/cloud-enablement/well-architected

Azure has its own limits that need to be considered. This link is in the PDF above:
https://learn.microsoft.com/en-us/azure/virtual-network/virtual-machine-network-throughput

"Flow limits and active connections recommendations

Today, the Azure networking stack supports 1M total flows (500k inbound and 500k outbound) for a VM. Total active connections handled by a VM in different scenarios are as follows.

VMs that belong to a virtual network can handle 500k active connections for all VM sizes with 500k active flows in each direction.
VMs with NVAs such as gateway, proxy, firewall can handle 250k active connections with 500k active flows in each direction due to the forwarding and more new flow creation on new connection setup to the next hop as shown in the above diagram.

Once this limit is hit, other connections are dropped."
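Before any failover test, a quick paper check against that limit is worthwhile. The Python sketch below takes the observed peak concurrent connections per gateway and estimates what each survivor would carry if one instance were lost, compared with the 250k NVA figure quoted above. The function name and the sample numbers are hypothetical, and it assumes traffic spreads evenly over the survivors once clients reconnect.

```python
# Active-connection limit for NVAs from the Azure documentation quoted above.
NVA_ACTIVE_CONNECTION_LIMIT = 250_000


def single_failure_check(peaks, limit=NVA_ACTIVE_CONNECTION_LIMIT):
    """Estimate per-survivor load if one of the instances is lost.

    peaks: peak concurrent connections observed on each instance.
    """
    if len(peaks) < 2:
        raise ValueError("need at least two instances to model a failure")
    total = sum(peaks)
    per_survivor = total / (len(peaks) - 1)  # assumes an even spread
    return {
        "per_survivor": per_survivor,
        "headroom": limit - per_survivor,
        "fits": per_survivor <= limit,
    }


if __name__ == "__main__":
    # Hypothetical peaks for a two-gateway deployment.
    print(single_failure_check([100_000, 120_000]))
    print(single_failure_check([150_000, 150_000]))
```

This only covers the Azure platform limit. The gateway's own CPU, memory and throughput for the chosen VM size still need to be checked against the performance datasheet linked earlier.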
