Re: What happens to long-term (long open) connecti...

Don_Paterson · ‎2023-09-21

What happens to long-term (long open) connections on a Scale In event where the connection/s are being handled by the gateway marked for termination, which is then terminated?

I checked the documentation and it is not clear to me:

This is what they have in there right now:

“Scale In

A scale in event occurs as a result of a decrease of the current load. When a scale in event triggers, Azure Autoscale designates one or more of the gateways as candidates for termination. The External Load Balancer stops forwarding new connections to these gateways, and Autoscale ends them. The Check Point Security Management Server detects that these CloudGuard Network Security Security Gateways are stopped and automatically deletes these gateways from its database.

Note - We recommend that you have at least two Security Gateways for redundancy and availability purposes.”

https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_VMSS_for_Azure/Content/Topics-Azure-V...

This is what I have sent in as feedback:
“This sentence does not seem to make sense:

" The External Load Balancer stops forwarding new connections to these gateways, and Autoscale ends them. "

It would help to understand the Azure and Check Point behaviour with regard to connection handling during Scale In events and deleted gateways.

One detail missing is the handling of long-term connections by the deleted gateway, and the connection possibly moving to another gateway where there is no synchronisation in the VMSS group.”

I wonder what is meant by "Autoscale ends them".

Can't test this now.

Any feedback or shared experience appreciated.

Don

PhoneBoy · ‎2023-09-21

The way I read this is: they die because the load balancer won’t forward the packets to a different gateway.
Even if the load balancer did, we don’t sync state information between gateways in this situation.

Don_Paterson · ‎2023-09-21

"we don’t sync state information between gateways in this situation" - Agreed. Done by design.

"they die because the load balancer won’t forward the packets to a different gateway."

Thoughts:

Azure seems to have some options which I need to look into.
This one does not seem to be well described. I can't find anything on it in their docs:
'Apply force delete to scale-in operations' (also see attachment/screenshot)

This one looks interesting but would it work for CloudGuard?

Terminate notification for Azure Virtual Machine Scale Set instances - Azure Virtual Machine Scale S...
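
For readers who have not seen the terminate notification before: as far as I understand it, the instance learns about a pending delete through the Azure Instance Metadata Service Scheduled Events endpoint, as an event with type "Terminate". The snippet below is only a minimal sketch of reading such a document in Python. The sample payload and the instance name `vmss-gw_3` are hypothetical, and whether anything like this can run on a CloudGuard gateway is exactly the open question above.

```python
import json
import urllib.request

# Scheduled Events endpoint of the Azure Instance Metadata Service.
# Only reachable from inside an Azure VM, so it is not called in this sketch.
EVENTS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(url=EVENTS_URL, timeout=5):
    """Fetch the Scheduled Events document (requires the 'Metadata: true' header)."""
    req = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def pending_terminations(doc):
    """Return the Terminate events found in a Scheduled Events document."""
    return [
        {
            "event_id": event.get("EventId"),
            "not_before": event.get("NotBefore"),
            "resources": event.get("Resources", []),
        }
        for event in doc.get("Events", [])
        if event.get("EventType") == "Terminate"
    ]


# Hypothetical sample payload, for illustration only.
SAMPLE = json.loads("""
{
  "DocumentIncarnation": 1,
  "Events": [
    {
      "EventId": "00000000-0000-0000-0000-000000000001",
      "EventType": "Terminate",
      "ResourceType": "VirtualMachine",
      "Resources": ["vmss-gw_3"],
      "EventStatus": "Scheduled",
      "NotBefore": "Thu, 21 Sep 2023 10:00:00 GMT"
    },
    {
      "EventId": "00000000-0000-0000-0000-000000000002",
      "EventType": "Reboot",
      "ResourceType": "VirtualMachine",
      "Resources": ["vmss-gw_1"],
      "EventStatus": "Scheduled",
      "NotBefore": "Thu, 21 Sep 2023 11:00:00 GMT"
    }
  ]
}
""")

for item in pending_terminations(SAMPLE):
    print(item["resources"], "terminates no earlier than", item["not_before"])
```

The window between the notification and `NotBefore` is the only time a gateway would have to let existing connections finish, so it bounds any draining that could be done this way.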

Maybe there is an Azure VMSS best practice or ATRG that I have missed, or maybe they don't exist but they want to 😉

Nothing CloudGuard in here:

https://support.checkpoint.com/results/sk/sk111303

The Admin Guide has lots of useful info but the Scale In section doesn't seem to have enough detail.
https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_VMSS_for_Azure/Content/Topics-Azure-V...

Maybe the Scale in policy can be configured to satisfy the draining of connections for a limited time.

But it would be good to hear from R&D on this.

This is good info too:
Autoscaling guidance - Best practices for cloud applications | Microsoft Learn

Cheers,

Don

Bryan-Smith · ‎2023-09-22

Hi @Don_Paterson - TCP flows and connection draining are all based on the standard Azure load balancer (az lb) health probe function. For example, if the az lb health probe that is configured for the backend pool marks a gateway as unhealthy then the default TCP timeout is 60 seconds. UDP flows would immediately move to a healthy gateway.

https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview

A probe down signal always allows TCP flows to continue until idle timeout or connection closure in a Standard Load Balancer.

https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#probe-down...

In order to ensure a timely response is received, health probes have built-in timeouts. The following are the timeout durations for TCP and HTTP/S probes:

TCP probe timeout duration: 60 seconds
HTTP/S probe timeout duration: 30 seconds (60 seconds for establishing a connection)

Additional TCP flow timer information:

Azure Load Balancer has a 4 minutes to 100 minutes timeout range for Load Balancer rules, Outbound Rules, and Inbound NAT rules.

By default, it's set to 4 minutes. If a period of inactivity is longer than the timeout value, there's no guarantee that the TCP or HTTP session is maintained between the client and your cloud service.

https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset
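
To make the idle timeout arithmetic concrete, here is a small Python sketch. It only encodes the numbers quoted above (4 to 100 minutes, default 4). The function name and the idea of checking an "idle gap" are my own illustration and not part of any Azure API.

```python
# Idle timeout limits quoted above for Azure Load Balancer rules.
MIN_IDLE_MINUTES = 4
MAX_IDLE_MINUTES = 100
DEFAULT_IDLE_MINUTES = 4


def idle_gap_exceeds_timeout(idle_gap_seconds, idle_timeout_minutes=DEFAULT_IDLE_MINUTES):
    """Return True if a flow stays silent for longer than the configured idle timeout.

    A True result means there is no guarantee the session is still
    maintained by the load balancer after that silent period.
    """
    if not MIN_IDLE_MINUTES <= idle_timeout_minutes <= MAX_IDLE_MINUTES:
        raise ValueError("idle timeout must be between 4 and 100 minutes")
    return idle_gap_seconds > idle_timeout_minutes * 60


# A connection that is silent for 5 minutes at a time:
print(idle_gap_exceeds_timeout(300))      # default 4-minute timeout
print(idle_gap_exceeds_timeout(300, 10))  # timeout raised to 10 minutes
```

With the default of 4 minutes, a long-open connection that goes quiet for 5 minutes is already at risk, independent of any Scale In event.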

HTH

Don_Paterson · ‎2023-09-27

That is great info. Thanks, Bryan!

Don_Paterson · ‎2023-11-21

Adding a note here after feedback from R&D via Gil Frantsus:

"This is the information we received from RnD: The Azure Load Balancer does not support connection draining, which means that the connection will be lost. However, the Azure Application Gateway does support it.

The use of a Gateway Load Balancer is supported and mentioned in the Azure VMSS admin guide.

Refer to sk170304 for instructions on how to enable connection draining within CloudGuard Network Security. I added the SK to the Azure VMSS admin guide.

For Azure Application Gateway with connection draining support refer to https://learn.microsoft.com/en-us/azure/application-gateway/features."

Preview of sk170304:
"Solution

As of Autumn 2020, the Azure network load balancer from Microsoft does not support "connection draining", where the load balancer stops assigning connections to a node (for example, in preparation for maintenance or reboot).

If you would like this feature to be added to the Azure load balancer, contact Microsoft or your Microsoft partner and request it."

And gateway commands used during manual drain:

"fw tab -t connections -s
fw ctl get int cloud_balancer_port
fw ctl set int cloud_balancer_port 0
fw tab -t connections -s"

As always, refer to the SK for full details and new updates, and TAC for assistance, and of course Microsoft for draining feature where required.
