Cloudguard egress load balancing question
I am having difficulty understanding how traffic initiated from a VPC makes its way through Cloudguard to the Internet. My environment is Cloudguard IaaS in AWS, using Transit Gateway. Cloudguard is set up for auto-scaling and egress only, to provide outbound internet access to VPCs. We are running R80.40 on the gateways and the on-prem management server. The management station is reachable via AWS Direct Connect. With TGW we use BGP VPNs, etc.
[Client VPC]---[TGW]---[Security VPC with Cloudguard ASG]-->[Internet]
The minimum number of gateways in the auto-scaling group is 3, scalable to 6. Each gateway performs NAT using its private eth0 IP, and a second translation occurs at the instance level to the public IP associated with eth0:1, which is the address seen on the Internet.
These gateways are all independent. They don't share NAT IPs. Traffic passing through a gateway is NATed with the unique public IP associated with that gateway's eth0:1.
How can session stickiness be maintained for egress traffic through Cloudguard? We see a situation where a single host in the client VPC initiates two TCP connections to a single destination IP on the Internet. The server on the Internet expects these connections to all come from a single public IP. But the two sessions are being sent across different Cloudguard gateways, and therefore have different NAT IPs. The internet server rejects the second connection and the session breaks.
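The ECMP routing 5-tuple hash on the TGW is ultimately responsible for the outbound distribution of traffic in this case. Two connections from the same host to the same destination use different source ports, so they can hash to different gateways.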
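To make the effect concrete, here is a minimal sketch of hash-based flow distribution. It is not the TGW's actual algorithm, which AWS does not publish; the SHA-256 hash, the gateway names, the IP addresses, and the `pick_gateway` helper are all assumptions made for illustration. It compares a 5-tuple key with a hypothetical 2-tuple key (source IP and destination IP only) to show why only the latter keeps one host's connections on one gateway.

```python
import hashlib

# Hypothetical names for the three gateways in the auto-scaling group.
GATEWAYS = ["gw-a", "gw-b", "gw-c"]


def pick_gateway(src_ip, src_port, dst_ip, dst_port, proto, fields=5):
    """Map a flow to a gateway by hashing selected header fields.

    fields=5 hashes the full 5-tuple; any other value hashes only
    (src_ip, dst_ip). The hash function itself is an assumption.
    """
    if fields == 5:
        key = (src_ip, src_port, dst_ip, dst_port, proto)
    else:
        key = (src_ip, dst_ip)
    digest = hashlib.sha256(repr(key).encode()).digest()
    return GATEWAYS[int.from_bytes(digest[:8], "big") % len(GATEWAYS)]


# One client host opening 50 TCP connections to the same server and port.
# Only the ephemeral source port differs between the flows.
flows = [
    ("10.1.0.5", port, "203.0.113.10", 443, "tcp")
    for port in range(40000, 40050)
]

five_tuple_gateways = {pick_gateway(*flow) for flow in flows}
two_tuple_gateways = {pick_gateway(*flow, fields=2) for flow in flows}

print("gateways used with 5-tuple hash:", sorted(five_tuple_gateways))
print("gateways used with 2-tuple hash:", sorted(two_tuple_gateways))
```

With the 5-tuple key the connections spread over several gateways, and therefore several public NAT IPs. With the 2-tuple key every connection between that host and that server lands on the same gateway.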
If you need elasticity and auto-scaling, the Gateway Load Balancer (GWLB) might be an option moving forward. In some topologies it is paired with an upstream NAT gateway, which avoids this kind of issue.
You can read more about it here, in addition to the relevant AWS documentation:
Thank you Chris. It appears we should not have used Cloudguard in auto-scaling mode as an egress solution. Checkpoint's deployment guides (sk132552) don't actually depict ASGs for egress, but they don't discourage it either. In the on-prem world, I think it is reasonable to say that most Checkpoint firewall administrators would never set up multiple discrete firewalls (no clustering or state awareness between them) and spray connections across them with a 5-tuple hash, round-robin, or any other random method. That is exactly what is happening here, and there are applications that break as a result. We'll take a look at the GWLB. Thanks again.