FrederikSuijs

AWS Autoscaling TGW deployment with DirectConnect and On-Prem Mgmt

Intro:

Hello,

This is my first post on CheckMates. As a quick intro: I'm an IT professional with 15 years of experience, focused on datacenter network design, SDN, cloud-native networking, and automation. You can find me on LinkedIn: https://www.linkedin.com/in/frederik-suijs-1102b8127/
During the last year I have been working on a CloudGuard IaaS solution on both Azure and AWS, based on the Checkpoint blueprint: https://www.checkpoint.com/downloads/product-related/check-point-secure-cloud-blueprint.pdf. I have followed the general architecture and built the Hub-Spoke model with an inbound ASG for ingress traffic and an outbound ASG for egress + East-West traffic. For connectivity to our On-Prem DC we rely 100% on DirectConnect, which is directly attached to the Transit Gateway. From a high-level perspective, all connections to and from On-Prem workloads should be considered East-West traffic, so effectively the On-Prem DC can be seen as just another Spoke.
To deploy my Hub-Spoke model, I have modelled the complete setup in a couple of Terraform scripts, so I can create and destroy it in a matter of minutes (and I have done so at least 20 times).

I will try to write down some of the thought processes and indirect problems I have encountered, and I hope this post can help somebody else. To get the most out of it, you should have some basic knowledge of the architecture and idea behind the blueprint, good knowledge of Checkpoint Gateways, a good understanding of AWS networking constructs (VPC, TGW, DX, ...) and some basic scripting affinity.

A lot of documentation for this design can be found online, on CheckMates, on YouTube, etc., and a full configuration guide is available at https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_CloudGuard_AWS_Transit_Gateway/Conten... so I will not write a step-by-step implementation guide. Instead I will focus on what I was still struggling with during my own deployment, especially when combining this design with an On-Prem Mgmt Server and DirectConnect.

After a lot of troubleshooting and deep-dive sessions with some of my good friends at Checkpoint, I was able to pinpoint and solve the following two major issues:

  • The Mgmt Server cannot be behind a VPN tunnel
  • Source Hide NAT uses the VTI

I will explain both in detail below.

On-Prem Mgmt Server behind VPN Tunnel:

When it comes to an On-Prem Mgmt server, the documentation becomes a bit fuzzier:
https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_CloudGuard_AWS_Transit_Gateway/Conten.... While there is a nice guide on how to configure this when using a VPN to connect to On-Prem, it didn't really tick all the boxes for me. For one, I'm using DirectConnect; and two, I'm treating my On-Prem connectivity as if it is a spoke.

Setup/Assumptions:
DirectConnect is properly configured.
On-Prem routing is established.
Checkpoint Gateways VPC + Subnets (in different AZ's) have been created.
Transit Gateway is created; the Checkpoint VPC and DirectConnect are attached to it, and the proper Spoke (=On-Prem) and Checkpoint route tables, associations, propagations, and x-chkp-vpn tagging have been created.
Effectively, IP connectivity is available between the CP Mgmt Server and an ENI in the Checkpoint VPC.

Problem:
During the deployment using autoprov-cfg, following along with tail -f /var/log/CPcme/cme.log gives you a good understanding of what is actually happening. In the discovery phase and initial policy push, everything works fine. The problem occurs in the next phase, which configures the VPN between the TGW and the gateway:

What I observe is the following: as soon as the VPN tunnels come up, BGP starts to negotiate and exchange routes. This is inherent to the design and a prerequisite for it to work. The routing from On-Prem towards the Gateway itself is not the problem per se. Remember, we already assume proper routing is in place: the DirectConnect attachment is associated with the Spokes route table, and there is a propagation of the Checkpoint VPC, so that route is fine.
For the return route, however, a routing change occurs.
Without the VPNs+BGP, the routing table looks as follows:

 

gw-806eec> show route 
S 0.0.0.0/0 via 10.10.240.1, eth0, cost 0, age 515105 -> DG pointing to VPC/Subnet IP of AWS 
C 10.10.240.0/28 is directly connected, eth0 -> This is the CIDR of the subnet the Gateway is deployed in 
C 54.93.XXX.XXX/32 is directly connected, eth0 -> Secondary Public Interface 
C 127.0.0.0/8 is directly connected, lo

 

From the Gateway's perspective, traffic to the On-Prem Mgmt server (which is running in some RFC1918 subnet) is simply routed to its default gateway.
In AWS this means it reaches the VPC router, which checks the VPC routing table. There is a 0.0.0.0/0 route towards the TGW. The Checkpoint VPC attachment is associated with the Checkpoint TGW route table, where the DirectConnect routes are propagated, including the RFC1918 range. So until now, return traffic can happily flow.

What happens after the VPN and BGP come up:

 

gw-806eec> show route 
S 0.0.0.0/0 via 10.10.240.1, eth0, cost 0, age 515105 -> DG pointing to VPC/Subnet IP of AWS 
B 10.0.0.0/8 via 169.254.217.69, vpnt1000, cost None, age 267556 -> RFC1918 coming indirectly from DirectConnect 
B 10.10.240.0/27 via 169.254.217.69, vpnt1000, cost None, age 267556 -> VPC CIDR readvertised since propagated in Spokes RT 
C 10.10.240.0/28 is directly connected, eth0 -> This is the CIDR of the subnet the Gateway is deployed in 
C 54.93.XXX.XXX/32 is directly connected, eth0 -> Secondary Public Interface 
C 127.0.0.0/8 is directly connected, lo 
C 169.254.217.69/32 is directly connected, vpnt1000 -> Directly connected Tunnel Interface 
C 169.254.217.70/32 is directly connected, vpnt1000 -> Directly connected Tunnel Interface 
C 169.254.239.193/32 is directly connected, vpnt1001 -> Directly connected Tunnel Interface 
C 169.254.239.194/32 is directly connected, vpnt1001 -> Directly connected Tunnel Interface 
B 172.16.0.0/12 via 169.254.217.69, vpnt1000, cost None, age 267556 -> RFC1918 coming indirectly from DirectConnect 
B 192.168.0.0/16 via 169.254.217.69, vpnt1000, cost None, age 267556 -> RFC1918 coming indirectly from DirectConnect

 

From the perspective of the gateway, we can see that the complete RFC1918 routing flips over from its previous path (simply following its default gateway) to routing over the VPNs. While this is of course the intent, since I deliberately want to treat this traffic as if it comes from a spoke, as a side effect the routing towards the On-Prem Mgmt server flips over as well.
This is essentially what causes the problem: a CP Gateway doesn't like being managed through a VPN tunnel terminated on itself... While technically this could work, I do realize it would create a bit of a chicken-and-egg problem: if the VPN goes down, you can't fix it, because you have lost Mgmt connectivity.
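The route flip is simply longest-prefix-match at work: once BGP injects 10.0.0.0/8 over the tunnel, that more specific prefix wins over the 0.0.0.0/0 default for the Mgmt server's address. A minimal sketch of the selection logic (plain illustrative bash; the route entries are taken from the tables above):

```shell
#!/bin/bash
# Convert a dotted-quad address to a 32-bit integer
ip2int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Longest-prefix-match: print the next-hop of the most specific route
# covering $1. Routes are passed as "prefix/len=nexthop" arguments.
lpm() {
  local dest len best_len=-1 best_nh="" entry prefix nh
  dest=$(ip2int "$1"); shift
  for entry in "$@"; do
    prefix=${entry%%/*}
    len=${entry#*/}; len=${len%%=*}
    nh=${entry#*=}
    local mask=$(( len == 0 ? 0 : 0xFFFFFFFF << (32 - len) & 0xFFFFFFFF ))
    if (( (dest & mask) == ($(ip2int "$prefix") & mask) && len > best_len )); then
      best_len=$len; best_nh=$nh
    fi
  done
  echo "$best_nh"
}

# Before BGP: only the default route exists, so Mgmt traffic leaves via eth0
lpm 10.0.9.4 "0.0.0.0/0=10.10.240.1"    # -> 10.10.240.1 (eth0)
# After BGP: 10.0.0.0/8 is learned over the tunnel and wins
lpm 10.0.9.4 "0.0.0.0/0=10.10.240.1" "10.0.0.0/8=169.254.217.69"    # -> 169.254.217.69 (vpnt1000)
```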

Solution:
To solve the above issue, I used the option to deploy a custom script during the deployment phase of a new Gateway with autoprov-cfg. See the "-cg CUSTOM GATEWAY SCRIPT" option:

 

[Expert@FW-MGMT-CP-A:0]# autoprov-cfg show all 
... 
templates: FW-HUB-AWS-EW-TEMPLATE: 
    custom-gateway-script: "/home/admin/FW-HUB-AWS-EW.sh" ...

 

This is a very simple script which configures a static /32 route (or, to your liking, a route towards your On-Prem Mgmt CIDR) and points it at the native VPC/Subnet default gateway.

 

#!/bin/bash
# Pin the On-Prem Mgmt server to the native VPC/Subnet default gateway,
# so its route survives the BGP-learned RFC1918 routes over the VPN.
MGMT_IP="10.0.9.4/32"
# Next-hop of the default route (the VPC router)
GATEWAY_IP="$(ip route show default | awk '{print $3; exit}')"
echo "Setting static-route for $MGMT_IP to $GATEWAY_IP"
clish -c "set static-route $MGMT_IP nexthop gateway address $GATEWAY_IP on"
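If you prefer not to depend on the default-route lookup at all, the VPC router address can also be derived directly from the subnet CIDR, since AWS always reserves the first host address of a subnet (e.g. 10.10.240.1 in 10.10.240.0/28) for the VPC router. A small illustrative helper (plain bash; the CIDR value is just an example):

```shell
#!/bin/bash
# AWS reserves the first host address of every subnet for the VPC router,
# so the default gateway of 10.10.240.0/28 is 10.10.240.1.
# Assumes the argument is the network address of the subnet.
vpc_router_ip() {
  local net=${1%%/*}
  local IFS=. a b c d
  read -r a b c d <<< "$net"
  # network address + 1 = VPC router
  echo "$a.$b.$c.$((d + 1))"
}

vpc_router_ip 10.10.240.0/28   # -> 10.10.240.1
```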

 

 

Source Hide NAT uses VTI:

Inherent to the design of the ASG setup, all traffic passing through a Gateway must be NATted, to enforce symmetric return traffic. Again, the documentation on how to set this up exists, but doesn't cover all use cases. See section 4B, "Add a Manual NAT rule that hides all internet sources behind the Transit Gateway": https://sc1.checkpoint.com/documents/IaaS/WebAdminGuides/EN/CP_CloudGuard_AWS_Transit_Gateway/Conten...

Setup/Assumptions:
A similar setup as above is used.
A general Hide NAT rule is applied at the end of the NAT policy deployed on the ASG Gateways.
 
Problem:
When configuring NAT as described in the configuration guide, a gateway uses the IP of its outgoing interface, selected via its routing table, as the address to Hide NAT behind.
For traffic initiated On-Prem and destined for Spokes, this poses no problem. Traffic between Spokes, or initiated within a Spoke towards the Internet, also works fine. However, traffic initiated inside a Spoke and destined for On-Prem suffers from this issue.
I'll explain in detail by following a packet: ping from a VM in a Spoke VPC -> some VM On-Prem.

Spoke VM -> Spoke VPC Route Table -> TGW -> TGW Spokes Route Table -> ECMP VPN to Gateway -> Gateway Policy + Routing Table + Hide NAT (behind outgoing interface) -> VPN to TGW -> TGW Checkpoint Route Table -> Direct Connect Propagation -> On Prem Routing -> Destination VM
 
At this point the destination VM will simply try to reply, and technically there is nothing preventing it from doing so. The destination IP address it uses is critical in this flow, since it is the Hide NAT address used by the Gateway. Since that address was based on the outgoing interface, it will be the VPN tunnel interface IP. In the case of AWS, this is somewhere in the link-local range 169.254.0.0/16, see https://docs.aws.amazon.com/vpn/latest/s2svpn/VPNTunnels.html. This should ring a bell for most of you, and some flags should be raised when using this specific CIDR: in most networks it cannot be routed (e.g. Azure), as was the case for me.
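A quick way to sanity-check whether a NAT address has fallen into the non-routable link-local range (a tiny bash helper, purely illustrative; the sample addresses are the ones from the route tables above):

```shell
#!/bin/bash
# Return success (0) if the address is inside link-local 169.254.0.0/16,
# i.e. an address most networks will refuse to route.
is_link_local() {
  case $1 in
    169.254.*) return 0 ;;
    *)         return 1 ;;
  esac
}

is_link_local 169.254.217.69 && echo "non-routable tunnel IP"
is_link_local 10.10.240.14   || echo "routable, safe to Hide NAT behind"
```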
 
Solution:
A couple of solutions are possible, but I opted for the following:

In the NAT policy, I created an extra rule to Hide NAT all RFC1918 traffic behind a Dynamic Object called AwsLocalGatewayNat.
Before that rule, I created a no-NAT rule to avoid NATting traffic towards the Mgmt server.
The last NAT rule can remain, making sure traffic to the Internet is still NATted behind its routed interface.

After the gateways have fully deployed, along with the VPN etc. (see problem 1 above), I run a custom script through the SmartConsole Script Repository which basically does the following:

 

1. Calculate a new IP address
2. Create a dynamic object on the gateway itself referring to this IP
3. Create a static blackhole route towards the IP
4. Redistribute the static route into BGP

 

#!/bin/bash
# Derive a unique, routable Hide NAT address from the gateway's own eth0 IP
# (here: last octet + 32; make sure the result fits your addressing plan).
ORIG_IP="$(clish -c "show interface eth0" | grep ipv4-address | awk '{print $2}' | cut -d "/" -f 1)"
FIRST_DIGITS="$(echo $ORIG_IP | cut -d "." -f 1-3)"
LAST_DIGIT="$(echo $ORIG_IP | cut -d "." -f 4)"
NAT_IP="$FIRST_DIGITS.$((LAST_DIGIT + 32))"
echo "Using $NAT_IP as Hide NAT address for $ORIG_IP"

# 1. Dynamic object that the AwsLocalGatewayNat NAT rule resolves to on this gateway
dynamic_objects -n AwsLocalGatewayNat -r $NAT_IP $NAT_IP -a
# 2. Blackhole route so the prefix is locally originated
clish -c "set static-route $NAT_IP/32 nexthop blackhole"
# 3. Redistribute the /32 into BGP towards the TGW
clish -c "set route-redistribution to bgp-as 64512 from static-route $NAT_IP/32 on"
clish -c "set routemap ex-0/0-1 id 10 match network $NAT_IP/32 exact"
clish -c "set routemap ex-0/0-2 id 10 match network $NAT_IP/32 exact"

 

 
This solution is not perfect, since it requires a manual intervention after every deployment or scale-out event, but it works for our environment.
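One caveat with the last-octet + 32 trick in the script above: if the gateway's address sits high enough in the subnet, the result can overflow past 254 or land in a neighbouring range. A small guard worth adding (illustrative bash; the offset of 32 matches the script):

```shell
#!/bin/bash
# Compute the derived Hide NAT address and refuse values that would
# overflow the last octet (the script above assumes this never happens).
derive_nat_ip() {
  local ip=$1 offset=${2:-32}
  local IFS=. a b c d
  read -r a b c d <<< "$ip"
  local last=$((d + offset))
  if (( last > 254 )); then
    echo "ERROR: $ip + $offset overflows the last octet" >&2
    return 1
  fi
  echo "$a.$b.$c.$last"
}

derive_nat_ip 10.10.240.14      # -> 10.10.240.46
derive_nat_ip 10.10.240.230 || echo "pick another offset"
```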
 
Summary:
I hope you made it all the way to the end and could learn something from my experiences. Feel free to comment on possible improvements, share whether you ran into similar issues and how you solved them, or just give some kudos.
Finally, I want to thank @Geert_De_Ron and @Jonathan_Lebowi again for their excellent support and willingness to help out. Let me leave you with an excellent presentation by @Jonathan_Lebowi touching upon some of these topics: https://www.brighttalk.com/webcast/16731/400673
Another great source of info was the GitHub repo by @Arnfinn_Strand: https://github.com/arnstran/CHKP-AWS_TGW
1 Reply
Geert_De_Ron
Employee

Thanks for this great sharing of knowledge @FrederikSuijs !!
