Tiago_Cerqueira
Contributor

VPN issue with IKEv2 and Cisco ASA

Hi,

Last week we upgraded our security gateway from R77.30 to R80.20. After this upgrade, we lost connectivity on one of our VPNs. This VPN is with a third-party gateway, a Cisco ASA, and we are using IKEv2.

The issue is weird and I've isolated the following things:

1) If the negotiation is triggered on the ASA side, everything works as expected. As a workaround, they are bouncing the tunnel on their side and generating traffic towards us (if we are the first to generate traffic it won't work), which allows us to connect.

2) If we initiate the connection, we are unable to reach the other side of the VPN, but they are able to reach our network. So traffic generated on their side of the VPN always reaches us without issues.

3) Child SAs are only being negotiated on rekeys; I'm assuming the first time they are created is in the IKE_AUTH exchange, as per the RFC.

 

I have a case open with TAC, but so far no meaningful replies. I can also share the vpnd.elg files, as well as the ikev2.xmll files, if you are interested in taking a look.

 

Thanks

26 Replies
Timothy_Hall
Champion

Two guesses:

1) You had a custom subnet_for_range_and_peer directive defined in the $FWDIR/conf/user.def.R77CMP file on your SMS, and when the gateway was upgraded to R80.10+ this file no longer applied.  Any special directives in the old file need to be copied to the $FWDIR/conf/user.def.FW1 file on the SMS and policy reinstalled to apply to the new gateway version.  sk98239: Location of 'user.def' files on Security Management Server

2) You had a custom kernel definition affecting the VPN in the $FWDIR/boot/modules/fwkern.conf, $FWDIR/boot/modules/vpnkern.conf or $PPKDIR/boot/conf/simkern.conf file(s) on the upgraded gateway itself that did not survive the upgrade process.
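A quick way to check both of those guesses on the boxes themselves (a rough sketch; paths are the standard ones from the SKs, adjust for an MDS/Domain environment):

# On the SMS/MDS: which user.def variants exist, and where is the directive currently defined?
ls -l $FWDIR/conf/user.def.*
grep -n subnet_for_range_and_peer $FWDIR/conf/user.def.*
# On the upgraded gateway: did any custom kernel parameters survive the upgrade?
cat $FWDIR/boot/modules/fwkern.conf $FWDIR/boot/modules/vpnkern.conf $PPKDIR/boot/conf/simkern.conf 2>/dev/null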

If it is neither of those things, try disabling SecureXL VPN acceleration for that peer and see if it impacts the issue: sk151114: "fwaccel off" does not affect disabling acceleration of VPN tunnels in R80.20 and above
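For reference, sk151114 describes a per-peer command for this, roughly as follows (from memory, so confirm the exact syntax in the SK before using it):

vpn accel off <peer_gateway_IP>   # stop accelerating tunnels with that specific peer
vpn accel on <peer_gateway_IP>    # re-enable acceleration once testing is done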

Also watch out for sk116776: Instability issues in VPN Tunnel with Cisco using IKEv2

Tiago_Cerqueira
Contributor

Hi Timothy,

 

So, you definitely have something there... When this tunnel was created, an entry was indeed added to the user.def file. However, this was done in a different file location than the one mentioned in sk98239. We have an MDS, but according to the SK the file shouldn't be defined there. The subnet_for_range_and_peer was defined under /var/opt/CPmds-R80.20/conf/user.def.R77CMP. I have since tried to remove this entry and add it to the correct location, under /opt/CPmds-R80.20/customers/<CMA-NAME>/CPsuite-R80.20/fw1/conf/user.def.FW1, and installed policy, but no success. I've also tried to add this entry under /var/opt/CPmds-R80.20/conf/user.def.FW1, also without success. I did run fw tab -t subnet_for_range_and_peer, which shows the correct entry for this VPN on the gateway after installing policy; however, I was still experiencing the same issues.
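For completeness, the gateway-side check was along these lines (the -u flag just removes the display limit on entries):

fw tab -t subnet_for_range_and_peer -u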

I've tried disabling fwaccel as well as vpn accel, without success. As far as custom kernel definitions go, I checked and couldn't find any...

 

I believe the issue is related to the user.def files. I'll point support in that direction, but if you could provide some insight as to why this is still happening, despite the fact that I've moved the definitions, I'd appreciate it.

 

Thanks!

 

Timothy_Hall
Champion

I'm assuming for VPN Tunnel Sharing in the community settings you have it set to "one tunnel per subnet".  As a test try setting it to "one tunnel per pair of hosts" and reinstalling policy.  If the problem goes away you have confirmed that it is indeed a subnet/selector issue and not something else.  In general it is not a good idea to leave it set to "pair of hosts" as a large number of IPSec tunnels can be generated.

 

Tiago_Cerqueira
Contributor

I agree. And if I recall correctly, we tried that when we were first setting up the tunnel and it worked.

 

I'll raise a change to test that but, as you've said, this is not an ideal solution. If this works, what can we do to use "one tunnel per subnet"?

 

Thanks

Timothy_Hall
Champion

You'll probably need to work with TAC and figure out why your subnet-per-peer directive is not working properly as that should definitely work with IKEv2.  Because the directive is showing up on the gateway's tables, it sounds like you have it defined in the correct user.def* instance on the MDS/SMS/Domain. 

You can use "pair of hosts" permanently, but only if you have just a few hosts on each end that need to use the tunnel *and* the Firewall/Network Policy Layer is sufficiently locked down to prevent a large number of tunnels from starting.  With "pair of hosts", a separate IPSec/Phase 2 tunnel is started for every combination of host IP addresses (/32s) that are allowed to communicate.  So if two Class C networks are using the tunnel and the rulebase allows the entirety of the networks to communicate with each other, in theory over 65,000 separate tunnels could try to start, which will quickly bang against the soft limit of 10,000 concurrent tunnels and cause intermittent VPN connectivity.  If PFS is enabled, a separate computationally expensive Diffie-Hellman calculation will occur for each and every IPSec/Phase 2 tunnel, which will cause a massive amount of firewall CPU overhead and further problems.
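A quick back-of-the-envelope illustration of where that number comes from (assuming the rulebase lets two full /24 ranges talk to each other):

echo $((256 * 256))   # 65536 possible /32-to-/32 combinations, each a potential Phase 2 tunnel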

 

Tiago_Cerqueira
Contributor

I'm already working with TAC on that. I'll post updates here once I have them. Do you have any ideas on how to troubleshoot this? I guess I could run a kernel debug and check vpnd.elg after correcting the user.def file, and maybe see if I'm missing something.

 

Thanks for your help so far!

Timothy_Hall
Champion

All IKE negotiations take place in process space via vpnd on the firewall, so you'll need to debug vpnd (vpnd.elg) and probably turn on IKE debugging which is output to ikev2.xmll.  I don't think you'll need to perform kernel-level debugging for this issue, at least not initially.
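The usual sequence for that is roughly the following (a sketch from memory; TAC will normally give you the exact steps for your version):

vpn debug trunc      # truncate the logs and enable vpnd + IKE debugging
# ...reproduce the issue by initiating traffic from your side...
vpn debug ikeoff     # turn IKE debugging back off
vpn debug off        # turn vpnd debugging back off
# IKEv2 negotiations land in $FWDIR/log/ikev2.xmll, vpnd messages in $FWDIR/log/vpnd.elg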

 

Tiago_Cerqueira
Contributor

Hi,

 

So, we've isolated the issue. Apparently the ASA was erroneously detecting the need to use NAT-T during the IKE_INIT phase when we started the communication. My guess is that, when the ASA initiated the communication, it did so by negotiating NAT-T with us (the Check Point is configured to support NAT-T), and that would establish the tunnel successfully and allow communication.

 

The ASA was on version 9.8, for future reference.

Timothy_Hall
Champion

Yep, just saw this with a customer that upgraded from R80.10 to R80.30 and transitioned from a single 4600 to a ClusterXL cluster of 5400s with R80.30 JHFA 50.  Everything worked after the upgrade, except a domain-based site-to-site VPN to a Cisco ASA using IKEv2.  Using ikeview we could see that when the Check Point was initiating, Phase 1 would complete, but when the Check Point sent the Auth packet with the Traffic Selectors and such... no response from the Cisco.  So the Check Point just kept sending the Auth over and over again. vpn accel off had no effect on the issue.

After some lengthy debugging on the Cisco side we found out that the Cisco was determining that NAT-T needed to be used, which is simply wrong as we double-checked and triple-checked there was no NAT between the two peers.  The Auth packet was being silently dropped by the Cisco since it was expecting it to come in on UDP 4500 instead of UDP 500.  Once we set force_nat_t to true via GuiDBedit for the Check Point cluster object the tunnel came up and worked normally. 

This discovery led to a spirited discussion between myself and the Cisco administrator, as he insisted that nothing had changed on his end (which is true), but he took offense when I said the Cisco was "erroneously" starting NAT-T (which is also true).  Clearly Check Point is doing something different in IKEv2 between R80.10 and R80.30 that is tripping up the Cisco ASA in regards to NAT-T; I couldn't see anything that would cause a peer gateway to determine NAT-T was required.  The Peer ID IP address and source IP address on the IKE packets matched exactly.

 

Tiago_Cerqueira
Contributor


@Timothy_Hall wrote:

Clearly Check Point is doing something different in the IKEv2 Auth packet between R80.10 and R80.30 that is tripping up the Cisco ASA in regards to NAT-T; I looked at every bit in the Auth packet and couldn't see anything that would cause a peer gateway to determine NAT-T was required.  The Peer ID IP address and source IP address on the IKE packets matched exactly.

 


My thoughts exactly, and I racked my brain over it because I had some PCAPs and the IKEv2 and NAT-T RFCs side by side and couldn't figure out what Check Point was doing for the ASA to detect it as a NAT-T peer.

Do you have the ASA OS version? I read somewhere there was a bug in a 10.something release regarding NAT-T detection, and I believe my peer was on that release or a subsequent one.

Timothy_Hall
Champion

Never caught the Cisco device version; it was a strange problem to be sure.

Was the firewall you experienced it on part of an HA cluster?  That was the one other thing that changed in our case besides the code version going from R80.10 to R80.30.

Tiago_Cerqueira
Contributor

Yes, it was also an HA cluster. But in our case it was a cluster upgrade, and because we kept it as a cluster, I don't know if it could be something related to ClusterXL. I've been meaning to test this in a lab environment but haven't gotten around to it, unfortunately.

Luis_S
Explorer

Hello, Tiago and all. Thank you for posting this to the forum, since it was very helpful today.

I had the exact same issue today. Nothing in the Check Point Knowledge Base helped; thankfully, Google led me to this thread 🙂

What I can add is that, for troubleshooting purposes, we changed the encryption method to "IKEv1 only" on both the Cisco side and the Check Point side, and the tunnel and traffic worked fine.

If we switch back to IKEv2, the tunnel is up and traffic reaches the Cisco side, but it does not return to the Check Point.

We needed to disable forced NAT-T on the Cisco side (I did not try force-enabling it on the Check Point side) so that everything would work fine again.

So, on R77.30 it was working with encryption method IKEv2 yesterday, and after last night's upgrade to R80.20 it stopped working.

No errors on the Check Point side, but I did not do a full debug; I only disabled acceleration for the Cisco peer, for debugging purposes.

Can anyone raise this for investigation?

I did not open a case, because we needed to fix it or migrate the tunnel ASAP.

Thank you all 🙂

Timothy_Hall
Champion

Yeah, I have not had the best of luck with IKEv2 in interoperable VPN situations.  IKEv1 has been around a long time and works well.

 

johnnyringo
Advisor

I seem to be encountering a variation of this issue.  Our side is a standalone R80.40 gateway running on Google Cloud Platform, patched to the latest version.  Gateways in public cloud are going to be behind NAT; however, we do allow udp/500, udp/4500, and ESP, so while NAT-T is supported, it is not required.  The other side is a Cisco ASA with software 9.8, not behind NAT.
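For context, the GCP side of "allow udp/500, udp/4500, and ESP" is just a firewall rule along these lines (illustrative only; the rule name, network name and source range are placeholders):

gcloud compute firewall-rules create allow-ipsec \
    --network=my-vpc --direction=INGRESS \
    --allow=udp:500,udp:4500,esp \
    --source-ranges=<peer_public_ip>/32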

We keep running into strange one-way traffic behavior, where traffic from them to us works fine, but if we initiate traffic on our end, the Check Point seems to attempt to bring up a second tunnel, which ultimately fails with a "No proposal chosen" message.  Initially this was blamed on an encryption domain mismatch, but after correcting that the problem persists.

Unfortunately, going back to IKEv1 is not an option as it doesn't meet the other side's security standards, nor is disabling NAT-T on the Cisco, since it's a global setting and may be a requirement for other tunnels (Palo Alto and FortiGate allow NAT-T control at the tunnel level, which is a nice feature, BTW).

And yes, there was definitely a NAT-T/IKEv2 behavior change made by Check Point between R77 and R80.30.  I know because this is not the first time we've had a problem with it.

Tiago_Cerqueira
Contributor

Hi,

I would love to look at a packet capture of that issue 🙂 Do you have one?

 

And yes, from R77 to R80.30 I'm definitely sure they changed behaviour.

johnnyringo
Advisor

I am working on a lab setup with an ASA later this week.  I've already tried to replicate the behavior with a Cisco CSR1000v and a Palo Alto VM-300, both in AWS.  The first interesting thing is that, while there were zero issues with IKEv1 regardless of NAT-T, with IKEv2 I noticed the Check Point always negotiates NAT-T, even if it's completely disabled on the other side.  If the Cisco side has no crypto ipsec nat-transparency udp-encapsulation set in IOS, or the Palo Alto has Enable NAT traversal unchecked, packet captures will show ESP from the other end (198.51.100.188) but the Check Point (10.10.100.40) trying to reply with NAT-T and then complaining of an invalid SPI.

12:57:13.022722 IP 198.51.100.188 > 10.10.100.40: ESP(spi=0x14eec13d,seq=0x6), length 148
12:57:15.136804 IP 10.10.100.40.isakmp > 198.51.100.188.isakmp: isakmp: parent_sa ikev2_init[I]
12:57:16.102397 IP 10.10.100.40.ipsec-nat-t > 198.51.100.188.isakmp: isakmp:
12:57:17.091696 IP 198.51.100.188.isakmp > 10.10.100.40.isakmp: isakmp: child_sa inf2[I]
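A capture like the one above can be taken on the gateway with something along these lines (the interface name is a placeholder); the filter shows IKE, NAT-T and ESP together:

tcpdump -nni eth0 'udp port 500 or udp port 4500 or ip proto 50'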

This sounds oddly like the behavior from sk165003, which was supposedly fixed a few months ago in R80.30 Take219 or R80.40 Take83.  I'll open a case now, but I'm not optimistic it will be productive, as it's the 4th case I've opened and TAC never even matched it to that bug.

Another funny thing: the Check Point is sending its main address in the IKE header, even if Link Selection -> Always use this IP address -> Statically NATed IP and Source IP Address Settings -> Manual -> IP address of chosen interface have been configured in SmartConsole.  The Palo Alto shows this quite clearly:

received ID_I (type ipaddr [10.20.30.39]) does not match peers id 02/24 09:23:48
ignoring unauthenticated notify payload (NAT_DETECTION_DESTINATION_IP) 02/24 09:23:48
ignoring unauthenticated notify payload (NAT_DETECTION_SOURCE_IP) 02/24 09:23:48

The only way to fix this is to set the other side to expect the private IP in the "Identification" field.  FortiGates suffer from a similar bug, described here.  This is probably specific to standalone gateways in GCP, since clusters use the shared public IP as the Main IP address.

What still doesn't add up is that if both sides negotiate IKEv2 and NAT-T, everything should work, and I've confirmed this is the case with the CSR1000v and VM-300, whether the VPN is route-based or policy-based.

Timothy_Hall
Champion

Yeah, you are kind of stuck there.  If this is the only VPN configured on your Check Point you can force NAT-T by setting force_nat_t to true in GuiDBedit on the Check Point gateway object, but this is a gateway-level setting that will impact all site-to-site VPNs that gateway participates in.
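If you go that route, the flag lives on the gateway/cluster object (in GuiDBedit it sits under the object's VPN section, from memory) and needs a policy install afterwards. In principle the same change can be scripted with dbedit, roughly like this; treat it as a sketch only and verify the attribute path in GuiDBedit first:

dbedit -local
modify network_objects <gateway_or_cluster_name> VPN:force_nat_t true
update network_objects <gateway_or_cluster_name>
quit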

johnnyringo
Advisor

Forcing NAT-T would actually work fine in our case.  Like I said, NAT-T is supported, but it's not a requirement since the Check Point has a 1-to-1 NAT and is accepting udp/500 + ESP.  The root problem, though, is that even with a "successful" NAT-T tunnel, traffic initiated behind the Check Point seems to fail matching SAs on the Cisco.  NAT-T could be a complete red herring, but it's unnerving that we keep seeing it with IKEv2 tunnels and don't with IKEv1.

Hopefully I find out more when I can lab an ASA and observe the behavior with IKEv1 vs v2.  

johnnyringo
Advisor

It's been almost 2 months, but I'm still working on this case and wanted to follow up.  There are essentially 3 different issues I've been observing:

  1. With IKEv2, NAT-T is always negotiated, even if not required.  This is expected; the RFCs for IKEv2 state that each side should compare its interface IP with the public IP configured for the tunnel, then negotiate NAT-T if there's a mismatch.  This is somewhat flawed logic, because both sides could be behind one-to-one static NATs where simply forwarding udp/500 and ESP will work fine, but different vendors seem to interpret the RFC differently and will just flip on NAT-T for any IKEv2 connection.  AWS and GCP deployments can be assumed to negotiate NAT-T, since the public IP address is not directly owned by the VM.  I'm not sure about Azure.
  2. With Check Points, the IKEv2 ID is always the Main IP of the gateway or cluster.  This cannot be overridden via Link Selection, despite what support tells you.  The only way to change the IKEv2 ID is to change the Main IP of the gateway or cluster.
  3. Policy-based VPNs with IKEv2 have an additional problem: even after building IKEv2 SAs, the Check Point will still send its main IP in the IPsec Transform Sets, so SAs generated for traffic initiated from the Check Point side will usually fail.  Fortunately there is a fix for this: in Link Selection -> Always use this IP address -> Statically NATed IP, enter the public IP address of the gateway.  Also, don't forget to set Source IP address settings -> IP address of chosen interface.

Related thread: How do I change the local id for an IKEv2 IPsec VPN? 

Tiago_Cerqueira
Contributor

Could your finding number 2 possibly constitute a bug in the software? Have you reported that to support, and is R&D involved?

johnnyringo
Advisor

Yes, this is a bug or software defect, and Tier 3 support has essentially confirmed that.  Just fire up a VPN to a 3rd-party interop device and examine the IKEv2 ID fields for yourself, which can be done in 20 minutes spending $2 in AWS fees.  If Check Point is too lazy or incompetent to make this investment in their own products, why would I continue to invest any more time and effort in them?

Tiago_Cerqueira
Contributor

Sorry it took me this long to check this message; it must have slipped through my email queue...

Did you manage to solve the issue?

Klaas
Participant

Hello,

We figured out that this happens with a Cisco CSR as well.

In addition, we see that the Check Point does not respond to UDP-encapsulated ESP packets from the CSR (in the case of this NAT-T behavior).

 

Have a new idea for a bug bounty program:

For each bug CP should extend the CCSM Cert. for one month 🙂

 

Vincent_Bacher
Advisor



@Klaas wrote:

Have a new idea for a bug bounty program:

For each bug CP should extend the CCSM Cert. for one month 🙂

 


👍👨‍🎓

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite
Tiago_Cerqueira
Contributor

My memory on this subject is not fresh (it happened almost a year ago), but I believe it doesn't reply because, when one side is using NAT-T and the other isn't, the checksums (not sure if it's the checksums or some other value) don't match on the opposite side. The firewall drops it correctly because it believes this to be a replay attack.

I do remember going through the RFC, the PCAPs and the debugs, and I couldn't find a single deviation from the RFC on the Check Point side.

 

No CCSM for me, but I would appreciate a voucher for the TAC courses 😄 

