Solved: Re: VPN route based query

the_rock · ‎2024-06-11

Hey everyone,

Hope someone can give a good suggestion/idea about this. I was helping a hospital with a route-based VPN tunnel from their CP cluster to a Palo Alto firewall. The tunnel has been there since 2020, I think, but it has always worked intermittently.

The PAN guy was saying that for some odd reason, when there is an issue, they see the ID on their end as 0.0.0.0, though it should be 169.254.0.103, which is what's configured. I'm not really sure why that would happen and whether it's something related to CP or Palo Alto.

For context, all other VPN tunnels on the CP side work just fine; it's just this one. Any clue as to why the ID would show differently on the peer side?

We did not have time to really debug, so simply ended up lowering the encryption methods and that brought up the tunnel.

Thanks as always for any ideas. Just for context, we tried an unnumbered VTI as well. That would send ID 0.0.0.0, so that's expected, but a numbered VTI definitely should NOT.

Andy

Best,
Andy

the_rock · ‎2024-07-10

Though we won't know for sure until the next maintenance window, I'm fairly positive this is NOT a CP firewall issue at this point. Thanks a lot @AmirArama for being so kind to get the IKE trace file and confirm that the ID being sent to the PAN firewall is not 0.0.0.0, as they claimed during the call. We also verified with TAC that the correct auth methods were being used, so if they have the issue again, I will advise them to open a case with Palo Alto support and investigate further.

Appreciate everyone who responded.

Best,

Andy

Timothy_Hall · ‎2024-06-12

When you say "ID" do you mean Proxy-ID (subnets/domains) or IKE Peer ID?

By default Palo Alto uses route-based VPNs and will propose a universal tunnel (0.0.0.0/0, 0.0.0.0/0 - one tunnel per gateway pair) in IKE Phase 2, although it can be configured to mimic a domain-based VPN and propose specific subnets, similar to "pair of subnets" on the Check Point side. Whether you are using an unnumbered or numbered VTI doesn't affect the Proxy-ID negotiation, at least to my knowledge.

Using IKEv1 I presume? IKEv2 has had some rather nasty interoperability issues, the most prominent of which was tunnel narrowing.
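For anyone unfamiliar with narrowing, here is a rough Python sketch of the idea: one side proposes a universal selector (0.0.0.0/0) and the other side answers with only the subnets it actually has configured. The subnets below are made-up documentation addresses (the thread only mentions /24, /27, /28 and /32 sizes), and real IKEv2 narrowing per RFC 7296 works on address ranges and ports, so treat this as an illustration only, not how either vendor implements it.

```python
import ipaddress

def narrow(proposed, local_policy):
    """Return the selectors a responder could accept, limited by its own policy."""
    accepted = []
    for want in proposed:
        want_net = ipaddress.ip_network(want)
        for have in local_policy:
            have_net = ipaddress.ip_network(have)
            if have_net.subnet_of(want_net):
                accepted.append(have_net)   # peer asked for more than we allow: narrow it
            elif want_net.subnet_of(have_net):
                accepted.append(want_net)   # peer asked for less: accept as proposed
    return sorted(set(accepted))

# Hypothetical encryption domain, for illustration only
LOCAL = ["192.0.2.0/24", "198.51.100.0/27", "203.0.113.16/28", "203.0.113.200/32"]

universal = narrow(["0.0.0.0/0"], LOCAL)      # narrowed down to the four local entries
specific = narrow(["192.0.2.128/25"], LOCAL)  # already inside the /24, accepted as-is
unrelated = narrow(["10.0.0.0/8"], LOCAL)     # empty: nothing in common
```

An empty result is the case where a real responder would answer with TS_UNACCEPTABLE. Interoperability trouble tends to show up when the two vendors disagree about whether the narrowed answer is acceptable.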

Another line of inquiry would be if the tunnel being initiated in one direction or another is affecting the stability. So for example if the Palo initiates the tunnel it is stable, but when the Check Point initiates the tunnel it is not.

Also check the obvious things, like making sure the Phase 1 & Phase 2 timers match and that the Palo is not configured for a data lifesize, idle timer, or anything else that could bring down the tunnel prematurely and affect stability.
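One low-tech way to do that check is to write down both sides' settings and diff them. A minimal Python sketch follows; the setting names and values are placeholders invented for the example, not pulled from either firewall's configuration or API.

```python
def diff_settings(check_point, palo_alto):
    """Return {setting: (cp_value, pan_value)} for every setting that differs."""
    keys = sorted(set(check_point) | set(palo_alto))
    return {
        k: (check_point.get(k), palo_alto.get(k))
        for k in keys
        if check_point.get(k) != palo_alto.get(k)
    }

# Placeholder values for illustration only
cp = {"ike_version": 2, "p1_lifetime_s": 86400, "p2_lifetime_s": 3600,
      "pfs_group": 14, "p2_lifesize_kb": None}
pan = {"ike_version": 2, "p1_lifetime_s": 28800, "p2_lifetime_s": 3600,
       "pfs_group": 14, "p2_lifesize_kb": 4608000}

mismatches = diff_settings(cp, pan)
for name, (cp_val, pan_val) in mismatches.items():
    print(f"{name}: Check Point={cp_val} Palo Alto={pan_val}")
```

With these placeholder numbers, the Phase 1 lifetime and the lifesize come back as mismatches, which are exactly the kind of quiet differences that can tear a tunnel down early on one side only.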

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

the_rock · ‎2024-06-12

Hey Tim,

Thanks a lot for your insight, as always. I meant IKE peer ID, not Proxy-ID; I should have clarified.

Funny thing is, I asked the guy on their end, who uses the Palo Alto firewall, about the route-based tunnel, and he said they do have an actual enc domain configured, which consists of 4 subnets (a combo of /24, /27 and /28 nets) and a single host (/32).

As I mentioned, I find it so odd that on the CP side, this is the ONLY tunnel, out of probably 10 or so, that they constantly have issues with; all the other ones work fine. Ironically enough, the guy told me that on the other end, they also only have issues with this tunnel to CP lol

Anywho, yes, we confirmed phase 1 and phase 2 settings many times and any time you lower enc. settings, all works fine, no issues, but if you try using say aes-256 / sha256/384/512, nothing works.

I'm really not sure how to approach this at this point... I will see if I can have a remote session with the guy who has access to the Palo Alto firewall, as I'm somewhat familiar with those. It's been some time since I worked on them, but I do remember the basics. I wish their VPN config was as easy as Fortinet, which I find works 100% of the time.

Oh well, not everything is simple, I guess : - )

Andy

Timothy_Hall · ‎2024-06-13

Lowering the strength of the encryption algorithms shouldn't make a difference, but one point of pain in the past I've seen has been PFS which can be spotty as far as interoperability. Maybe try leaving PFS off and increasing the other algorithm strengths for Phase 1 and Phase 2 and see what happens?

Only other thing I can think of is that there are some differences in the VPN code implementations in sim/fastpath vs. on the worker cores, especially with some of the higher-bit hash algorithms, which were implemented in sim/SecureXL relatively recently. You could try excluding only this VPN peer and its tunnels from SecureXL acceleration with vpn accel off (peer IP) and see if that stabilizes things: sk151114: "fwaccel off" does not have an effect on disabling acceleration of VPN tunnels in R80.20 a...

the_rock · ‎2024-06-13

I know PFS was on, DH group 14, so something to consider as well, though we did change it to group 2, but kept PFS. Did not touch vpn accel, but definitely something to consider.

Thanks Tim.

Andy

CaseyB · ‎2024-06-19

Since you bring up the encryption settings, is the Palo limiting their proposed encryption settings? Check Point has issues with getting a ton of options. I had this issue a few months ago with a vendor migrating to Palo.

the_rock · ‎2024-06-19

I honestly don't know mate, sorry. All I know is that this worked fine before, but I'm not the guy to wonder why things worked in the past and don't now... I was 5 years younger 5 years ago and now I'm not lol

Anywho, we will have a call next week and see how they wish to proceed. I'm thinking we may get TAC cases going with both CP and PAN support and see how far we get. Like anything, we have to eliminate certain things to get a logical sense of what's causing this. For now, the tunnel works fine, but we do not want to leave it as such with weak encryption settings. There is an app doctors and nurses use through this tunnel to access patient records, so if the tunnel is broken, it's literally a matter of life and death, considering this is used strictly in that hospital's emergency room, so we have to make sure 100% that the tunnel is up, no matter what.

Andy

CaseyB · ‎2024-06-19

I hear you; I specifically manage tunnels for a hospital. Lots of vendor interoperability going on here.

If it is the same thing, these are the messages I was finding.

Child SA exchange: Sending notification to peer: No proposal chosen MyMethods Phase2: AES-GCM-256 + HMAC-SHA2-256, No IPComp, No ESN, Group 19 (256-bit random ECP group)
Initial exchange: Sending notification to peer: Invalid Key Exchange payload
Child SA exchange: Exchange failed: timeout reached.

IKEv2 negotiation for Site-to-Site VPN tunnel with 3rd party peer fails if IKEv2 SA payload contains...

the_rock · ‎2024-06-19

Hi @CaseyB ... thanks for that sk. I don't believe it's relevant here sadly, as the errors mostly showed no response from peer. Plus, they are on the latest R81.20 version + jumbo 65.

Below is snippet of some errors.

Andy

Line 89966: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][tunnel] tnlmon_init: added gateway = 10.66.52.18 to GWArray

Line 89970: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][IKED] get_iked_handler_for_address: daemon index for 10.66.52.18 is 2. number of iked daemons: 3

Line 89971: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][IKED] is_handled_by_local_daemon: this is IKED 0, 10.66.52.18 is handled by IKED 2

Line 89987: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][tunnel] RIM_Sync: Updating route for peer <10.66.52.18, 1>

Line 89988: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][tunnel] tnlmon_db_get: Failed to get gateway = 10.66.52.18, type = 1 values

Line 89989: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][tunnel] RIM_Sync: failed to retrieve the attributes for dbkey = <10.66.52.18, 1>

Line 93806: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:41:05] GetEntryIsakmpObjectsHash: received ipaddr: 10.66.52.18 as key, found fwobj: NULL

Line 93807: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:41:05] InsertXIsakmpObjectsHash: inserting interface: 10.66.52.18 for object: g_htrs

Line 93888: [iked0 4129 4067185088]@cmhfw1[11 Jun 20:41:05][tunnel] VIF_Diff_GetVPNIfTable: Inserting <36, 10.66.52.18>

Duane_Toler · ‎2024-06-19

Is 10.66.52.18 the Palo peer? If so, I notice the debug says this is being handled by IKED instance 2. Have you looked through that debug? $FWDIR/log/iked2.elg
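If the debug is large, a few lines of Python can pull out which IKED instance owns a given peer. This is only a sketch that pattern-matches the `is_handled_by_local_daemon` message exactly as it appears in the snippet posted above; the function name is made up, and the `iked<N>.elg` naming is assumed from the file mentioned in this post.

```python
import re

# Matches the "is_handled_by_local_daemon" debug message quoted in this thread
HANDLER_RE = re.compile(
    r"is_handled_by_local_daemon: this is IKED (?P<local>\d+), "
    r"(?P<peer>\d{1,3}(?:\.\d{1,3}){3}) is handled by IKED (?P<owner>\d+)"
)

def iked_log_for_peer(lines, peer_ip):
    """Return the per-daemon log name (e.g. 'iked2.elg') for peer_ip, or None."""
    for line in lines:
        m = HANDLER_RE.search(line)
        if m and m.group("peer") == peer_ip:
            return f"iked{m.group('owner')}.elg"
    return None

# Sample line copied from the debug snippet in this thread
sample = [
    "[iked0 4129 4067185088]@cmhfw1[11 Jun 20:31:59][IKED] is_handled_by_local_daemon: "
    "this is IKED 0, 10.66.52.18 is handled by IKED 2",
]
log_name = iked_log_for_peer(sample, "10.66.52.18")
print(log_name)
```

In practice you would feed it the lines of the real debug file instead of the `sample` list.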

Does one side have "Prefer IKEv2 but support IKEv1" enabled? I had a VPN on a Check Point gateway talking to a peer with an unknown device and I was seeing dual SAs being attempted. One of the ends did not appreciate that and we had a flaky VPN, too. We swapped the Check Point side to IKEv2 only and it helped.

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack

the_rock · ‎2024-06-19

That's right Duane, that IP is the PAN side of the tunnel. We tried all the combinations to make it work with strong enc. algorithms and nothing helped, until we switched to IKEv1 on both sides and lower enc. methods.

I think what @AmirArama said is definitely best idea. We will have call next week with everyone and see how they wish to proceed. I will update the thread once we have more info.

Thanks as always for the help.

Best,

Andy

Duane_Toler · ‎2024-06-19

Yeah I agree. Debug it as hard and heavy as you can for whatever maintenance window you can get. That's wild. Good luck and godspeed to all. Maybe you can get TAC Tier 3 or High-End team on it, too.

Another item I haven't seen mentioned here yet: Do you have DPD / Permanent Tunnels enabled on this VPN? Likewise, does PAN have DPD and/or its own "Tunnel Monitoring" enabled? In a completely grabbing-at-straws effort, I scoped out an off-the-cuff PAN article about their version of it, which also doesn't require RFC-compliant DPD. I have zero intel about this, tho. My PAN-fu is nil.

the_rock · ‎2024-06-19

Yep, DPD is enabled on both sides, though the same issue was there even when it was not. Man, this is driving me bonkers, but I won't lose more sleep over it...lol

Andy

Duane_Toler · ‎2024-06-20

I dunno how relevant this PAN doc was, but it did say it had two kinds of Tunnel Monitoring: one was DPD and one was a ping of sorts and either of them would bring down the tunnel. https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000ClFaCAK#:~:text=DPD%20is%2...

Again, I have zero background on this and I'm just parsing "vendor words" to mean something. This might not even be useful, but in the off chance it is... then ok.

For DPD on the Check Point side, you'll need Permanent Tunnels enabled to do the same. The Site-to-Site VPN admin guide has a whole section on DPD that might be worth a read, plus a ckp_regedit command that could be useful.

Good luck!

the_rock · ‎2024-06-20

Thanks man, will definitely keep that article in mind and send it to the guy that has PAN access. This is why I love the community, EVERYONE is so KIND and HELPFUL.

I truly appreciate it brother.

Have a nice weekend! Well, it's my weekend already, since I have Friday off lol

Andy

the_rock · ‎2024-06-20

Keep in mind that starting with R81 or R81.10, when you enable the permanent tunnel setting inside the community, the interoperable object belonging to that community automatically shows up as DPD in GuiDBedit. We also double-checked and sure enough, that's exactly what it showed, not tunnel_test, which is the default. It all makes sense, considering the customer is on R81.20. I can't recall which jumbo, though that's not relevant in this case.

Best,

Andy

the_rock · ‎2024-06-12

Forgot to say, yes, we ended up using IKEv1, but again, even with IKEv2 it works, just intermittently. I can't say 100% if it's the same with IKEv1, but I will verify with them.

Thanks again.

Andy

AmirArama · ‎2024-06-19

Hi,
I guess it's not related to route-based vs. domain-based VPN.

My suggestion: enable VPN debug on the GW, dig in, and look for the ID being sent. That's the only way I know to analyze the negotiation.

the_rock · ‎2024-06-19

I think you are absolutely correct @AmirArama . The plan is to have the guy on the other end open a PAN case so they can check further. So far, we can't see any obvious issues on the CP end.

Andy

the_rock · ‎2024-06-19

I will mark your post as solution for now, as it definitely makes lots of sense, but will update the thread once we make it work with strong enc. methods to share how it was done.

Best,

Andy

the_rock · ‎2024-07-09

Hey guys,

Just to update on this, we had a remote session with TAC and could not get the tunnel working with higher enc. methods and IKEv2. The TAC guy checked the debugs, no luck there either, so we had to lower the enc. methods to get the tunnel working. I guess at this point it remains a mystery...lol.

We even turned off vpn accel, disabled PFS, exact same issue...at one point tunnel showed as up on CP side, but nothing on PAN end. Their debugs did not show anything useful at all.

Andy

Duane_Toler · ‎2024-07-09

Is this VSX, by chance? Or do you have a gateway with multiple external interfaces and the VPN peer routed out a non-default route interface? I have a customer where I use this scenario and use the BestRoutingSenderIP registry setting to adjust IKE IDs per outgoing interface (for PSK VPNs; obviously, certificates wouldn't need this).

After upgrading to R81.20, we hit a bug in VSX where it was always using 0.0.0.0 as the outgoing IKE ID and even the ESP packets were being sent with source IP 0.0.0.0. Weirdly, only one VPN tunnel was behaving this way, too. This was fixed with a kernel parameter edit, which is documented in sk160672:

fw ctl set -f int ipsec_use_p1_src_ip=1

I even saw this in 'fw monitor':

15:54:56.194367 IP 0.0.0.0 > x.x.x.19: ESP(spi=0x74fe85d6,seq=0x45), length 96

Yours might not be behaving this way, but maybe it's worth running this by TAC.

As Timothy was saying, I've also seen weird AES-256/AES-GCM-256 interoperability issues, but these now seem to have been resolved in earlier JHFs. You said you're on JHF 65 now, so these issues seem statistically less likely to be the case... but perhaps you have a new one.

Good luck, again, and I hope the last round of debugs provided you and TAC with something useful.

the_rock · ‎2024-07-09

Hey Duane,

Thanks for that mate, but it's not VSX. I reviewed the debugs myself, since the TAC guy said it would take some time, and I think I have a pretty good idea what the issue might be.

Will update tomorrow once I verify.

Andy

the_rock · ‎2024-07-10

Hey @Duane_Toler

Here is one thing I find interesting from the debugs, again, this is just me, I could be way off here...

So, this was when the tunnel was set as PERMANENT (though not sure that even matters here), and tunnel mgmt set to "per gateway"

remote peer is 10.66.52.18

it was using ikev1, 3des/sha1

from the debug:

Line 1036: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] peer: (ext addr: 10.66.52.18). peer_ip: 0.0.0.0 Using port 500
Line 1048: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] peer: (ext addr: 10.66.52.18). peer_ip: 0.0.0.0 Using port 500
Line 1063: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] peer: (ext addr: 10.66.52.18). peer_ip: 0.0.0.0 Using port 500
Line 1075: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] peer: (ext addr: 10.66.52.18). peer_ip: 0.0.0.0 Using port 500
Line 1128: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] peer: (ext addr: 10.66.52.18). peer_ip: 0.0.0.0 Using port 500
Line 1140: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] peer: (ext addr: 10.66.52.18). peer_ip: 0.0.0.0 Using port 500
Line 1150: [iked2 4131 4066877888]@cmhfw1[9 Jul 18:45:00][ikev2] SPI: 1e26c09c peer: 10.66.52.18

The CP ID is presenting 0.0.0.0, which sort of makes sense, as it was set as a permanent tunnel, BUT here is the "kicker"... once we disabled permanent tunnel, changed tunnel mgmt from "per gateway" to "per subnet" and installed policy (we also changed to IKEv2, aes256/sha256), it was the exact same issue: it was still presenting 0.0.0.0 to PAN, though the enc domains are ONLY subnets, no host.

To make it work, we changed back to ikev1, low enc methods and bam, all good again.

Here is what I believe... maybe to make this work, we should leave the tunnel as non-permanent, but change to per gateway?

We have not tried that scenario...thoughts?

Andy

AmirArama · ‎2024-07-10

I assume this is debug from the PAN?

I don't know what it means that the PAN says the peer IP is 0.0.0.0. Does that mean they receive it from CP? How exactly? Ask them to show you the exact IKE packet in which we send 0.0.0.0.

Only PAN support can answer that.

There is a difference between a traffic selector of 0.0.0.0, which is negotiated for a universal tunnel (such as with tunnel per gateway or a route-based VPN tunnel), and an ID of 0.0.0.0, which is the ID the gateway sends in the IKE negotiation. (The debug says peer IP, not ID and not TS. Again, ask PAN.)
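To make the ID vs. traffic selector distinction concrete, here is a small Python sketch that decodes both structures following the layouts in RFC 7296 (section 3.5 for the Identification payload, section 3.13.1 for a traffic selector). The function names are made up and the byte strings are hand-built examples, not captures from this tunnel. Also keep in mind that in IKEv2 the ID payloads travel inside the encrypted IKE_AUTH exchange, so a plain packet capture won't show them; the gateway debug is the place to look.

```python
import ipaddress
import struct

# ID types from RFC 7296 section 3.5
ID_TYPES = {1: "ID_IPV4_ADDR", 2: "ID_FQDN", 3: "ID_RFC822_ADDR",
            5: "ID_IPV6_ADDR", 9: "ID_DER_ASN1_DN", 10: "ID_DER_ASN1_GN",
            11: "ID_KEY_ID"}

def parse_id_body(body: bytes):
    """Parse an IDi/IDr payload body: 1 byte type, 3 reserved bytes, then the data."""
    id_type = body[0]
    data = body[4:]
    name = ID_TYPES.get(id_type, f"UNKNOWN({id_type})")
    if id_type == 1:
        return name, str(ipaddress.IPv4Address(data))
    if id_type in (2, 3):
        return name, data.decode("ascii")
    return name, data.hex()

def parse_ts_ipv4(selector: bytes):
    """Parse one TS_IPV4_ADDR_RANGE selector (type 7, 16 bytes long)."""
    ts_type, proto, length, start_port, end_port = struct.unpack("!BBHHH", selector[:8])
    if ts_type != 7 or length != 16:
        raise ValueError("not a TS_IPV4_ADDR_RANGE selector")
    start = ipaddress.IPv4Address(selector[8:12])
    end = ipaddress.IPv4Address(selector[12:16])
    return proto, (start_port, end_port), (str(start), str(end))

# Hand-built example bytes for illustration only
id_ok = bytes([1, 0, 0, 0]) + ipaddress.IPv4Address("169.254.0.103").packed
id_zero = bytes([1, 0, 0, 0]) + bytes(4)
ts_universal = struct.pack("!BBHHH", 7, 0, 16, 0, 65535) + bytes(4) + bytes([255] * 4)

print(parse_id_body(id_ok))         # the ID the peer is expected to see
print(parse_id_body(id_zero))       # the problem case: ID really is 0.0.0.0
print(parse_ts_ipv4(ts_universal))  # universal selector: normal for route-based VPN
```

A universal traffic selector starting at 0.0.0.0 is expected with a route-based tunnel, while an ID_IPV4_ADDR of 0.0.0.0 is not, so it matters which of the two the peer's log is actually reporting.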

Now, in the CP debug, what ID do you see being sent towards the PAN?

You can easily see it in the ike.xmll file for IKEv2 as the ID ("IPV4_ADDR" or another type, and what value exactly).

If you really see 0.0.0.0 sent as the ID (not the traffic selector), can you check what is selected in the Link Selection tab of your CP GW?

the_rock · ‎2024-07-10

Hey Amir,

Thanks a lot for responding, appreciated! So, to clarify, we do not have any PAN debugs; that's all from the CP end. Sadly, the guy on the other end who has access to the PAN firewall is not a PAN TAC person, and I'm not sure a case with them was ever opened. In all honesty, I'm sure a ticket is not even needed, because we know 100% that when IKEv2 is selected with high enc. methods, the tunnel never works, as they see CP sending ID 0.0.0.0, but as soon as it's changed to IKEv1 and low enc. methods, it works fine and the ID sent from the CP firewall is NOT showing as 0.0.0.0 on the PAN end.

Now, why that is, I have no clue in the world. That's why I was thinking, would it make sense to try the "per gateway" option and leave it as non-permanent? I feel like that's the only thing we have not tried...

Andy

AmirArama · ‎2024-07-10

Tunnel mgmt has nothing to do with the ID sent.

Is your CP GW configured as DAIP? What is configured in this GW object's Link Selection tab?

What is the GW version?

You can also change the IKEv2 ID to FQDN. Look for the sk.

the_rock · ‎2024-07-10

Sorry, I thought I put the screenshot, but maybe not. Link selection is configured as per below (the IP of the local CP is listed there). No, it's NOT DAIP.

Andy

AmirArama · ‎2024-07-10

Ok.

So I would check the ikev2.xmll file on the GW while 'vpn debug trunc ALL=5' is on and you are using IKEv2.

Look under the relevant peer, under authentication.

The type should usually be IPV4_ADDR (where the screenshot says KEY_ID).

Then in the data, see the actual ID sent, and whether it's indeed 0.0.0.0 or not.

If it is 0.0.0.0, you have something to show TAC to investigate, or just consider changing to FQDN as mentioned.

If it's another valid value, then it means you don't send 0.0.0.0 as the ID, and I would ask the PAN side for proof of an IKE packet coming with 0.0

the_rock · ‎2024-07-10

Thanks a lot for that! Sadly, I don't see that file in the debugs TAC attached to the case, so I updated the case to have them verify whether the file is on the sftp.

I will review other debug files to see if I can get this info.

Andy

VPN Check Point - Palo Alto issue