Jamie_Kelahan
Participant

One cluster not establishing s2s tunnels - Invalid IKE SPI

Hi all,

I've got a star mesh community with 11 center gateway clusters and 5 satellite gateways.  A mix of 5000 and 3000 series (R81.20) and some Quantum Sparks (R81.10).  One of my center gateway clusters (3000) will not establish a tunnel with other gateways, with the exception of one of the Quantum Sparks.  I'm also using Harmony VPN and that tunnel is active.

On the problem cluster in SmartView Monitor, it shows a "Down" state to most other gateways, but will show "Up - Phase 1"  at times.  Looking at my other gateways/clusters to the problem cluster, it's similar - some show "Up - Phase 1" and others show "Down."

In the logs, outgoing connection attempts from the problem cluster are rejected with the message, "Informational exchange: Sending notification to peer: Invalid IKE SPI IKE SPIs: 20cb86c6725e2650:e095fab9ae48e34d."  Incoming attempts from other gateways/clusters are rejected with the message, "Child SA exchange: Exchange failed: timeout reached."
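
When comparing logs from both ends of the tunnel, it can help to pull the SPI pair out of the message so it can be searched for on the peer. Here is a minimal sketch, assuming the message keeps the "IKE SPIs: <16 hex digits>:<16 hex digits>" shape quoted above. The function name extract_ike_spis is made up for illustration. In an IKE header the initiator SPI precedes the responder SPI, but whether this log prints them in that order is an assumption, so the output only labels them first and second.

```shell
#!/bin/sh
# Hypothetical helper (not a Check Point tool): print the two SPIs found in
# a log message of the form "... IKE SPIs: <16 hex>:<16 hex>".
extract_ike_spis() {
    # sed prints a line only when the pattern matches; other input yields nothing.
    printf '%s\n' "$1" |
        sed -n 's/.*IKE SPIs: \([0-9a-fA-F]\{16\}\):\([0-9a-fA-F]\{16\}\).*/first=\1 second=\2/p'
}

# The message text is the one quoted in the post above.
msg='Informational exchange: Sending notification to peer: Invalid IKE SPI IKE SPIs: 20cb86c6725e2650:e095fab9ae48e34d.'
extract_ike_spis "$msg"
```

Either value can then be searched for in the logs of the peer gateway to check whether both sides are referring to the same IKE SA.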

I'm also seeing some drops on the VPN blade, with the active member of the problem cluster as the Origin.  There is little information - the destination is the problem cluster, but there is no source, service, etc.  The VPN Peer Gateway is one of my other gateways/clusters.

I've tried resetting the tunnel, rebooting the problem gateways, other gateways, pushing policy, updating to the latest Take (118), deleting SAs via "vpn tu" and even removing the problem cluster from the VPN community and adding it back, but nothing changes.  And it's driving me crazy that one tunnel gets established without an issue...

I should note that I don't know when this started.  We have SD-WAN appliances at most of these sites, including the problem site, and traffic is routed through those as a primary, with the CP tunnels as backup.  So nobody would really notice if the tunnel is down.

My next step is to open a ticket, but thought I'd ask here first.  Thanks all.

7 Replies
AmirArama
Employee

I would start with a traffic capture and a drop debug on both sides:

1. tcpdump -nnei ethX host x.x.x.x
(Replace ethX with the name of the outgoing interface facing the peer, and x.x.x.x with the peer IP that the tunnel is negotiated with.)
*you can save it to a file by adding: -w /var/log/tcpdump.pcap    to the end of the command

2. fw monitor -F "0,0,<peerip>,0,0" -F "<peerip>,0,0,0,0"
(replace peerip with actual peer IP address)
*you can save it to a file by adding: -o /var/log/fwmon.pcap     at the end of the command

3. fw ctl zdebug + drop 

*you can save it to a file by adding:  >> zdebugdrop.txt    at the end of the command
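
The three steps can be collected into a small dry-run helper. This is only a sketch: it prints the commands with the interface and peer IP filled in and does not run anything. The function name print_capture_cmds and the IFACE/PEER variables are made up for illustration, and the output file names are the ones suggested above.

```shell
#!/bin/sh
# Placeholders: set these to the real outgoing interface and peer IP.
IFACE="ethX"
PEER="x.x.x.x"

# Print (do not run) the three commands from the steps above, each with its
# suggested output file, so they can be reviewed before use.
print_capture_cmds() {
    iface=$1
    peer=$2
    printf 'tcpdump -nnei %s host %s -w /var/log/tcpdump.pcap\n' "$iface" "$peer"
    printf 'fw monitor -F "0,0,%s,0,0" -F "%s,0,0,0,0" -o /var/log/fwmon.pcap\n' "$peer" "$peer"
    printf 'fw ctl zdebug + drop >> zdebugdrop.txt\n'
}

print_capture_cmds "$IFACE" "$PEER"
```

Each command keeps running until it is interrupted, so in practice each one goes in its own session on the gateway while the tunnel is being negotiated.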

See whether all IKE packets reach the other side in both directions, and that no IKE packets are being dropped.

If it's a cluster, make sure the traffic is NATed properly from the physical IP to the VIP and vice versa (in fw monitor), and that no NAT is applied to the source IP/port (except for the NAT of the source IP to the VIP in a cluster).

If all packets are reaching both sides properly, enable VPN debug and open the output with IkeView, or let TAC handle it.

BTW, Check Point already has an SD-WAN product, so you may consider unifying solutions if that suits your needs.

Jamie_Kelahan
Participant

Thanks for the suggestions.  I haven't had much time to get into this today, but I did some packet captures.  tcpdump appears to be fine as I see traffic and acks.  fw monitor capture has a lot of malformed packets.  I compared to another fw monitor capture between the broken site and the one it actually has a tunnel established to and don't see those malformed packet messages, so I don't know what that means.

zdebug shows "dropped by fwhold_expires Reason: held chain expired" errors on the broken side and "dropped by vpn_drop_and_log Reason: Failure preparing tunnel creation, internal error" on the peer side.
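
If the zdebug output was saved to a file, a quick tally of the drop reasons makes it easier to see which one dominates on each side. Here is a rough sketch, assuming each drop line contains the "dropped by <function> Reason: <text>" pattern seen in the two messages above. The function name tally_drop_reasons is made up for illustration.

```shell
#!/bin/sh
# Count how often each "dropped by <function> Reason: <text>" pair occurs.
# Reads zdebug output on stdin; lines without that pattern are ignored.
tally_drop_reasons() {
    sed -n 's/.*dropped by \([^ ]*\) Reason: \(.*\)$/\1: \2/p' |
        sort | uniq -c | sort -rn
}

# Example (file name as suggested earlier in the thread):
#   tally_drop_reasons < zdebugdrop.txt
```

Running it against the capture from each gateway gives one line per distinct reason, most frequent first.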

Again, I didn't have a lot of time to spend on it today, but will continue digging.

Jamie_Kelahan
Participant

And I'm not sure what you mean by it being NATTED properly in fw monitor?  They are both clusters and I see traffic to/from the public VIP addresses.  Is there something more I should be looking for?

 

the_rock
MVP Gold

You should see all 4 inspection points, and if you do NOT see the big O for post-outbound, I believe that would imply the traffic is encrypted. The capture should also show you whether it's NATed. Feel free to refer to this great site my colleague made a while ago.

https://tcpdump101.com

Not sure why the link takes me to tvpdump101, but it's tcpdump101.com

Best,
Andy
the_rock
MVP Gold

I second what @AmirArama had suggested.

Best,
Andy
the_rock
MVP Gold

Can you see any logs containing "key install"? That may give us a clue as to when the issue started.

Best,
Andy
the_rock
MVP Gold

Hey mate,

Please let us know how this gets resolved.

Best,
Andy