Hi all,
I've got a star mesh community with 11 center gateway clusters and 5 satellite gateways. The gateways are a mix of 5000 and 3000 series appliances (R81.20) and some Quantum Sparks (R81.10). One of my center gateway clusters (a 3000) will not establish a tunnel with any other gateway except one of the Quantum Sparks. I'm also using Harmony VPN, and that tunnel is active.
In SmartView Monitor, the problem cluster shows a "Down" state to most other gateways, but at times it shows "Up - Phase 1." The view from my other gateways/clusters toward the problem cluster is similar: some show "Up - Phase 1" and others show "Down."
In the logs, outgoing connection attempts from the problem cluster are rejected with the message "Informational exchange: Sending notification to peer: Invalid IKE SPI IKE SPIs: 20cb86c6725e2650:e095fab9ae48e34d." Incoming attempts from the other gateways/clusters are rejected with the message "Child SA exchange: Exchange failed: timeout reached."
I'm also seeing some drops on the VPN blade, with the active member of the problem cluster as the Origin. There is little information - the destination is the problem cluster, but there is no source, service, etc. The VPN Peer Gateway is one of my other gateways/clusters.
I've tried resetting the tunnel, rebooting the problem gateways and the other gateways, pushing policy, updating to the latest Take (118), deleting SAs via "vpn tu," and even removing the problem cluster from the VPN community and adding it back, but nothing changes. And it's driving me crazy that one tunnel gets established without an issue...
I should note that I don't know when this started. We have SD-WAN appliances at most of these sites, including the problem site, and traffic is routed through those as the primary path, with the CP tunnels as backup, so nobody would really notice if a tunnel were down.
My next step is to open a ticket, but I thought I'd ask here first. Thanks all.