David_C1
Advisor

Site to Site VPN issue during cluster failover

Everyone:
We have been experiencing brief S2S VPN outages after a cluster failover. The impact is mostly seen on the application side; I don't have many firewall logs that hint at what is going on. Background information:

Site to site VPN between two centrally managed HA clusters. Both clusters are running R81.20 with JHFA Take 89. The VPN community is meshed, permanent tunnels are set, tunnel sharing is set to "One VPN tunnel per subnet pair." "IKEv2 only" is the encryption method.
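
Side note: while reproducing this, I watch tunnel state from the CLI with the non-interactive tunnel list; my assumption is the R81.x syntax:

vpn tu tlist     # lists tunnels and their state per peer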

In the firewall logs, I see a couple of IKE failures with the message "Child SA exchange: Ended with error." I get this log a full minute after the failover. A few seconds later I receive another IKE failure message: "Child SA exchange: Exchange failed: timeout reached."

I've gone down a number of rabbit holes on this one. One thing I found: in the community settings, "Maximum concurrent IKE negotiations" is set to 200 on one cluster and 1000 on the other. Another item: on each cluster member, the vpn_queues table is empty:

vpn_queues.jpg
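
For reference, this is roughly how I pulled the table (standard fw tab usage, flags from memory):

fw tab -t vpn_queues -s     # summary; the #VALS column shows the entry count
fw tab -t vpn_queues -u     # full dump of the table, no entry limit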

And if it helps, here is the output showing some SA information on one cluster member:

cpstat.jpg
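
That output is from cpstat; if I recall correctly, something along these lines:

cpstat vpn -f all     # SA counts and other VPN counters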

I'm pretty much at a loss as to how to attack this. This is a highly critical environment, and nobody likes it when connections are dropped. Any direction from the community would be appreciated.

Dave

the_rock
Legend

Every time I had that issue with a customer, we fixed it with the setting I attached. I know it's a policy-install setting, but it also worked for cluster failover. Honestly, I never really inquired with TAC for an explanation, so I can't say why that would fix it 🙂

Andy

D_TK
Advisor

Yeah, I've experienced the same issue for years across different versions. I don't know what causes it, but we've "built in" a vpn tu/0 into all manual cluster failovers (see the sketch below). If you run it immediately on the newly active member, you shouldn't suffer any VPN outage.
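
For anyone unfamiliar with the shorthand: that means running the interactive tunnel utility and choosing the option that deletes all SAs so they renegotiate cleanly. On our boxes the relevant menu entry looks like this (numbering can differ by version):

vpn tu
# then select:
# (0)  Delete all IPsec+IKE SAs for ALL peers and users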

AaronCP
Advisor

Hi Dave,

Have you tried disabling VPN acceleration? I had a similar issue at a previous company, albeit they were running R80.40 at the time.

With VPN acceleration disabled (vpnaccel off), the tunnel stayed up across a failover; after re-enabling it (vpnaccel on), the tunnel broke on failover again. I can't say for sure if it's relevant to your scenario, but it may be worth a try.
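
The test sequence was essentially this (commands as I remember them on that version):

vpnaccel off    # disable VPN acceleration - tunnel then survived failover
# ...perform a controlled failover and watch the tunnel...
vpnaccel on     # re-enable acceleration - tunnel broke on failover again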

If that doesn't work, I'd collect VPN debugs on both cluster members to see if they provide more clues as to what's happening; the usual sequence is sketched below.
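
A minimal sketch of that, assuming the standard vpn debug workflow (double-check the exact switches on your take):

vpn debug trunc     # truncate logs and enable vpnd + IKE debugging
# ...reproduce the failover...
vpn debug off       # stop vpnd debugging
vpn debug ikeoff    # stop IKE debugging
# then review $FWDIR/log/vpnd.elg and, for IKEv2, $FWDIR/log/ikev2.xmll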

the_rock
Legend

Excellent point @AaronCP 

David_C1
Advisor

Thank you everyone for these suggestions, and I will consider them all. However, it feels as if these are all ad-hoc fixes to cover up some underlying issue - certainly tunnels/traffic should not be dropped during a cluster failover. I will continue to investigate and likely engage with TAC.

Dave

the_rock
Legend

I agree 100%; it makes no sense to me either. I can tell you I've worked with more customers running ClusterXL with VPN tunnels than I can count, and we NEVER had this sort of problem. Here is what I'm curious about. Let's call your firewalls fw1 and fw2.

Scenarios/questions:

Scenario 1: say fw1 is master and fails over to fw2 -> the issue happens.

Scenario 2: if you leave fw2 as master, no issues?

Scenario 3: if you fail over from fw2 back to fw1, does the problem happen again?

Andy
David_C1
Advisor

Scenario 2: if we leave fw2 as master, everything operates as it should (once tunnels re-establish).

Scenario 3: if we fail back from fw2 to fw1, yes, the problem re-occurs. It happens every time we fail over.

I suspect it may have to do with either a) the fact that we have "Accept Control Connections" disabled in Global Properties, plus the manual rules we had to put in place to provide the same functionality, or b) the link selection mechanism we have in place, since there are multiple routes out of our firewall to the other end of the tunnel.
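
For anyone following along: from memory, the control connections a permanent IKEv2 tunnel typically depends on (which the implied rules would normally cover, and which manual rules have to replicate) are roughly:

IKE           UDP 500
NAT-T         UDP 4500       # only if NAT traversal is in play
ESP           IP protocol 50
tunnel_test   UDP 18234      # keepalives for permanent tunnels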

Dave

0 Kudos
the_rock
Legend

Yes and yes, to both those arguments. Clearly, based on what you said, it sounds to me - logically, anyway - like something causes fw1 not to function properly when this happens. Would you mind sending the outputs of the commands below from both members? Please blur out any sensitive data.

Andy

Commands:

cphaprob state         # member states; which member is active
cphaprob -a if         # monitored cluster interfaces and VIPs
cphaprob -i list       # critical devices (pnotes) being reported
cphaprob -l list       # full list of registered critical devices
cphaprob syncstat      # sync transport statistics
cphaprob roles         # member roles

David_C1
Advisor

Andy,

Appreciate the offer to review the output of those commands, but I have done so and don't see anything unusual. Also, I forgot to add this: I see the same behavior if we fail over the cluster on the other end of this tunnel, so it is not something unique to this one cluster. We also have a "dev" version of this S2S VPN with two other sets of clusters - the same thing happens when we fail those over. That's why I suspect something more universal, like the control connections setting and/or the link selection process.

Dave

the_rock
Legend

That's totally fair, and I agree with your assessment. Personally, I would re-enable that option to accept control connections (Global Properties > FireWall > "Accept control connections") in the next maintenance window and test.

Andy

the_rock
Legend

Hey Dave,

I did a little more research on this while on a long call and found the below. Not sure if it's overly relevant in your case, but something to consider...

Andy

Disabling IKED:

  • Issue: The VPN tunnel is handled by IKED instead of VPND, causing the source port to be 30500.
  • Fix: Disable IKED if the topology cannot be corrected.
    • Temporary Change (does not survive reboot):
      fw ctl set int ike_in_separate_daemon 0
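    • Permanent Change (my assumption, based on the standard fwkern.conf mechanism - verify with TAC before relying on it):
      add ike_in_separate_daemon=0 to $FWDIR/boot/modules/fwkern.conf on each member, then reboot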
