jennyado
Collaborator

SMB 1550 (R81.10.17) VPN S2S Instability with Azure HA Cluster

Hi everyone,

I’d like to share a case we’ve been troubleshooting for about a month, in case anyone has run into something similar or has ideas about what might be going on.

We have an SMB 1550 running R81.10.17 (build 996004721) establishing a Site-to-Site VPN to an HA Cluster hosted in Azure. The remote site has been reporting intermittent VPN failures, and the only way to restore connectivity is to reboot the SMB appliance. Initially, the issue occurred every 3–4 days; after updating to the current build (because resolved issue SMBGWY-12630 looked directly relevant to VPN stability), the frequency dropped to about once per week.


Observed Behavior During the Incident

On the Azure Cluster side:

  1. We notice the problem because the SMB stops receiving traffic that normally comes through the VPN.

  2. Running vpn tu tlist -p <SMB DAIP> shows that every 1–5 minutes the cluster attempts to establish a new tunnel toward the SMB. Tunnel info changes as if a fresh negotiation occurred.

  3. A full vpn tu shows a long list of SAs that appear healthy.

  4. Comparing both outputs, we confirmed that each new entry seen in vpn tu tlist corresponds to a newly created SA.

  5. Packet capture (tcpdump) reveals:

    • From the cluster’s private VIP to the SMB’s public IP, we do see encrypted ESP traffic.

    • From the SMB public IP to the cluster’s private VIP, instead of ESP we see what look like renegotiation attempts (IKE/NAT-T).
      Here is an excerpt from the capture (a tcpdump filter sketch for reproducing this view follows the listing):
      20:37:02.532607 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x0ab11aaf,seq=0x37), length 100
      20:37:02.532614 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x0ab11aaf,seq=0x37), length 100
      20:37:03.680310 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x0ab11aaf,seq=0x38), length 132
      20:37:03.680314 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x0ab11aaf,seq=0x38), length 132
      20:37:04.727356 IP [Public IP SMB].500 > [VIP private cluster].500: isakmp: parent_sa ikev2_init[I]
      20:37:04.727356 IP [Public IP SMB].500 > [VIP private cluster].500: isakmp: parent_sa ikev2_init[I]
      20:37:04.734975 IP [VIP private cluster].500 > [Public IP SMB].500: isakmp: parent_sa ikev2_init[R]
      20:37:04.734978 IP [VIP private cluster].500 > [Public IP SMB].500: isakmp: parent_sa ikev2_init[R]
      20:37:04.803387 IP [Public IP SMB].4500 > [VIP private cluster].4500: NONESP-encap: isakmp: child_sa ikev2_auth[I]
      20:37:04.803387 IP [Public IP SMB].4500 > [VIP private cluster].4500: NONESP-encap: isakmp: child_sa ikev2_auth[I]
      20:37:04.812500 IP [VIP private cluster].4500 > [Public IP SMB].4500: NONESP-encap: isakmp: child_sa ikev2_auth[R]
      20:37:04.812505 IP [VIP private cluster].4500 > [Public IP SMB].4500: NONESP-encap: isakmp: child_sa ikev2_auth[R]
      20:37:04.997259 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x0ab11aaf,seq=0x39), length 132
      20:37:04.997263 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x0ab11aaf,seq=0x39), length 132
      20:37:05.016337 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x244c5ab3,seq=0x1), length 100
      20:37:05.016340 IP [VIP private cluster].4500 > [Public IP SMB].4500: UDP-encap: ESP(spi=0x244c5ab3,seq=0x1), length 100
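
For reference, a minimal tcpdump sketch for reproducing this view on the SMB side next time the loop starts (the interface name and output path are examples; adjust them to the appliance's WAN interface and a writable location):

  # Capture IKE (500), NAT-T (4500) and raw ESP (IP protocol 50) on the WAN interface
  tcpdump -nni WAN -w /storage/vpn-loop.cap 'udp port 500 or udp port 4500 or ip proto 50'

  # Quick on-screen check: outbound IKE on 500/4500 with no outbound ESP
  # from the SMB means the renegotiation loop is active
  tcpdump -nni WAN 'udp port 500 or udp port 4500 or ip proto 50'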

Once the client reboots the SMB, everything works again until the next occurrence.


TAC Interaction So Far

TAC requested two rounds of VPN debugs:

  • The first attempt didn’t capture any information on the Azure cluster side.

  • The second attempt failed because the debugs weren’t started simultaneously on both ends.

We’re now waiting for the next incident so we can run both debugs in sync and also collect a traffic capture on the Azure side to confirm whether the cluster’s public IP is receiving packets from the SMB's public IP at the time of failure.
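
For the next attempt, this is roughly the sequence to run on both gateways as close to simultaneously as possible (a minimal sketch of the standard vpn debug commands; TAC's exact instructions and the log locations on the SMB appliance may differ):

  # Run on BOTH ends at the same time
  vpn debug trunc          # truncate and start vpnd / IKE debug logging

  # ... wait for or reproduce the renegotiation loop ...

  vpn debug ikeoff         # stop IKE debug
  vpn debug off            # stop vpnd debug

  # Collect ike.elg* and vpnd.elg* (under $FWDIR/log on Gaia) from both sides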

We’re still checking whether, given the architecture, it’s possible to run this validation directly from an Azure resource.
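
One possibility (not yet confirmed for this architecture) is an Azure Network Watcher packet capture on the cluster member, filtered to the SMB's public IP. A sketch with placeholder names, assuming the Network Watcher VM extension can run on the CloudGuard members and a storage account is available for the capture file:

  az network watcher packet-capture create \
      --resource-group <rg> \
      --vm <cluster-member-vm> \
      --name smb-vpn-capture \
      --storage-account <storage-account> \
      --filters '[{"protocol":"UDP","remoteIPAddress":"<SMB public IP>"}]'

  # Stop it once the incident has been observed, then pull the .cap from the storage account
  az network watcher packet-capture stop --location <region> --name smb-vpn-capture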


Additional Context

  • This issue started before upgrading the SMB.

  • The upgrade was applied because SMBGWY-12630 sounded highly relevant to VPN stability.

  • After upgrading, the frequency changed, which suggests the behavior is at least partially software-related.

  • The Azure cluster keeps sending encrypted traffic outward, but the SMB seems to fall into a renegotiation loop (a quick way to confirm the SPI churn is sketched right after this list).
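
A minimal sketch of that check, using the same vpn tu tlist command on the cluster during an incident (exact output format varies by version):

  # Snapshot the tunnel list toward the SMB twice, a few minutes apart
  vpn tu tlist -p <SMB DAIP> > /tmp/tlist_1.txt
  sleep 300
  vpn tu tlist -p <SMB DAIP> > /tmp/tlist_2.txt

  # New SPIs/tunnel entries in the diff every few minutes = fresh negotiations,
  # even though traffic toward the SMB never resumes
  diff /tmp/tlist_1.txt /tmp/tlist_2.txt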

Any hints or shared experiences are appreciated — this has all the signs of one of those “it only fixes itself when you reboot it” demons.

Thanks in advance.

1 Reply
PhoneBoy
Admin

Sounds like there are other issues involved above and beyond what SMBGWY-12630 fixed.
