PetterD
Collaborator

Cloudguard R82 Site2Site intermittent failures after upgrade to R82.

I have a strange VPN issue with Cloudguard R82.


Environment:

Azure: Cloudguard R82 T60 + time fix (also tried T91), single gateway, 2 cores.
HQ: 6000 Appliance with R81.20 + SmartCenter R82 T91 (includes the time fix)


Setup/Changes

The VPN Community between the Azure Cloudguard and the HQ gateway has been running fine for years.
Yesterday we upgraded the SmartCenter to R82 and deployed a new Cloudguard FW on R82.

Both gateways are managed by the same SmartCenter.

 

Issue:

After installing the R82 Cloudguard and establishing SIC + license and a policy push,
the IPsec VPN would just not work. The Cloudguard R82 FW was sending "port unreachable" messages back to the HQ FW.
I did a cpstop;cpstart on the Cloudguard FW, after which the VPN was established, but only for some networks.

At HQ we have a list of /24 networks only. On the Cloudguard we have one /16 network in the encryption domain.
However, the tunnel was only established for various /30, /32, /28 and /29 networks (supernetting in reverse).

Changed to "One VPN tunnel per gateway" on Community which seemed to work fine.
Then after a few hours, vpn stopped working again.. SA`s were up but "vpn tu tlist" showed tunnel as down.

A bunch of "port unreachable again" from Azure FW. Tried a "vpn tu" to reset tunnel with no change..
Did another cpstop;cpstart and it came up.. worked for 7-8 hours and it was down again.. for 45 mins until it suddently worked.
(We do have some reports that in these 7-8 hours there were several periods of 1,5,10-30 minutes of packetloss aswell)
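
For anyone troubleshooting the same symptom: a capture on the HQ side while the tunnel is down should show both the NAT-T traffic and the unreachables coming back from the peer (just a sketch; assumes eth0 is the external interface and AZUREFWIP is the Azure peer's public IP):

# Watch IKE/NAT-T and any ICMP unreachables to/from the Azure peer
tcpdump -nni eth0 "host AZUREFWIP and (udp port 500 or udp port 4500 or icmp)"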


This is the output of "vpn tu tlist" when the issue is present; it looks the same on both sides.
Then, after 1, 5, 10 or 30 minutes, it is connected again.


+-----------------------------------------+----------------------------------+---------------------+
| Peer: IP-IN-AZURE - FWAzure | MSA: 7fe6e4631258 | i: 0 ref: 15 |
| Methods: ESP Tunnel AES-GCM-256 | | i: 1 ref: 15 |
| My TS: 0.0.0.0/0 | | i: 2 ref: 19 |
| Peer TS: 0.0.0.0/0 | | |
| MSPI: 1000298 (i: 2, p: 0, d: 0) | No outbound SPI | |
| Tunnel created: | NAT-T | |
| Tunnel expiration: | Disconnected | |
+-----------------------------------------+----------------------------------+---------------------+


We already have a TAC case and have done several remote sessions, and are currently waiting for the issue to occur again to gather even more debugs. The only change was R82 Management + the R82 Cloudguard. The HQ FW has several other tunnels working just fine.

 

Has anyone else experienced something like this?
The suspect here is definitely the R82 Cloudguard.

CCSM / CCSE / CCVS / CCTE
Chris_Atkinson
MVP Platinum CHKP

Check if you happen to see anything aligned to sk184507?

CCSM R77/R80/ELITE
PetterD
Collaborator

Hi,

Thanks for the tip!

I did check sk184507 but found no core dumps in /var/log/dump/usermode and no error messages when running an IKE debug.
I also checked the iked process, and it looks like its last restart was yesterday during the latest cpstop;cpstart 😕

###########
[Expert@fwcpr82:0]# ps x|grep iked
4801 ?        SLl    1:15 iked 0
32486 pts/1   S+     0:00 grep --color=auto iked
[Expert@fwcpr82:0]# ps -p 4801 -o lstart=
Thu Apr 9 21:12:15 2026
[Expert@fwcpr82:0]#
###########
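
As an extra check, the Check Point watchdog keeps restart counters, so something along these lines should show whether iked has been bounced, assuming iked is registered with cpwd on this build (a sketch; the exact column layout varies between versions):

# Look for the iked entry and its #START / START_TIME columns
cpwd_admin list | grep -iE "APP|iked"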

Extremely frustrating with these intermittent failures; we never know when the tunnel goes down or for how long, so it's pretty hard to investigate 😕

CCSM / CCSE / CCVS / CCTE
PetterD
Collaborator

During the occurrences (which happen infrequently, lasting anywhere from 1 to 30 minutes) we are observing packet drops on the Azure FW on NAT-TRAVERSAL packets from the on-prem FW:

[Expert@AZUREFW:0]# fw ctl zdebug + drop |grep ONPREMISEFWIP
@;73289377.32700;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289387.32736;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289623.33069;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289640.33114;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289696.33197;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289703.33207;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;


I'm seeing this drop in other places related to NAT as well. The Azure firewall does not actually have the public IP on the gateway in its topology; the public IP is manually defined under "statically NATed" on the object. This has been working fine for years, until R82.
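
A quick way to see whether the gateway's own NAT-T packets are leaving with a translated source port is to capture them on the external interface while the drops are happening (a sketch; assumes eth0 is the Azure FW's external interface and ONPREMISEFWIP is the on-prem peer):

# If hide NAT is catching the gateway's own traffic, the source port here will be
# a high ephemeral port instead of 4500:
tcpdump -nni eth0 "udp and dst host ONPREMISEFWIP and dst port 4500"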

CCSM / CCSE / CCVS / CCTE
PetterD
Collaborator

Looks like we may finally have found the root cause of the issue: sk163835.

After running "fw ctl zdebug + conn drop vm link nat xlate xltrc", TAC spot`ed that the NAT-TRAVERSAL packets from the Azure gateway (line 5, bold) were having the sourceport nated..


@;198991747.74218;[vs_0];[tid_0];[fw4_0];fwx_get_xlbuf: SRV xlation buffer found for request: vmside=1, cli->srv(1);
@;198991747.74219;[vs_0];[tid_0];[fw4_0];fwx_get_xldata: got (172.18.254.6,35d9,0.0.0.0,0 : 0) flags = 220, cli->serv (1);
@;198991747.74220;[vs_0];[tid_0];[fw4_0];fw_xlate_packet: connection <dir 1, 172.18.254.6:4500 -> ONPREMISEGATEWAYIP:4500 IPP 17>, OUTBOUND(1);
@;198991747.74221;[vs_0];[tid_0];[fw4_0];fw_xlate: changing <dir 1, 172.18.254.6:4500 -> ONPREMISEGATEWAYIP:4500 IPP 17> to <dir 0, 172.18.254.6:13785 -> ONPREMISEGATEWAYIP:4500 IPP 17>;
@;198991747.74222;[vs_0];[tid_0];[fw4_0];After POST VM: <dir 1, 172.18.254.6:13785 -> ONPREMISEGATEWAYIP:4500 IPP 17> (len=204) ;
@;198991747.74223;[vs_0];[tid_0];[fw4_0];POST VM Final action=ACCEPT;


The Azure VNET is 172.18.0.0/16 and the FrontendSubnet is part of this VNET.
The last hide-NAT rule in the rulebase was hiding 172.18.0.0/16 behind the gateway. Since the gateway's own frontend IP sits inside that /16, its own outgoing NAT-T traffic matched the hide rule and left with a translated source port, which the peer then answered with port unreachable.


This took a long time to find: it would work fine for up to 7 hours at most, and had been running fine for years on R81.x, but after deploying R82 it suddenly started getting messy. So definitely a configuration issue, but it took some time to find due to the sudden, intermittent failures on R82.

Implemented the following NAT rules to test, and it has now been running stable for 9 hours with no ICMP unreachable messages in the logs, so fingers crossed this solves it permanently 🙂

 
 

(attached screenshot: signal-2026-04-12-062112_004.png)
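
For readers who can't open the screenshot: the added rules are manual no-NAT rules (all Translated columns set to Original) for IKE and NAT-T between the two gateway objects, placed above the existing 172.18.0.0/16 hide rule. Roughly along these lines (a sketch based on the description above; HQ-GW and Net_172.18.0.0/16 are placeholder object names, and the services are the built-in UDP/500 and UDP/4500 IKE/NAT-T services):

No. | Original Source    | Original Dest.     | Original Service                | Translated
 1  | FWAzure            | HQ-GW              | IKE (UDP/500), NAT-T (UDP/4500) | Original / Original / Original
 2  | HQ-GW              | FWAzure            | IKE (UDP/500), NAT-T (UDP/4500) | Original / Original / Original
... | (existing rules)   |                    |                                 |
 N  | Net_172.18.0.0/16  | Any                | Any                             | Hide behind Gateway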

 

CCSM / CCSE / CCVS / CCTE
CarlosCP
Employee

Please keep us posted - running into similar issues with a CloudGuard gateway.

PetterD
Collaborator

Hi,

See my last post: the issue was resolved by creating no-NAT rules for IKE/NAT-T from the gateway's WAN IP in the frontend subnet!

Having a hide NAT for the whole VNET that includes the frontend subnet (and therefore also the Cloudguard itself) was a tripwire that started causing issues after the upgrade. Took some time (and pain) to figure out ;)

CCSM / CCSE / CCVS / CCTE
