PetterD
Collaborator

Cloudguard R82 Site2Site intermittent failures after upgrade to R82.

I have a strange VPN issue with Cloudguard R82.


Environment:

Azure: Cloudguard R82 T60 + time fix (also tried T91), single gateway, 2 cores.
HQ: 6000 Appliance with R81.20 + SmartCenter R82 T91 (includes the time fix)


Setup/Changes

The VPN Community between the Azure Cloudguard and the HQ gateway has been running fine for years.
Yesterday we upgraded the SmartCenter to R82 and deployed a new Cloudguard FW on R82.

Both gateways are managed by the same SmartCenter.

 

Issue:

After installing the R82 Cloudguard and establishing SIC + license and a policy push,
the IPsec VPN would just not work. The Cloudguard R82 FW was sending "port unreachable" messages back to the HQ FW.
I did a cpstop;cpstart on the Cloudguard FW, after which the VPN was established, but only for some networks.

At HQ we have a list of /24 networks only. On the Cloudguard we have one /16 network in the encryption domain.
However, the tunnel was only established for various /30, /32, /28 and /29 networks (supernetting in reverse).

Changed to "One VPN tunnel per gateway" on Community which seemed to work fine.
Then after a few hours, vpn stopped working again.. SA`s were up but "vpn tu tlist" showed tunnel as down.

A bunch of "port unreachable again" from Azure FW. Tried a "vpn tu" to reset tunnel with no change..
Did another cpstop;cpstart and it came up.. worked for 7-8 hours and it was down again.. for 45 mins until it suddently worked.
(We do have some reports that in these 7-8 hours there were several periods of 1,5,10-30 minutes of packetloss aswell)
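
For anyone troubleshooting the same symptom: a capture on the HQ side while the tunnel is down should show both the NAT-T traffic and the unreachables coming back from the peer (just a sketch; assumes eth0 is the external interface and AZUREFWIP is the Azure peer's public IP):

# Watch IKE/NAT-T and any ICMP unreachables to/from the Azure peer
tcpdump -nni eth0 "host AZUREFWIP and (udp port 500 or udp port 4500 or icmp)"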


This is the output of "vpn tu tlist" when the issue is present; it looks the same on both sides.
Then, after 1, 5, 10 or 30 minutes, it is connected again.


+-----------------------------------------+----------------------------------+---------------------+
| Peer: IP-IN-AZURE - FWAzure | MSA: 7fe6e4631258 | i: 0 ref: 15 |
| Methods: ESP Tunnel AES-GCM-256 | | i: 1 ref: 15 |
| My TS: 0.0.0.0/0 | | i: 2 ref: 19 |
| Peer TS: 0.0.0.0/0 | | |
| MSPI: 1000298 (i: 2, p: 0, d: 0) | No outbound SPI | |
| Tunnel created: | NAT-T | |
| Tunnel expiration: | Disconnected | |
+-----------------------------------------+----------------------------------+---------------------+


We already have a TAC case and have done several remote sessions, and are currently waiting for the issue to occur again to gather even more debugs. The only change was R82 Management + the R82 Cloudguard. The HQ FW has several other tunnels working just fine.

 

Has anyone else experienced something like this?
The suspect here is definitely the R82 Cloudguard.

CCSM / CCSE / CCVS / CCTE
Chris_Atkinson
MVP Platinum CHKP

Check if you happen to see anything aligned to sk184507?

CCSM R77/R80/ELITE
PetterD
Collaborator

Hi,

Thanks for the tip!

I did check sk184507 but found no core dumps in /var/log/dump/usermode and no error messages when running an IKE debug.
I also checked the iked process, and it looks like its last restart was yesterday during the latest cpstop;cpstart 😕

###########
[Expert@fwcpr82:0]# ps x|grep iked
4801 ?        SLl    1:15 iked 0
32486 pts/1   S+     0:00 grep --color=auto iked
[Expert@fwcpr82:0]# ps -p 4801 -o lstart=
Thu Apr 9 21:12:15 2026
[Expert@fwcpr82:0]#
###########
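
As an extra check, the Check Point watchdog keeps restart counters, so something along these lines should show whether iked has been bounced, assuming iked is registered with cpwd on this build (a sketch; the exact column layout varies between versions):

# Look for the iked entry and its #START / START_TIME columns
cpwd_admin list | grep -iE "APP|iked"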

Extremely frustrating with these intermittent failures; we never know when the tunnel goes down or for how long, so it's pretty hard to investigate 😕

CCSM / CCSE / CCVS / CCTE
PetterD
Collaborator

During the occurrences (which happen infrequently, lasting anywhere from 1 to 30 minutes) we are observing packet drops on the Azure FW on NAT-TRAVERSAL packets from the on-prem FW:

[Expert@AZUREFW:0]# fw ctl zdebug + drop |grep ONPREMISEFWIP
@;73289377.32700;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289387.32736;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289623.33069;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289640.33114;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289696.33197;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;
@;73289703.33207;[vs_0];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 ONPREMISEFWIP:4500 ->AZUREFWWANIP:4500 dropped by fw_handle_first_packet Reason: fwconn_key_init_links (INBOUND) failed;


I'm seeing this drop in other places related to NAT as well. The Azure firewall does not actually have the public IP on the gateway in its topology; the public IP is manually defined under "statically NATed" on the object. This has been working fine for years, until R82.
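
A quick way to see whether the gateway's own NAT-T packets are leaving with a translated source port is to capture them on the external interface while the drops are happening (a sketch; assumes eth0 is the Azure FW's external interface and ONPREMISEFWIP is the on-prem peer):

# If hide NAT is catching the gateway's own traffic, the source port here will be
# a high ephemeral port instead of 4500:
tcpdump -nni eth0 "udp and dst host ONPREMISEFWIP and dst port 4500"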

CCSM / CCSE / CCVS / CCTE
PetterD
Collaborator

Looks like we may finally have found the root cause of the issue: sk163835.

After running "fw ctl zdebug + conn drop vm link nat xlate xltrc", TAC spot`ed that the NAT-TRAVERSAL packets from the Azure gateway (line 5, bold) were having the sourceport nated..


@;198991747.74218;[vs_0];[tid_0];[fw4_0];fwx_get_xlbuf: SRV xlation buffer found for request: vmside=1, cli->srv(1);
@;198991747.74219;[vs_0];[tid_0];[fw4_0];fwx_get_xldata: got (172.18.254.6,35d9,0.0.0.0,0 : 0) flags = 220, cli->serv (1);
@;198991747.74220;[vs_0];[tid_0];[fw4_0];fw_xlate_packet: connection <dir 1, 172.18.254.6:4500 -> ONPREMISEGATEWAYIP:4500 IPP 17>, OUTBOUND(1);
@;198991747.74221;[vs_0];[tid_0];[fw4_0];fw_xlate: changing <dir 1, 172.18.254.6:4500 -> ONPREMISEGATEWAYIP:4500 IPP 17> to <dir 0, 172.18.254.6:13785 -> ONPREMISEGATEWAYIP:4500 IPP 17>;
@;198991747.74222;[vs_0];[tid_0];[fw4_0];After POST VM: <dir 1, 172.18.254.6:13785 -> ONPREMISEGATEWAYIP:4500 IPP 17> (len=204) ;
@;198991747.74223;[vs_0];[tid_0];[fw4_0];POST VM Final action=ACCEPT;


The Azure VNET is 172.18.0.0/16 and the FrontendSubnet is part of this VNET.
The last hide-NAT rule in the rulebase was hiding 172.18.0.0/16 behind the gateway. Since the gateway's own frontend IP sits inside that /16, its own outgoing NAT-T traffic matched the hide rule and left with a translated source port, which the peer then answered with port unreachable.


This took a long time to find: it would work fine for up to 7 hours at most, and had been running fine for years on R81.x, but after deploying R82 it suddenly started getting messy. So definitely a configuration issue, but it took some time to find due to the sudden, intermittent failures on R82.

Implemented the following NAT rules to test, and it has now been running stable for 9 hours with no ICMP unreachable messages in the logs, so fingers crossed this solves it permanently 🙂

 
 

(attached screenshot: signal-2026-04-12-062112_004.png)
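
For readers who can't open the screenshot: the added rules are manual no-NAT rules (all Translated columns set to Original) for IKE and NAT-T between the two gateway objects, placed above the existing 172.18.0.0/16 hide rule. Roughly along these lines (a sketch based on the description above; HQ-GW and Net_172.18.0.0/16 are placeholder object names, and the services are the built-in UDP/500 and UDP/4500 IKE/NAT-T services):

No. | Original Source    | Original Dest.     | Original Service                | Translated
 1  | FWAzure            | HQ-GW              | IKE (UDP/500), NAT-T (UDP/4500) | Original / Original / Original
 2  | HQ-GW              | FWAzure            | IKE (UDP/500), NAT-T (UDP/4500) | Original / Original / Original
... | (existing rules)   |                    |                                 |
 N  | Net_172.18.0.0/16  | Any                | Any                             | Hide behind Gateway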

 

CCSM / CCSE / CCVS / CCTE
CarlosCP
Employee

Please keep us posted - running into similar issues with a CloudGuard gateway.

PetterD
Collaborator

Hi,

See my last post: the issue was resolved by creating no-NAT rules for IKE/NAT-T from the gateway's WAN IP in the frontend subnet!

Having a hide NAT for the whole VNET that includes the frontend subnet (and therefore also the Cloudguard itself) was a tripwire that started causing issues after the upgrade. Took some time (and pain) to figure out ;)

CCSM / CCSE / CCVS / CCTE
