Juan_
Collaborator

R80.40 JHF 120 - S2S VPN issue

Hey Lads,

 

A customer installed JHF 120 this morning and many S2S VPNs didn't come up.

Solved it by reverting to JHF 118.

 

 

I'm about to raise this with TAC so they can have a look at the debugs, but wanted to give the community a heads-up.


Juan

74 Replies
Pawel_Szetela
Contributor

Hello,

We have another issue with S2S between Check Point GWs - routing traffic from satellite GWs (1550) through center to Internet stopped working after applying Jumbo 125 to central GW. Access to internal resources is working fine. Reverting to previous Jumbo (in our case 102) fixed this issue.

Regards,

0 Kudos
luthfi_rahman
Explorer

Hi,
we had the same issue.

It turns out there is a kernel parameter, fw_ha_vpn_handle_becaming_ready.

You can ask TAC for details;
it works on my VSX VSLS environment.

0 Kudos
genisis__
Leader

Do you know the SK reference for this? If it's a fix/workaround for a bug, then an SK should be published.

0 Kudos
PhoneBoy
Admin

I don't see even an internal SK for this issue and only a single mention in a TAC case...one that specifically mentions this thread, in fact.
Agree an SK should exist for this.

0 Kudos
genisis__
Leader

I searched for "fw_ha_vpn_handle_becaming_ready" on the support site and nothing comes up. Are we able to get an SK raised for this?

0 Kudos
idants
Employee

Hi,

If you upgrade only one member to take > 120 while the other member still runs take < 120, there might be a VPN outage after failover to the member running take > 120.

There are 2 options:

1. Change the global variable as written below

2. Upgrade both members to take > 120 (which is the common use case)

The problem won't occur in the future; it was caused by a change we made in take 120.

Thanks,

Idan Tsarfati

IPsec VPN R&D group manager

0 Kudos
genisis__
Leader

Idan - can you confirm the SK related to this issue? I would expect one to be publicly available so that the workaround procedure is documented.

Additionally, is this being treated as a bug? If so, when will it be fixed?

0 Kudos
idants
Employee

It was already fixed. It is a one-time issue, as I described above.

0 Kudos
Kim_Moberg
Advisor

Hi @idants 

When you say it is fixed in a newer version, does that mean Multi-Version Cluster (MVC)?

I think I have used the CLI command "cphaconf mvc on".

 

 

Best Regards
Kim
0 Kudos
Tobias_Moritz
Advisor

@idants From your description, it looks like this VPN outage occurs once for every customer who updates an HA cluster to JHF T120 or above, because the usual procedure is:

  • update standby node to new JHF
  • failover to updated node (by clusterXL_admin down on active node)
  • update other node

If this is the case, then this problem should really be documented in a public SK, and the JHF release notes should link to it.

You said "Change the global variable as written below". Where is it written below? I cannot find it in this thread at the moment. "fw_ha_vpn_handle_becaming_ready" is mentioned here, but not its default value, the value it should be set to in order to work around the problem, or what should be done with this variable once both nodes are up with T120.

Just to avoid confusion: We are not talking about the change in narrowing feature (sk166417, which was the beginning of this thread) anymore, right? Or do we?

I'm also wondering why you say this is a specific T120 problem, because when looking at the change log in sk165456, there is nothing VPN related in T120. There are a number of VPN fixes/changes in T119. Did you mean T119 when you said T120, or are there undocumented changes in T120?

Sorry, but I really have to ask these specific questions, because we had more than enough VPN outages this year due to new bugs introduced by the large number of VPN fixes in the JHFs this year, including various private fixes which broke more than they fixed or even broke VPN completely (vpnd crash loop).

0 Kudos
Juan_
Collaborator

Hey Tobias,

The title of the post here says 120 because that was the GA take (and not 119), but yes, the VPN changes are in 119.

edit: I'd be interested in knowing what "fw_ha_vpn_handle_becaming_ready" is about as well.

 

Juan

0 Kudos
genisis__
Leader

This is why we are asking for an SK, which clearly does not exist. And did someone not mention they are seeing the same issue on JHFA125?

To me this sounds like a bug.

 

idants
Employee

Hi,

We will create SK on Sunday.

To make it clearer in the meantime:

1. We are not talking about the narrowing issue which started this thread. Regarding the narrowing issue, there is a fix which is not yet part of the JHF (it will be part of the next one; follow the JHF SK). If you need the fix (according to the guidance in my last response on it), please contact TAC to get it. It is only relevant for customers who have narrowed tunnels.

2. Since there was a change in a VPN table in take 119, upgrade to this take *might* cause some outage on some tunnels when 1 member is running with take >119 and the other member with take < 119.

In order to overcome it, there are 2 options:

A. Upgrade both of the members at the same time during maintenance window

B. Add to fwkern.conf the following line:

fw_ha_vpn_handle_becaming_ready=1 

and run this CLI command on both members: fw ctl set int fw_ha_vpn_handle_becaming_ready 1

This is not a bug which needs to be fixed - future updates after that won't need this procedure anymore (when the initial state is both members running with take >= 120, prior to the upgrade).
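
A minimal sketch of option B, not an official procedure. It assumes fwkern.conf lives at $FWDIR/boot/modules/fwkern.conf (the usual location, but please verify it on your gateway). The helper name persist_kernel_param is hypothetical; all it does is make sure the parameter line appears exactly once in the file. The runtime command is the one quoted above and is left as a comment so the snippet can be tried safely on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper: write "<name>=<value>" into a fwkern.conf-style file,
# replacing any earlier assignment of the same parameter.
persist_kernel_param() {
    conf_file=$1
    name=$2
    value=$3
    touch "$conf_file"
    # Copy every line except an existing assignment of this parameter.
    grep -v "^${name}=" "$conf_file" > "${conf_file}.tmp" || true
    # Append the desired assignment and swap the file into place.
    printf '%s=%s\n' "$name" "$value" >> "${conf_file}.tmp"
    mv "${conf_file}.tmp" "$conf_file"
}

# Demo against a scratch file. On a gateway the target would be
# "$FWDIR/boot/modules/fwkern.conf" (assumed path, verify before use).
conf=$(mktemp)
persist_kernel_param "$conf" fw_ha_vpn_handle_becaming_ready 1
persist_kernel_param "$conf" fw_ha_vpn_handle_becaming_ready 1   # second run adds nothing
cat "$conf"
rm -f "$conf"

# Then, on both members (command as given in the post above):
#   fw ctl set int fw_ha_vpn_handle_becaming_ready 1
```

Running the helper twice leaves a single line in the file, so it can be repeated on each member without piling up duplicates.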

I hope this is more clear now.

Please let me know if more details are needed.

 

Thanks,

Idan Tsarfati.

IPsec VPN R&D group manager.

0 Kudos
genisis__
Leader

Thanks Idan - when you say a fix for the narrowing issue is going to be implemented in an upcoming Jumbo, can you confirm the related bug ID so we can look out for it? Any insight as to when the Jumbo is scheduled would also be good.

 

Option two sounds like a separate issue. I assume this is what the SK will be related to? If both members are running take 125 then the above procedure is not required, correct?

Do we have a similar issue when running R81 or R81.10? 

0 Kudos
idants
Employee

No, this specific problem doesn't exist in R81 and R81.10.

 

The instructions which will appear in the SK:

o   When a Site-2-Site tunnel uses NAT-T and one of the sites has a Cluster Gateway in which one Member runs R80.40 with Jumbo take < 119 and the other with take >= 119, the tunnel will be down until the relevant info is reset.

o   Workaround

          *  Delete the relevant entry in 'orig_route_params': fw tab -t orig_route_params -x -e "<peer_ip,0,0;"

          *  Reset the VPN tunnel: vpn tu del <peer_ip>

o   The issue will be resolved automatically right after upgrading the second member to the same take or higher.
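
As a rough illustration of the condition above (not part of the SK), the sketch below checks whether two members are in the mixed-take window. The function name is_mixed_take_window is hypothetical, the take numbers are passed in by hand, and the threshold defaults to 119 as stated in the instructions.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 (true) when exactly one cluster member is
# below the threshold take and the other is at or above it.
is_mixed_take_window() {
    take_a=$1
    take_b=$2
    threshold=${3:-119}   # 119 per the instructions above
    if [ "$take_a" -lt "$threshold" ] && [ "$take_b" -ge "$threshold" ]; then
        return 0
    fi
    if [ "$take_b" -lt "$threshold" ] && [ "$take_a" -ge "$threshold" ]; then
        return 0
    fi
    return 1
}

# Example: one member still on take 118, the other already on take 125.
if is_mixed_take_window 118 125; then
    echo "mixed-take window: NAT-T tunnels may need the workaround"
fi
```

Once both members run the same take or higher, the check returns false, which matches the last point: the issue resolves itself after the second member is upgraded.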

0 Kudos
genisis__
Leader

Thanks. Please let the community know the SK reference when it is released.

0 Kudos
idants
Employee

sk175824 - will be available in the next few hours.

0 Kudos
Tobias_Moritz
Advisor

Thank you for writing and publishing the sk175824.

If I understood you correctly, both the workaround and the solution mean a short outage of the affected VPN tunnels when updating a Gateway HA Cluster. Is there a way to do this without an outage? Maybe not, and we have to accept it. I'm just asking because the workaround with the kernel parameter fw_ha_vpn_handle_becaming_ready that you mentioned before is no longer mentioned in your new SK.

0 Kudos
Juan_
Collaborator

This makes sense with what happened in my case.
Though we haven't yet tried the upgrade again.

0 Kudos
Stefano_Marchet
Participant

Hi,

Our customer also confirmed that some VPN issues started after installing JHF Take 120 for R80.40; sometimes he needs to reset a VPN tunnel to bring it up correctly.

Previously, with JHF Take 118 for R80.40, there were no VPN issues.

I can't find any "sk175824"... any news on it?

I saw many VPN blade fixes published for the upcoming JHF Take 126 (now in "ongoing" state), but I can't tell whether installing this JHF will fix the customer's problem.

 

Regards,

Stefano Marchetti

0 Kudos
Juan_
Collaborator

Hi Idants,

3 questions, as I am about to upgrade a cluster from pre-118 to 125.

  • Seems like the SK has been made internal?
  • When executing this command: fw tab -t orig_route_params -x -e "<peer_ip,0,0;"
    • Is the IP to be entered in hex?
    • Would this be valid syntax?
      • fw tab -t orig_route_params -x -e "1.1.1.1,0,0;"
    • Can I just nuke the whole table?
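
On the hex question: the thread does not say which form the entry key has to take, so this is only a convenience sketch. fw tab prints table entries in hexadecimal by default, so if the key turns out to be needed in hex, a dotted IPv4 address can be converted as shown below. The function name ipv4_to_hex is hypothetical, and whether the -e argument actually expects this form is an unverified assumption.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted-quad IPv4 address to 8 hex digits.
# Octets are expected without leading zeros (e.g. 10.0.0.8, not 010.000.000.008).
ipv4_to_hex() {
    old_ifs=$IFS
    IFS=.
    # Split the address on dots into the positional parameters $1..$4.
    set -- $1
    IFS=$old_ifs
    printf '%02x%02x%02x%02x\n' "$1" "$2" "$3" "$4"
}

ipv4_to_hex 1.1.1.1        # prints 01010101
ipv4_to_hex 192.168.10.5   # prints c0a80a05
```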

Thanks

0 Kudos
Alex-
Advisor

Hello @idants ,

I have some implementations with lots of narrowed tunnels which has probably always been the case since a number of them were created a while ago in different versions, with other peer vendors and so on. I don't really think I could reconfigure them all to be not narrowed anymore as there are constraints like availability, coordination with 3rd parties and the like.

Do I understand correctly from your point 1 that in a JHF after Take 125, something will be implemented so they don't break upon upgrade?

I'm currently holding onto Take 118 for two reasons: Full HTTPS inspection on VSX seems to have an issue as of Take 119/120 (SR open), and this VPN narrowing subject. I'd rather upgrade the environments that use these two features once both solutions are identified. 🙂

 

Kind regards,

Alex

0 Kudos
idants
Employee

You can ask TAC for a port-fix for the take you are going to upgrade to, until this fix (for narrowed tunnels) is integrated into the JHF.

Naama_Specktor
Employee

Hi 🙂

I would appreciate it if you could share the SR # from TAC; you can also share it via PM.

thank you!

Naama 

 

0 Kudos
Pawel_Szetela
Contributor

Hello again,

As I wrote before we have another issue with S2S between Check Point GWs - routing traffic from satellite GWs (1550) through center to Internet stopped working after applying Jumbo 125 to central GW. Access to internal resources is working fine. Reverting to previous Jumbo (in our case 102) fixed this issue.

With the new JHA 139 the problem is still there.

Does anyone here have the same issue?

Regards,

0 Kudos
genisis__
Leader

We are using JHFA139 on VSX and have no issues with S2S VPNs. We have a mix of CP and third-party peers, as well as a mix of IKEv1 and IKEv2 tunnels, but I can't say whether our topology and setup are exactly the same as yours.

It does sound like you need a TAC case, especially if it works with an older Jumbo.

0 Kudos
Pawel_Szetela
Contributor

We are using IKEv2 and the only problem with newer jumbos is routing to Internet through center gateway. All other traffic to internal networks is working fine.

0 Kudos
Tobias_Moritz
Advisor

As I already said, we have VPN issues with almost every new JHF, and TAC agrees in phone calls that IKEv2 is still not a stable feature on Check Point gateways, even after all these years and even though the protocol is a widely used standard. When you look at the change logs of the JHFs, you see various VPN fixes, and some of these fixes introduce new bugs.

Are you using IKEv2? We do not have your architecture, so I cannot tell whether we would hit the same problem as you, but we are hitting a problem with TCP and UDP traffic over route-based IKEv2 VPNs to AWS. ICMP is working. The older codebase (JHF T102 plus hotfixes) works too.

A TAC case is open, of course, but progress is very slow.

idants
Employee

Hi,

We worked very hard to improve the quality of IKEv2 during 2021 and I can say that it is much more stable and we now see very few new problems with IKEv2, some of them are configuration issues.

Not sure who in TAC said it, but it is incorrect.

Thanks,

Idan Tsarfati

IPSec VPN R&D group manager.

0 Kudos
