Juan_
Collaborator

R80.40 JHF 120 - S2S VPN issue


Hey Lads,

 

A customer installed JHF 120 this morning and many S2S VPNs didn't come up.

Solved it by reverting to JHF 118.

 

 

I'm about to raise this with TAC so they can have a look at the debugs, but wanted to give the community a heads-up.


Juan

74 Replies
the_rock
Champion

Is there an official sk stating all the changes/improvements?

rbrannoc
Employee

Hi @Tobias_Moritz  & D_W, 

My name is Richard Brannock, Director of the Dallas TAC.  Feel free to reach out to me via PM to discuss any support concerns.  In the event an issue is moved off the tracks at any point in time, feel free to leverage the Escalation Path/Matrix to assist.

https://www.checkpoint.com/support-services/check-point-tac-support-escalation-path/

All the best,

Richard

Tobias_Moritz
Advisor

Hello Idan,

Thank you for your response. I know it's hard to see customers venting in public forums, like I did here. I know it was not so nice. Personal experience with various VPN cases in 2020, 2021 and now 2022 resulted in this post.
I appreciate the way you responded to this venting. I saw that your team has invested much effort in improving the VPN codebase in recent years. However, this introduced many new bugs (clearly visible in the JHF release notes), and when checking the most recent release notes of R80.40 JHF T150 from Jan 19th 2022, for example, there is again a large list of VPN fixes.

Most of the negative feelings my team developed come from the way these problems were handled by TAC. How long does it take until the support case reaches someone who really understands the problem? How many remote sessions do we need in which we just repeat what has been in the support case description from the beginning? How many weeks do we have to wait until we finally get attention from R&D?

All but one of our past cases were solved by a hotfix provided by R&D, and these fixes were later integrated into a regular JHF. One was solved by manually clearing some kernel table entries (the alternative was a full-outage reboot), because a JHF had changed the table structure and the update scenario was not fully working. So there were no configuration issues on our side. And yes, TAC told us to switch to IKEv1 in more than one support case, because problems with IKEv2 are known. Only when we insisted that IKEv2 has to work did they continue working on the case.

The case with the kernel table (orig_route_params) was a special experience with TAC. TAC told us our ISP was the reason why some VPNs did not work after updating the JHF. We needed to escalate this SR on multiple levels to get past this TAC engineer and get R&D involved. Once R&D was finally involved, we got the answer within hours: we needed to purge some entries from this table because they had been corrupted by the JHF installation (yes, even after all cluster nodes were updated to the same JHF take (sk116453)).

I've shared the SR now with Naama Specktor via PM. Really appreciate the offer to help. Thank you.

genisis__
Advisor

This is really interesting, because I have had the exact same experience in terms of getting a problem resolved (not VPN related).

In my scenario I have a case that has been running for well over a year now, and only recently did R&D get on a Zoom call. Over two sessions (I would say totalling 3 hrs), R&D found what we believe is the problem, and a hotfix was released within a week (it was a problem they had seen!).

From a business point of view, the escalation process within Check Point needs to be reviewed.

Step 1: TAC handles a case for a maximum of 2 weeks; if no progress is made, it goes directly to CFG, no questions asked.

Step 2: CFG handles a case for a maximum of 2 weeks, and R&D must be engaged in the case and review the problem via a Zoom session.

Step 3: R&D drives the case, with CFG taking a back seat.

Additionally, if an issue cannot be resolved within a time frame that is reasonable for the complexity of the problem, Check Point should be held to an SLA in the form of financial penalties.

I could not calculate the amount of money lost over the year in terms of service interruptions, outages and the amount of man time spent just driving and trying to work through this one problem.

 

 

 

Kim_Moberg
Advisor

@genisis__ 

I've had the same experience.

I think that now, during Covid-19, TAC engineers are mostly working from home and out of reach of their technical team leader who can contact R&D, or they are not trained well enough and are left alone.

I've had similar issues with Cloud and Maestro, and it can be very frustrating when they go around in circles without actually helping.

Thanks

Kim

Best Regards
Kim
the_rock
Champion

@Kim_Moberg ... you described it perfectly. But I guess this is probably not a topic for this thread :-)

genisis__
Advisor

agreed - at least we are all on the same page 😉

Pawel_Szetela
Contributor

Hello Tobias_Moritz,

Sorry to say that our experience with TAC is exactly the same.

 

D_W
Advisor

Sorry to say the same here, and sorry to hijack this thread: the longest and biggest issues we have had so far are VPN related 😞

We're currently facing an issue where TAC already told us once that it is ISP related (in fact, at the affected sites all other VPNs in the same community work well!). Our supplier proved again that it must be CP related, and now it seems R&D is involved, but this case has been open since Oct 2021!

Other VPN issues in the past involved SMB (14xx series) devices, where the VPN between full Gaia and the embedded device(s) suddenly stopped working. The workaround was rebooting, installing policy, and praying that this would fix it. Sometimes it didn't, but after some waiting it healed itself. Those issues haven't shown up in the last months, since we got rid of some of the SMB devices and moved to newer versions on the CPs on the full Gaia side. In one case we never got it fixed completely, so we just set up an IPsec tunnel on Linux VMs between the faulty sites; then we replaced the SMB device with a full Gaia device and the issue was gone. Frustrating.

However, to sum it up: TAC cases sometimes take really long (not only VPN related 😉 ). As a customer I say that if we did not have a good supplier handling this for us, we would look for other solutions. But I still defend CP products (not all 😉 ) because I very much like the solution!

Cheers,
David

I recommend that the board moderators create a separate "CheckMates vs. TAC" thread and maybe another one, "Vent off some steam here if you are mad at CheckPoint" 😀

But, come on... TAC are nice guys who need to follow strict procedures and try to solve your problem as well as possible, because their bonus depends on it. Plus, they are sometimes just humans like everyone else.

It is kind of like this: you need to prove to CheckPoint that your problem is a BIG one before R&D is engaged. And that takes time.

I have worked in the R&D department of a large company. I know how annoying it is to be pulled out of your tight schedule to investigate a customer problem. It is like that everywhere. Try to get Microsoft R&D's attention, for example. Good luck with that 😀

the_rock
Champion

I definitely see your point of view @HristoGrigorov . IT is not an easy gig, everyone knows that... all we can do is do our individual part and help one another. Sadly, COVID-19 made that so much more difficult, since it's all remote work, but where there is a will, there is always a way! Now, on to helping :-)

eitanlu
Explorer

Hello dear Check Mates,

My name is Eitan, and I am the VP of Technical Service at Check Point.

I read this thread, and your honest feedback is very much appreciated. To make this feedback constructive, I am offering you my email address, so you can reach out to me directly and share the case details, case number, or any other feedback. I promise to get back to you promptly. eitanlu@checkpoint.com

thank you,

Eitan

 

Tobias_Moritz
Advisor

Because I was the one who started venting here, triggering the escalation chain which resulted in posts from @rbrannoc and even @eitanlu , I want to provide feedback now that the problem has finally been resolved.

I do not like forum threads that describe problems but never get updated when the solution is found, so here I am with my post 🙂

After the escalation kicked in, we got some progress in our TAC case. After a while, it was confirmed that there was no configuration issue on our side, and we found two workarounds: disabling SecureXL, or adding routes for the pre-NAT destination that point to the post-NAT destination's interface.

However, it took until June 10th (4.5 months after the case was opened) until we finally got a hotfix (fw1_wrapper_HOTFIX_R80_40_JHF_T158_465_MAIN_GA_FULL.tgz) which delivered a new kernel and fixed the bug. The issue ID is PRHF-24166 if you want to track it in future JHF release notes.

Problem summary: Traffic applicable for route-based IKEv2 VPN is not picked up for encryption when DNAT and SecureXL are in use and the routing decision for the pre-DNAT destination is not the same as for the post-DNAT destination.
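
For readers who want to check whether their own setup matches the condition in the summary, here is a small toy model in Python. It is only a sketch, not Check Point's implementation: the routing table, the interface names (eth0, vpnt1) and the addresses are all made-up examples. It does a longest-prefix match for the pre-DNAT and the post-DNAT destination and reports whether the two routing decisions differ. The last line models the second workaround by adding a host route for the pre-NAT destination.

```python
import ipaddress

# Hypothetical routing table (prefix, egress interface); not taken from the thread.
ROUTES = [
    ("0.0.0.0/0", "eth0"),       # default route
    ("10.20.0.0/16", "vpnt1"),   # route-based VPN via a tunnel interface
]


def egress_interface(dst, routes=ROUTES):
    """Return the interface of the longest-prefix match for dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, iface in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    if best is None:
        raise LookupError(f"no route to {dst}")
    return best[1]


def routing_differs(pre_nat_dst, post_nat_dst, routes=ROUTES):
    """True if pre-DNAT and post-DNAT destinations leave via different interfaces."""
    return egress_interface(pre_nat_dst, routes) != egress_interface(post_nat_dst, routes)


pre_nat = "198.51.100.10"   # example address the client connects to
post_nat = "10.20.5.10"     # example address after DNAT, reachable via the tunnel

print(egress_interface(pre_nat), egress_interface(post_nat))
print("condition from the summary met:", routing_differs(pre_nat, post_nat))

# Workaround sketch: a host route for the pre-NAT destination via the tunnel interface.
with_workaround = ROUTES + [("198.51.100.10/32", "vpnt1")]
print("after workaround:", routing_differs(pre_nat, post_nat, with_workaround))
```

If the function reports a difference for your pre-NAT/post-NAT pair, you are at least in the situation described above; whether the bug actually hits you still depends on the JHF take and on SecureXL being enabled.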

 

Some more detailed feedback as requested:

Before we finally got through to R&D, we got an extensive debug plan from TAC which, from our perspective, did not make sense at all for the specific problem the support case was about. We shared our view with the TAC engineer and got this response:

"I completely understand that the suggested procedure of collecting the debug won't be possible to isolate the issue however, I need to follow the Check Point procedure in order to involve RnD."

And this extensive debug plan looked like it had been written many years ago, contained deprecated commands, and was not set up correctly for the case:

  • fw monitor -e instead of fw monitor -F and not disabling of SecureXL but this case was about a SecureXL problem
  • fw monitor filter string did not take NAT into account, but this case was about NAT
  • tcpdump instead of cppcap (sk141412)
  • no tcpdump interface filter, which is also a bad idea on a production gateway
  • tcpdump filter did not take the second tunnel into account, but this case was about a VPN tunnel as described in sk100726, which always involves two tunnels
  • log collection task only collects IKEv1 logs, but this case was about IKEv2
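
To make the capture-filter points from the list concrete, here is a minimal Python sketch that builds a tcpdump command line with an explicit interface and a host filter covering every address involved (pre-NAT and post-NAT destinations, both tunnel peers). The helper names and all addresses are made up for illustration; only the standard tcpdump options -nn and -i and the standard "host A or host B" filter syntax are used. It only builds the command, it does not run anything.

```python
import ipaddress


def build_capture_filter(addresses):
    """Build a 'host A or host B ...' capture filter from a list of IP addresses."""
    if not addresses:
        raise ValueError("at least one address is required")
    # Validate and normalise each address, then drop duplicates while keeping order.
    normalised = [str(ipaddress.ip_address(a)) for a in addresses]
    unique = list(dict.fromkeys(normalised))
    return " or ".join(f"host {a}" for a in unique)


def tcpdump_command(interface, addresses):
    """Return the argument list for a tcpdump run limited to one interface."""
    return ["tcpdump", "-nn", "-i", interface, build_capture_filter(addresses)]


# Example values (documentation address ranges, hypothetical interface name).
involved = [
    "198.51.100.10",  # pre-NAT destination
    "10.20.5.10",     # post-NAT destination
    "203.0.113.1",    # peer of the first tunnel
    "203.0.113.2",    # peer of the second tunnel
]
print(" ".join(tcpdump_command("eth0", involved)))
```

When pasting the printed line into a shell, put the filter expression in quotes. The point is simply that every address that can appear on the wire for this flow ends up in the filter, and that the capture is tied to one interface.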

Later on, we got a debug plan with tcpdump again (this time with totally wrong syntax), and we were supposed to run a manual kernel debug (with a bunch of flags including drop) and the macro fw ctl zdebug + drop in parallel. We told the TAC engineer that this is not possible, as the macro also runs a kernel debug and there can only be one at a time. We were told that we were wrong and should run it that way. That we had already shown multiple times (in writing and during Zoom sessions) that there are no drops is the minor problem here. You know it already: we were not wrong, but the TAC engineer was.

For a case which was already on high escalation level, this did not provide a good customer experience.

Later on, we got a new escalation engineer who handled the case together with R&D. From then on, the problem was understood (and even isolated in the debug logs) and R&D started working on a fix.

the_rock
Champion

Excellent point about sharing a solution, I always try to follow the same example @Tobias_Moritz 

Naama_Specktor
Employee

Hi @Tobias_Moritz 

My name is Naama Specktor and I am from Check Point.

I would appreciate it if you could share the TAC SR number.

 

thank you,

Naama
