Because I was the one who started venting here, triggering the escalation chain which resulted in posts from @rbrannoc and even @eitanlu, I want to provide feedback now that the problem has finally been resolved.
I do not like forum threads that describe problems but never get updated once the solution is found, so here I am with my post 🙂
After the escalation kicked in, we made some progress in our TAC case. After a while, it was confirmed that there was no configuration issue on our side, and we found two workarounds: disabling SecureXL, or adding routes for the pre-NAT destination pointing to the interface of the post-NAT destination.
However, it took until June 10th (4.5 months after the case was opened) until we finally got a hotfix (fw1_wrapper_HOTFIX_R80_40_JHF_T158_465_MAIN_GA_FULL.tgz) which delivered a new kernel and fixed the bug. The issue ID is PRHF-24166 if you want to track it in future JHF release notes.
Problem summary: Traffic that should go through a route-based IKEv2 VPN is not picked up for encryption when DNAT and SecureXL are in use and the routing decision for the pre-DNAT destination is not the same as for the post-DNAT destination.
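To make the condition in the summary a bit more concrete, here is a rough Python sketch that checks whether the pre-DNAT and post-DNAT destinations would leave through different interfaces. This is only an illustration and not Check Point tooling: the routing table, the addresses and the interface names (eth0, vpnt1) are made up, and the lookup is a plain longest-prefix match that ignores metrics and policy-based routing.

```python
import ipaddress

def egress_interface(routes, destination):
    """Return the interface of the longest-prefix route matching destination, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, interface in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return best[1] if best else None

def routing_differs(routes, pre_nat_dst, post_nat_dst):
    """True if the two destinations resolve to different egress interfaces."""
    return egress_interface(routes, pre_nat_dst) != egress_interface(routes, post_nat_dst)

# Hypothetical routing table: default route plus a route into the VPN tunnel interface.
routes = [
    ("0.0.0.0/0", "eth0"),
    ("10.20.0.0/16", "vpnt1"),
]
pre_nat_dst = "192.0.2.10"   # hypothetical address the client connects to
post_nat_dst = "10.20.1.10"  # hypothetical address after DNAT

# Routing decisions differ, which is the condition described in the summary.
print(routing_differs(routes, pre_nat_dst, post_nat_dst))

# Second workaround: route the pre-NAT destination to the post-NAT interface.
routes_with_workaround = routes + [("192.0.2.10/32", "vpnt1")]
print(routing_differs(routes_with_workaround, pre_nat_dst, post_nat_dst))
```

With the extra host route in place, both lookups return the same interface, which mirrors the second workaround mentioned above.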
Some more detailed feedback as requested:
Before we finally got through to R&D, we got an extensive debug plan from TAC which, from our perspective, did not make sense at all for the specific problem the support case was about. We shared this opinion with the TAC engineer and got this response:
"I completely understand that the suggested procedure of collecting the debug won't be possible to isolate the issue however, I need to follow the Check Point procedure in order to involve RnD."
This extensive debug plan looked like it had been written many years ago. It contained deprecated commands and was not set up correctly for the case:
- fw monitor -e instead of fw monitor -F, without disabling SecureXL, although this case was about a SecureXL problem
- the fw monitor filter string did not take NAT into account, although this case was about NAT
- tcpdump instead of cppcap (sk141412)
- no tcpdump interface filter, which is a bad idea on a production gateway
- the tcpdump filter did not take the second tunnel into account, although this case was about a VPN tunnel as described in sk100726, which always consists of two tunnels
- log collection task only collects IKEv1 logs, but this case was about IKEv2
Later on, we got another debug plan, again with tcpdump (this time with completely wrong syntax), in which we were asked to run a manual kernel debug (with a bunch of flags, including drop) and the macro fw ctl zdebug + drop in parallel. We told the TAC engineer that this is not possible, as the macro also runs a kernel debug and there can only be one at a time. We were told that we were wrong and should run it that way. That we had already shown multiple times (in writing and during Zoom sessions) that there were no drops is the minor problem here. You know it already: we were not wrong, but the TAC engineer was.
For a case which was already at a high escalation level, this did not provide a good customer experience.
Later on, we got a new escalation engineer who handled the case together with R&D. From then on, the problem was understood (and even isolated in the debug logs) and R&D started working on a fix.