Albin
Contributor

Destination NAT:ed TCP traffic on R80.40 VSX has a delay of 7 seconds when SecureXL is enabled

Hi,

We have two customers who have the same issue on R80.40 VSX gateways. Both are running take 118.

The issue affects Destination NAT:ed TCP traffic (UDP not tested), specifically on R80.40 VSX gateways. Both customers previously ran R80.10, where the issue was not present.

Traffic has a delay of 7 seconds for both customers, but the behavior in the PCAPs was slightly different, probably due to some SecureXL calculations. For one customer, the SYN does not pass the firewall at all. On the 4th SYN attempt from the client (7 seconds in), it finally passes through the firewall and the SYN-ACK is received. Everything works well after that.

For the other customer, the destination NAT:ed IP is on a DMZ which is in the same range as the original destination. This destination NAT rule has a specific port as a criterion. Example: the original destination is 1.1.1.61 port 50000, the translated destination is 1.1.1.62 port 50000. The PCAP again shows that the traffic only starts working on the 4th SYN. The big difference in this PCAP is that the traffic is actually translated and sent out, but to the wrong MAC address: it is sent to 1.1.1.61's host MAC address. Only on the 4th SYN packet is it sent to the correct MAC address.
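
For anyone wanting to reproduce that PCAP observation, a capture that prints the Ethernet header makes the per-SYN destination MAC visible directly on the gateway. A sketch using the example addresses above (interface name is hypothetical, adjust to your environment):

```shell
# -e prints the link-level (Ethernet) header so the destination MAC of each
# SYN is visible; -n skips name resolution; the filter matches only SYNs
# toward the translated destination from the example.
tcpdump -i eth1 -e -n 'host 1.1.1.62 and tcp port 50000 and tcp[tcpflags] & tcp-syn != 0'
```

Comparing the destination MAC of SYN 1-3 against SYN 4 should show the switch to the correct host MAC.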

We've debugged the issue and can see in the debug output that on the 4th packet, SecureXL updates the MAC addresses, after which the traffic flows.

Turning SecureXL off, either globally or only for the specific traffic, helps and makes the session establish quickly, as expected.

We have tickets open with TAC but have not received any conclusive answers, which surprises me, as I would assume a lot of people would have this problem. Hence I wanted to check with the community: has anyone else experienced this issue and possibly received a fix?
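
Side note on why the delay is exactly about 7 seconds: this lines up with the default TCP SYN retransmission backoff on a stock Linux client (initial RTO of 1 s, doubling per retry, an assumption about the client OS), so retransmissions go out roughly 1 s, 3 s, and 7 s after the first SYN. The arithmetic:

```shell
# SYN retransmission timing sketch: initial RTO of 1s, doubled per retry.
# The Nth SYN leaves at the cumulative sum of the earlier RTOs.
total=0
rto=1
for attempt in 2 3 4; do
  total=$((total + rto))
  rto=$((rto * 2))
  echo "SYN #${attempt} sent ~${total}s after the first"
done
```

The 4th SYN lands at ~7 s, matching the observed delay.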

14 Replies
_Val_
Admin

Do you have NAT templates enabled?

Albin
Contributor

Yes. However, we did try to disable it on one of the customers, it did not help.

_Val_
Admin

Understood. Manual or automatic NAT? 

Albin
Contributor

Both are using Manual NAT. 

_Val_
Admin

Sorry for so many questions, I am just trying to figure out the best way of searching for these issues. So far, I do not see anything similar anywhere, neither in the community, nor in the TAC stories.

So, after looking at your tickets, I have even more questions. 

1. Upgrade in place or re-install?

2. Same HW or also changed?

3. Both on open server or also on CP appliances?

4. Do you have HT enabled?

5. Both customers, do they have similar topologies on their VSX?

6. VSLS or HA? Any virtual routers/switches on affected systems?

If this is too much to answer here openly, please feel free to send me via PM or even via email to vloukine@checkpoint.com

Tobias_Moritz
Advisor

We got a similar issue after upgrading from R80.40 JHF T102 to T118.

The symptoms were exactly as you described. Debugging showed that the first three SYN packets of every TCP connection were routed to the wrong outgoing interface by ppak (SecureXL), so of course they got no answer. The fourth packet was routed correctly by ppak; after that, the TCP handshake completed and the connection worked from that point on. This problem affected every TCP connection on our gateway that used destination NAT over a route-based VPN with NAT-T enabled for the tunnel. We could implement a workaround by not using destination NAT for this specific connection (possible in our special case), so we did not roll back the JHF.

We opened a TAC case and were told (by a T3 engineer) that this issue is covered by PRJ-15569 (even though its description does not match perfectly) and that our problem should be solved after updating to T120. We have scheduled this for Aug 31st, so I cannot tell you yet whether the TAC engineer was right. Maybe you can update to T120 in your environment earlier and share the results here.

Albin
Contributor

Thanks, very useful information.

I don't think we will be able to secure a maintenance window before 31 Aug. Please let us know the results of your upgrade.

As a warning, you should be aware that Take 120 has had some issues with VPN, so check this out before you upgrade:

https://community.checkpoint.com/t5/General-Topics/R80-40-JHF-120-S2S-VPN-issue/m-p/125884

& sk166417

Tobias_Moritz
Advisor

Thanks for the link, but fortunately, we are not affected by this new feature.

We were asked by CP TAC to postpone the installation of JHF T120 due to concerns about other IKEv2 issues we are discussing with TAC, especially a new bug introduced by a private patch we got for T118 (which we uninstalled because of this bug). That private patch was integrated into T119, and they want to check whether the new bug was integrated along with it 🙂

R&D gave special love to the VPN code base this year, fixing a large number of issues and improving things. Unfortunately, they also broke a lot of things that were working before with this large amount of code change over the last few GA jumbo takes. Even private patches took up to four iterations of fixing the fix until they really fixed the issue they were made for without breaking other things we are using on the same gateway.

TLDR: I cannot update you today with our results.

Albin
Contributor

Thanks for letting me know!

I can update you that we have a customer (without VSX) who has the same issue, a traffic delay of 7 seconds on destination NAT traffic toward a DMZ. Take 120 did not help there. It seems the issue might not be VSX specific, but rather SecureXL + inbound DNAT specific?

I know your traffic was related to VPN too; ours is not. Simply destination NAT into a DMZ.

Tobias_Moritz
Advisor

A very useful update, thank you. In our case, only route-based VPN with destination NAT is affected, not policy-based VPN with destination NAT, and also not non-VPN traffic to the DMZ with destination NAT. Strange...

But yes, the problem is obviously in PPAK (SecureXL) area.

Any usefull update in your TAC case?

Albin
Contributor

Not yet 😞

Albin
Contributor

We have received an update from TAC: try running # fw ctl set int cphwd_routing_interval 0 (the default is 5).

One customer has tried and it resolved the issue. We will now receive a hotfix.

The other customer is scheduling a window to try it, but that might take a while. When I google the parameter, there is an SK for it, but it is hidden when I try to open it. You might want to give it a shot! Just remember to change it back to 5 if it does not help, or ask your TAC engineer about this value.
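
Note for anyone testing this: fw ctl set only changes the value at runtime and is lost on reboot. CheckPoint kernel parameters are normally persisted via $FWDIR/boot/modules/fwkern.conf, but confirm with your TAC engineer before making this permanent. A sketch:

```shell
# Runtime change (lost on reboot); the default value is 5:
fw ctl set int cphwd_routing_interval 0
# Verify the active value:
fw ctl get int cphwd_routing_interval
# Persist across reboots via fwkern.conf (confirm with TAC first):
echo 'cphwd_routing_interval=0' >> $FWDIR/boot/modules/fwkern.conf
```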

AaronCP
Contributor

Hi @Albin,


I recently started to experience this issue, too. We upgraded our non-VSX gateway to R80.40 T125 on Sunday 14th November. A couple of days later, we started to experience delays: a 10-12 second delay in our outbound SIP traffic, as well as a 7-8 second delay in our web traffic.


The traffic has a destination NAT configured, and both servers are in our DMZ.


As soon as I disable SecureXL, the traffic connects instantly.
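
(For reference, this kind of A/B test can be done at runtime without a policy change; the toggle is not persistent across reboots, and on VSX you would first switch to the affected VS context with vsenv. Expect higher CPU load while acceleration is off.)

```shell
# Check current acceleration status
fwaccel stat
# Disable SecureXL until reboot (test window only)
fwaccel off
# ... reproduce the connection test, then re-enable:
fwaccel on
```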


Did TAC give any further information on this issue? I can't see it listed in any of the ongoing takes. Will they provide a hotfix, or is setting cphwd_routing_interval to 0 the fix?


Thanks,


Aaron.

Albin
Contributor

Hi,

We got the following hotfix with T120:

Name: fw1_wrapper_HOTFIX_R80_40_JHF_T120_919_MAIN_GA_FULL.tgz
Size: 86152160 bytes (82 MiB)
MD5: 534fbe75d80fd9cb32c4159b8d190cb7

After upgrading the JHF and installing the hotfix, make sure the following parameter is set to 1 by running:
  • # fw ctl get int enable_calc_route_wrp_jump
If it is not 1, run:
  • # fw ctl set int enable_calc_route_wrp_jump 1

We did not install it because we've seen issues with T120. We were going to get it ported to T125, but the customer decided to wait for Jumbo integration instead, where it will be added under PRJ-30818 or PRHF-19417.


Hence, I never tested it, but you can request it too. The customer is still running with F2F exceptions in table.def as a workaround; since we only had the issue on one traffic flow, it was not too big of a deal.
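
For anyone using a similar F2F exception, these commands (a sketch; exact output columns vary by version) help confirm whether the affected connection is actually going F2F rather than being accelerated. The IP below is the translated destination from the example earlier in the thread:

```shell
# Per-connection offload state; look for the F2F flag on the affected 5-tuple
fwaccel conns | grep 1.1.1.62
# Aggregate counters: accelerated vs F2F packet percentages
fwaccel stats -s
```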