NFS Issues

Mark_Shaw · ‎2021-06-09

Do Check Point and Unix NFS mounts generally not work well together?

The reason I ask is that whenever we have to fail over or reboot our firewalls, it breaks NFS connections between hosts and we have to manually clear the connections table, which involves converting decimal to hex and so on, so it is not ideal. This has been happening for a while with our on-prem firewalls, which run Maestro and R80.20SP. Well, I say a while: the issue disappeared and has now reared its head again. We have a case open with TAC and they have a load of debug data.

Now the issue has occurred with our Azure IaaS firewalls. We had a memory spike yesterday, which we think is down to us under-speccing the VMs, and it broke some NFS connectivity. Last night, when we added more memory to the VMs and installed the latest ongoing HFA, failing the firewalls over broke NFS again. Again we had to manually clear the connections table.

It just seems strange that we have seen the issue on two different firewall architectures:

On-Prem – Running Maestro and R80.20SP latest HFA

Azure IaaS – Running an IaaS HA Cluster on R81 latest ongoing HFA (take 29).

Timothy_Hall · ‎2021-06-10

Any chance the associated RPC protocol is using TCP for the transport instead of the standard UDP? See here:

sk44982: TCP based RPC traffic is dropped by the Security Gateway.

Also what is the L4 protocol in use for NFS itself, TCP or UDP? UDP itself is stateless of course, but the firewall still tracks "state" for that traffic for purposes of stateful inspection. The Virtual Timeout enforced for NFS over UDP is quite short compared to TCP and may be part of your problem. If possible can you check the L4 transport for your NFS mounts and see if there is any difference in the problematic behavior between TCP and UDP mounts? If only UDP is in use, can you create a test mount with TCP and see if it experiences the same issues?

The fact that this seems to happen upon failover suggests that not everything is being sync'ed between the members; this is expected behavior to some degree as any inspection operations occurring in process-based security servers on the firewall are generally not synced between the cluster members.

Threat Prevention does perform some of its operations in security server processes, whose inspection state will be lost upon failover. So the next thing I would suggest, if you are using any of the 5 TP blades, is to create a "null" TP profile rule at the top of your TP policy (the null profile has all 5 TP blades unchecked) and apply it to all traffic to and from your NFS server in the Protected Scope. Once that is in place, see if the failover situation for NFS traffic improves. If it does, you may need to tweak your TP policy and how it handles NFS traffic, as leaving the null profile in place indefinitely is not a good idea.

Attend my online "Be your Own TAC: Part Deux" CheckMates event
March 27th with sessions for both the EMEA and Americas time zones

Mark_Shaw · ‎2021-06-16

Hi Timothy

Many thanks for your reply.

All of the NFS mounts are using TCP/2049 for the connections.

Tonight we implemented the null TP policy with the NFS servers in the protected scope. We performed a failover by issuing clusterXL_admin down and everything was fine when we failed to/from the primary gateway.

We then performed a hard reboot of the primary firewall, and this is when things broke again: NFS mounts for three hosts stopped working. We had to convert the IPs and ports from decimal to hex to clear the entries from the connections table.
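For anyone doing the same decimal-to-hex conversion by hand, here is a minimal Python sketch. The function names are made up for illustration, and the zero-padded 8-digit width is an assumption about how the connections table prints its fields, so check it against what your gateway actually shows before relying on it.

```python
import ipaddress


def ip_to_hex(ip: str) -> str:
    """Return an IPv4 address as 8 lowercase hex digits (10.1.2.3 -> 0a010203)."""
    return format(int(ipaddress.IPv4Address(ip)), "08x")


def port_to_hex(port: int) -> str:
    """Return a TCP/UDP port as 8 lowercase hex digits (assumed field width)."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return format(port, "08x")


if __name__ == "__main__":
    # Hypothetical client address; 2049 is the NFS port mentioned in this thread.
    print(ip_to_hex("10.1.2.3"), port_to_hex(2049))
```

The resulting strings are only the building blocks; how they are assembled into a table entry for deletion depends on the gateway version and is not covered here.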

Any further ideas? Not getting anywhere quickly with TAC and these Unix guys are wanting me to look at Azure firewalls instead.

Timothy_Hall · ‎2021-07-12

@Mark_Shaw interesting that the TP null profile seemed to help with "soft" failovers but not "hard" ones.

So someone else saw my response here and asked me if there were any updates, as they are seeing a similar issue with NFS and failovers. Here is my reply to them and I hope it is helpful, please note I am making some serious educated guesses here and doubt the following is 100% correct:

After thinking about this, I can think of two main things that are different between a "soft failover" (clusterXL_admin down) vs. a "hard failover" (rebooting or pulling power) on the active member.

1) There is a 2-2.5 second "dead timer" interval during a hard failover in which all traffic is lost or stalls; this usually manifests itself as losing 1-2 consecutive ping packets. A soft failover is nearly instantaneous, as the new member does not need to wait out the dead interval before taking over. In my experience a 2-2.5 second interruption in traffic during a hard failover is not enough to break stuff.

2) With a soft failover, state sync is maintained continuously between the two members as they trade active/standby roles. With a hard failover it stops suddenly and some sync updates may be lost; this is most likely to affect traffic that is very chatty, which NFS certainly is. Earlier they mentioned having to clear entries in the connections table to get things working again on the newly-active member, which tracks with this theory, as does the fact that they didn't have to clear connections after a soft failover with a TP null profile in place.

It is almost like some sync updates were lost during the hard failover, and the difference in some kind of sequence number or perhaps an anti-replay feature kills the NFS traffic through the newly-active member, as the difference is outside some kind of acceptable window for inspection. This seems much more likely than scenario #1.

Suggestions:

1) Assuming the null TP profile is in place, try adding an Inspection Settings exception for the NFS hosts for Protection "Any". This is very different from a null TP profile. If the NFS traffic is fragmented, it could be running afoul of the IP Fragments Inspection Settings protection or possibly some of the TCP ones like Invalid Retransmission, Out of Sequence, etc.

2) It might be interesting to use fast_accel to fully accelerate the NFS traffic inside SecureXL and see if that impacts the NFS issue upon hard or soft failover. Because there is very little sync'ing of state between SecureXL/sim on cluster members, there may be more relaxed inspection windows present there or even some kind of recovery mechanism for handling this situation.


AaronCP · ‎2021-09-06

Hey @Mark_Shaw,

We're experiencing a very similar issue with our NFS shares, too. We have multiple Linux VMs running in Azure and if the NFS share is mounted at the time of a failover, the NFS share "hangs". This would occur when either the On-Prem or Azure IaaS firewalls failed over. I managed to prevent the issue occurring when the On-Prem firewalls failed over by adding the NFS destination port of 2049 into the user.def file on the management server, but we're still experiencing the issue in Azure. I was wondering if TAC had given you any suggestions?

We have a similar architecture to yourself:

On-Prem - 15600 Quantum running R80.40 T120

Azure - CloudGuard running R80.40 T120

Thanks,

Aaron.

Mark_Shaw · ‎2022-01-21

Good morning

Firstly, apologies for the lack of response. Our issue has not gone away. Last night the team upgraded the Azure gateways to R81 JHF T44 and, upon failover, over 100 NFS mounts failed to reconnect; the same happened when we failed back.

Have you had any further luck?

AaronCP · ‎2022-06-10

Hey @Mark_Shaw,

Still no luck! We have recently moved away from the CloudGuard solution to Azure firewall (we still have the on-prem Check Point), but we still get the issue.

If there is any interruption to the connection at the time the NFS share is mounted to the Linux VM (whether a blip in the VPN tunnel connectivity, or a gateway failover), we encounter the issue.

If we don't spot the issue quickly enough, the Linux VMs constantly try to remount their NFS share, which in turn causes the load average to increase dramatically. It then requires our Linux engineers to reboot the VMs overnight. This has happened every night this week!

AaronCP · ‎2022-06-15

Hi @Mark_Shaw,

I've been doing some research on this as our Linux engineers are rebooting servers of an evening on an almost daily basis. I found the following article online that explains the exact issue you & I are experiencing: https://www.suse.com/support/kb/doc/?id=000019722

The article references "Smart Connection Reuse" on "smart routers". I then located SK24960, which explains Check Point's Smart Connection Reuse mechanism in more detail. I also found this old CPUG article that details an NFS issue and Smart Connection Reuse: https://www.cpug.org/forums/showthread.php/19664-smart-connection-reuse-NFS-%28maybe-AIX-flavored-NF...

It seems that when a client mounts an NFS share, it will always use the same source port for the connection, even if the connection is interrupted. Unfortunately we don't always immediately know when a server has been subject to this issue until the VM fails to back up. Another indicator is the load average increases and a reboot of the VM is the only way for us to "resolve" the issue. The SUSE article also mentions that a reboot is a sure way of terminating the NFS connection from the client.

As per SK24960, I have set the following kernel parameter on the gateway: fw ctl set int fw_reuse_established_conn 2049 (as well as the SecureXL parameter for any accelerated traffic: fw ctl set int sim_reuse_established_conn 2049 -a).

The SUSE article recommends disabling the Smart Connection Reuse functionality on the firewall, but I'm reluctant to do this until I know what the ramifications are.

I will keep you posted with any progress.

AaronCP · ‎2022-06-20

Hey @Mark_Shaw,

Quick update for you.

Prior to making the change, my colleague in our Linux team logged onto one of our Azure Linux VMs and mounted an NFS share from our on-prem NFS server. I then disabled the Smart Connection Reuse on-the-fly by setting the following parameter: fw ctl set int fwconn_smart_conn_reuse 0 (note we did not notice any traffic disruption by disabling this feature - but this may not be the case for your environment). After that, we reset the VPN tunnel at both ends (our on-prem Check Point and Azure Network Gateway) to cause a disruption to the connectivity.

My colleague in our Linux team confirmed that his NFS share remounted immediately. I have checked the connections table via fw ctl conntab and grepped for the NFS client network & NFS server and I am no longer seeing any SYN_SENT in the tcp state field.

We are going to leave Smart Connection Reuse disabled for a couple of days and monitor the load averages on our Azure Linux VMs. If we experience any connectivity interruptions with our Azure networks we can review if this change has "fixed" the issue.

I will keep you posted 😊.

Thanks,

Aaron.

Alexander_Wilke · ‎2022-12-14

Hello,

We have the same issue and have discussed it with Check Point for months, if not years.

Fortunately we have now been promised a fix (not tested yet) that solves a bug in Smart Connection Reuse.

Example

client --> FW-A --> FW-B --> Server

1.) Client and server communicate with NFS and have an established session

2.) FW-B loses the connection state entry (failover or some other reason)

3.) FW-B drops packets in both directions (server to client and client to server)

4.) After some minutes the client tries to establish a new NFS session with a new 3-way handshake but the same source port.

5.) FW-A applies Smart Connection Reuse and modifies the SYN into an ACK because FW-A still has the state entry as "established". FW-B drops the modified ACK with "First packet isn't SYN".

Problem:

Smart Connection Reuse on FW-A resets the TCP idle timeout for the established session every time a new SYN arrives.

That's wrong. It must keep the idle timeout as it is unless there is a response to the modified ACK. To work around this, you have to stop the NFS client from sending new SYNs for as long as FW-A holds the "established" state entry, or delete the entry manually.

The fix will add a new kernel parameter to NOT refresh the idle timeout on an established session if a new SYN arrives.

This makes sure that an established session can time out when needed, so new connections on the same source port are not blocked forever but only until the idle timeout expires.

Fix IDs

R80.20SP - PRHF-26499
fw1_wrapper_HOTFIX_R80_20SP_T334_887_MAIN_GA_FULL

R80.40 - PRHF-26491
fw1_wrapper_HOTFIX_R80_40_JHF_T173_399_MAIN_GA_FULL

R81.10 - PRHF-26493
fw1_wrapper_HOTFIX_R81_10_JHF_T66_957_MAIN_GA_FULL

Parameter

syn_refresh_conn
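
The timer behaviour described above can be illustrated with a toy model. This is only a sketch and not Check Point code: the function name, the 3600-second idle timeout and the 60-second retry interval are assumptions chosen for illustration, and the refresh_on_syn flag merely stands in for whatever the new kernel parameter controls.

```python
def stale_entry_expiry(idle_timeout, retry_interval, refresh_on_syn, horizon):
    """Model one stale 'established' entry on FW-A (all times in seconds).

    The client retries a SYN from the same source port every retry_interval
    seconds. Returns the time at which the entry ages out, or None if it is
    still present after `horizon` seconds.
    """
    expires_at = idle_timeout              # entry was last refreshed at t=0
    t = retry_interval
    while t <= horizon:
        if t >= expires_at:
            return expires_at              # entry aged out before this SYN arrived
        if refresh_on_syn:
            expires_at = t + idle_timeout  # reported bug: each SYN resets the idle timer
        t += retry_interval
    return expires_at if expires_at <= horizon else None


if __name__ == "__main__":
    one_day = 24 * 3600
    print("refresh on SYN:   ", stale_entry_expiry(3600, 60, True, one_day))
    print("no refresh on SYN:", stale_entry_expiry(3600, 60, False, one_day))
```

In this model, a client that retries more often than the idle timeout keeps the stale entry alive indefinitely when every SYN refreshes the timer, whereas without the refresh the entry ages out after one idle timeout and the next SYN can open a fresh connection.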

PS:
However, we do not know why NFS connections break after a failover. We have "keep all connections open" enabled globally and on the services, but it is still not working reliably.

Chris_Atkinson · ‎2022-12-14

How is IPS configured assuming it's enabled?

Refer: https://community.checkpoint.com/t5/Management/prefer-security-prefer-connectivity/td-p/164916

CCSM R77/R80/ELITE

Alexander_Wilke · ‎2022-12-14

Hi,

On ClusterXL environments it is set to "prefer connectivity". On the 64k R80.20SP scalable platform I do not have this option in SmartConsole, probably because it is a "single object" or "standard gateway" and is not treated as a typical cluster. I don't know how 64k handles IPS connectivity if there is a failover within the same chassis (from one SGM to another) or between different chassis.

Jochen_Hoechner · ‎2022-06-21

Hi, did you try activating the kernel parameter fw_reject_non_syn 1 and performing a failover?

Best,
Jochen

Mark_Shaw · ‎2022-06-21

Hi
No, we have not tried this. Does it work?

We tried setting the following but still have a few issues:

fwconn_smart_conn_reuse=0
fw_reuse_established_conn=2049

Jochen_Hoechner · ‎2022-06-21

It works.
Smart Connection Reuse applies to 'active' connections. Idle connections in particular do not survive a failover, because the chance of an 'aged out' session is high. In this case, sending a TCP RST instead of a DROP for an 'out of state' connection will help.

This kernel parameter is not recommended on Scalable Platform (Maestro and Chassis) gateways.

Best,
Jochen

Mark_Shaw · ‎2022-06-21

We are running Maestro 😥

AaronCP · ‎2022-06-21

Hey @Mark_Shaw,

Did you see any improvement at all? What issues (if any) did it help with and what issues still remain?

AaronCP · ‎2022-07-21

Hey @Mark_Shaw,

I wanted to keep you posted on my progress with this issue.

We took the decision to disable the Smart Connection Reuse feature. Whilst it improved the situation a little, we were still getting the problem. Our engineers went from rebooting multiple servers daily to one or two.

After some further troubleshooting, I found some drops whilst running fw ctl zdebug + drop | grep (IP of NFS server) and I saw drops with the message "SYN on established connection". This led me to SK174323.

I followed the instructions in the SK (a maintenance window is required) and we've not had a single Linux VM reboot in the past 3 evenings! It may be too early to tell if this has resolved the issue, but I will provide further updates in the coming weeks.

FYI - this is the syntax I used in the user.def.FW1 file on the SMS:

deffunc allow_syn_estab_count_rst() { ((dst = x.x.x.x) and (dport = 2049)) };

Hope this information is of use to you 😊

Vladimir · ‎2022-07-21

Huh... Interesting. Do you think it is a problem of specific NFS implementation?