Mark_Shaw
Explorer

NFS Issues

Do Check Point firewalls and Unix NFS mounts generally not work well together?

 

The reason I ask is that we continually see issues whenever we have to fail over or reboot our firewalls: it breaks NFS connections between hosts and we have to manually clear the connections table, which involves converting decimal to hex, etc., so it's not ideal. This has been happening with our on-prem firewalls running Maestro and R80.20SP for a while. Well, I say a while; the issue disappeared and has now reared its head again. We have a case open with TAC and they have a load of debug data.

 

Now the issue has occurred with our Azure IaaS firewalls. We had a memory spike yesterday, which we think is down to us under-speccing the VMs, but it broke some NFS connectivity, and last night, when we added more memory to the VMs and installed the latest ongoing HFA, it broke NFS again when failing the firewalls over. Again we had to manually clear the connections table.

It just seems strange that we have seen the issue on two different firewall architectures:

On-Prem – Running Maestro and R80.20SP latest HFA

Azure IaaS – Running an IaaS HA cluster on R81, latest ongoing HFA (Take 29).

Timothy_Hall
Champion

Any chance the associated RPC protocol is using TCP for the transport instead of the standard UDP?  See here:

sk44982: TCP based RPC traffic is dropped by the Security Gateway.

Also what is the L4 protocol in use for NFS itself, TCP or UDP?  UDP itself is stateless of course, but the firewall still tracks "state" for that traffic for purposes of stateful inspection.  The Virtual Timeout enforced for NFS over UDP is quite short compared to TCP and may be part of your problem.  If possible can you check the L4 transport for your NFS mounts and see if there is any difference in the problematic behavior between TCP and UDP mounts?  If only UDP is in use, can you create a test mount with TCP and see if it experiences the same issues?
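If it helps, here is a quick way to check that on the Linux clients. This is just a rough sketch (it assumes a normal Linux NFS client with a readable /proc/mounts) that lists each NFS mount and the transport it negotiated, so you can compare TCP and UDP mounts during failover testing:

```python
#!/usr/bin/env python3
# Rough sketch: report the transport protocol (proto=tcp/udp) negotiated for
# each NFS mount on a Linux client by parsing /proc/mounts.

def nfs_transports(mounts_file="/proc/mounts"):
    """Return (mountpoint, fstype, proto) for every nfs/nfs4 mount."""
    results = []
    with open(mounts_file) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype in ("nfs", "nfs4"):
                opts = dict(
                    opt.split("=", 1) if "=" in opt else (opt, "")
                    for opt in options.split(",")
                )
                results.append((mountpoint, fstype, opts.get("proto", "unknown")))
    return results

if __name__ == "__main__":
    for mountpoint, fstype, proto in nfs_transports():
        print(f"{mountpoint}  {fstype}  proto={proto}")
```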

The fact that this seems to happen upon failover suggests that not everything is being sync'ed between the members; this is expected behavior to some degree as any inspection operations occurring in process-based security servers on the firewall are generally not synced between the cluster members. 

Threat Prevention does perform some of its operations in security server processes, whose inspection state will be lost upon failover.  So the next thing I would suggest, if you are using any of the 5 TP blades, is to create a "null" TP profile rule at the top of your TP policy (the null profile has all 5 TP blades unchecked) and apply it to all traffic to and from your NFS server in the Protected Scope.  Once that is in place, see if the failover situation for NFS traffic improves; if it does, you may need to tweak your TP policy and how it handles NFS traffic, as leaving the null profile in place indefinitely is not a good idea.

New 2021 IPS/AV/ABOT Immersion Self-Guided Video Series
now available at http://www.maxpowerfirewalls.com
Mark_Shaw
Explorer

Hi Timothy

Many thanks for your reply.

All of the NFS mounts are using TCP/2049 for the connections.

Tonight we implemented the null TP policy with the NFS servers in the Protected Scope. We performed a failover by issuing clusterXL_admin down, and everything was fine when we failed to/from the primary gateway.

We then performed a hard reboot of the primary firewall, and this is when things broke again: NFS mounts for three hosts stopped working. We had to convert the IP addresses and decimal port numbers to hex to clear the connections table.
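For what it's worth, the decimal-to-hex step itself can be scripted rather than done by hand. Below is a minimal sketch (the addresses are hypothetical, and it only does the individual value conversions; the exact layout of a connections table entry is not reproduced here):

```python
#!/usr/bin/env python3
# Minimal sketch: convert an IPv4 address and a port/protocol number to the
# zero-padded hex form used when searching the gateway's connections table,
# to help locate a specific NFS connection before clearing it.
# The addresses below are hypothetical examples.

import ipaddress

def ip_to_hex(ip: str) -> str:
    """e.g. '10.20.30.40' -> '0a141e28'."""
    return format(int(ipaddress.IPv4Address(ip)), "08x")

def num_to_hex(value: int) -> str:
    """e.g. port 2049 -> '00000801', protocol 6 (TCP) -> '00000006'."""
    return format(value, "08x")

if __name__ == "__main__":
    print("client :", ip_to_hex("10.20.30.40"))
    print("server :", ip_to_hex("10.20.30.50"))
    print("port   :", num_to_hex(2049))
    print("proto  :", num_to_hex(6))
```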

Any further ideas? We're not getting anywhere quickly with TAC, and the Unix guys want me to look at the Azure firewalls instead.

Timothy_Hall
Champion

@Mark_Shaw interesting that the TP null profile seemed to help with "soft" failovers but not "hard" ones.

So someone else saw my response here and asked me if there were any updates, as they are seeing a similar issue with NFS and failovers.  Here is my reply to them, which I hope is helpful; please note I am making some serious educated guesses here and doubt the following is 100% correct:

After thinking about this, I can think of two main things that are different between a "soft failover" (clusterXL_admin down) vs. a "hard failover" (rebooting or pulling power) on the active member.

1) There is a 2-2.5 second "dead timer" interval during a hard failover in which all traffic is lost/stalls; this usually manifests itself as losing 1-2 consecutive ping packets. A soft failover is nearly instantaneous, with the new member not needing to wait out the dead interval before taking over. In my experience a 2-2.5 second interruption in traffic during a hard failover is not enough to break things.

2) With a soft failover, state sync is maintained continuously between the two members as they trade active/standby roles. With a hard failover it stops suddenly and some sync updates may be lost; this is most likely to affect traffic that is very chatty, which NFS certainly is. Earlier they mentioned having to clear entries in the connections table to get things working again on the newly-active member, which tracks with this theory, and that they didn't have to clear connections with a TP null profile in place after a soft failover.

It is almost like some sync updates were lost during the hard failover, and the difference in some kind of sequence number, or perhaps an anti-replay feature, kills the NFS traffic through the newly-active member, as the difference is outside some kind of acceptable window for inspection. This seems much more likely than scenario #1.

Suggestions:

1) Assuming the null TP profile is in place, try adding an Inspection Settings exception for the NFS hosts for Protection "Any". This is very different from a null TP profile. If the NFS traffic is fragmented, it could be running afoul of the IP Fragments Inspection Settings protection or possibly some of the TCP ones like Invalid Retransmission, Out of Sequence, etc.

2) It might be interesting to use fast_accel to fully accelerate the NFS traffic inside SecureXL and see if that impacts the NFS issue upon hard or soft failover. Because there is very little sync'ing of state between SecureXL/sim on cluster members, there may be more relaxed inspection windows present there or even some kind of recovery mechanism for handling this situation.

 

New 2021 IPS/AV/ABOT Immersion Self-Guided Video Series
now available at http://www.maxpowerfirewalls.com
AaronCP
Contributor

Hey @Mark_Shaw,

 

We're experiencing a very similar issue with our NFS shares, too. We have multiple Linux VMs running in Azure, and if the NFS share is mounted at the time of a failover, the NFS share "hangs". This occurs when either the On-Prem or Azure IaaS firewalls fail over. I managed to prevent the issue occurring when the On-Prem firewalls failed over by adding the NFS destination port of 2049 to the user.def file on the management server, but we're still experiencing the issue in Azure. I was wondering if TAC had given you any suggestions?

 

We have a similar architecture to yours:

 

On-Prem - 15600 Quantum running R80.40 T120

Azure - CloudGuard running R80.40 T120

 

Thanks,

 

Aaron.

Mark_Shaw
Explorer

Good morning

Firstly, apologies for the lack of response. Our issue has not gone away: last night the team upgraded the Azure gateways to R81 JHF T44, and upon failover over 100 NFS mounts failed to reconnect, and the same happened when we failed back.

 

Have you had any further luck?
