@Mark_Shaw interesting that the TP null profile seemed to help with "soft" failovers but not "hard" ones.
So someone else saw my response here and asked me if there were any updates, as they are seeing a similar issue with NFS and failovers. Here is my reply to them and I hope it is helpful, please note I am making some serious educated guesses here and doubt the following is 100% correct:
After thinking about this, I can think of two main things that are different between a "soft failover" (clusterXL_admin down) vs. a "hard failover" (rebooting or pulling power) on the active member.
1) There is a 2-2.5 second "dead timer" interval that happens with a hard failover where all traffic is lost/stalls, this usually manifests itself as losing 1-2 continuous ping packets. A soft failover is nearly instantaneous with the new member not needing to wait the dead interval to take over. In my experience a 2-2.5 second interruption in traffic during a hard failover is not enough to break stuff.
2) With a soft failover, state sync is maintained continuously between the two members as they trade active/standby roles. With a hard failover it stops suddenly and some sync updates may be lost, this is most likely to affect traffic that is very chatty which NFS certainly is. Earlier they mentioned having to clear entries in the connection table to get things working again on the newly-active member which tracks with this theory, and that they didn't have to clear connections with a TP null profile in place after a soft failover.
It is almost like some sync updates were lost during the hard failover, and the difference in some kind of sequence number or perhaps an anti-replay feature kills the the NFS traffic through the newly-active member, as the difference is outside some kind of acceptable window for inspection. This seems much more likely than scenario #1.
Suggestions:
1) Assuming the null TP profile is in place, try adding an Inspection Settings exception for the NFS hosts for Protection "Any". This is very different from a null TP profile. If the NFS traffic is fragmented, it could be running afoul of the IP Fragments Inspection Settings protection or possibly some of the TCP ones like Invalid Retransmission, Out of Sequence, etc.
2) It might be interesting to use fast_accel to fully accelerate the NFS traffic inside SecureXL and see if that impacts the NFS issue upon hard or soft failover. Because there is very little sync'ing of state between SecureXL/sim on cluster members, there may be more relaxed inspection windows present there or even some kind of recovery mechanism for handling this situation.
Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com