WesEvernden
Participant

Stuck entry in the FW state table

Hi,

We have a problem where a connection gets stuck in the FW state table and we can't get rid of it. I am looking for ideas on how to deal with this scenario.

The problem is between an NFS Client and the NFS Server. It occurs when we have some kind of network issue, like a switch reboot during a change window, or, last week, an IPS failing closed for 2 hours.

What we end up with is a connection in the state table in state DST_FIN. The sequence of events that gets us into this problem isn't 100% clear, aside from the network failure. Regardless, the connection in DST_FIN state is blocking NFS from recovering, so I have focused on that.

This is the setup: a Client and a Server with two FWs between them, both running R80.30. I'll use these names: Client --> FW_C --> FW_S --> Server.

After the network outage the Client has no connection to the Server and continuously tries to reconnect using the same source port as the original connection, port 1023. The Server is listening on port 2049. So the Client is sending SYN, SYN, SYN, SYN, SYN, SYN, RST … and then starts over.

Those packets arrive at FW_C, which has this state table entry: inbound, src=[Client,1023], dest=[Server,2049], 3600/3605, state=DST_FIN.

Here are my observations about what happens as the SYN and RST packets arrive at FW_C from the Client.

  1. The SYN is changed to an ACK (Smart Connection Reuse) and sent on. Confirmed with packet captures. The ACK arrives at FW_S, which drops it as out of state. FW_S doesn't send an RST when it drops the ACK.
  2. The SYN, or the newly minted ACK, resets the idle timer on the connection. Confirmed with multiple state table dumps; we haven't seen the idle time drop below 60s.
  3. The RST sent after six SYNs does not cause the connection table entry to be removed.

The pattern continues, last time for days.
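
For anyone who wants to reason about observation 2, here is a minimal Python sketch of the timer behaviour. It is a toy model, not Check Point's implementation: the function name, the `refresh_on_syn` switch and the 60-second retry interval are assumptions made for illustration. Only the 3600-second idle timeout comes from the state table entry above.

```python
def seconds_until_expiry(idle_timeout, retry_interval, horizon, refresh_on_syn=True):
    """Return the second at which the entry expires, or None if it outlives `horizon`.

    Toy model: the client sends one SYN every `retry_interval` seconds and
    nothing else ever touches the state table entry.
    """
    remaining = idle_timeout
    for now in range(1, horizon + 1):
        remaining -= 1                      # one second of idle time passes
        if remaining <= 0:
            return now                      # entry aged out of the table
        if refresh_on_syn and now % retry_interval == 0:
            remaining = idle_timeout        # the retry resets the idle timer
    return None


one_day = 24 * 3600
# Behaviour as observed: every retry resets the timer, so the entry never expires.
print(seconds_until_expiry(3600, 60, one_day))                        # None
# Hypothetical behaviour if retries did not refresh the timer.
print(seconds_until_expiry(3600, 60, one_day, refresh_on_syn=False))  # 3600
```

As long as the client retries more often than the idle timeout, the entry cannot age out on its own in this model, which matches the pattern lasting for days.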

I think disabling Smart Connection Reuse is the best option. Having said that, we haven't done that before and can't really say what the downsides might be in our environment.

Thoughts?

Thanks,

-Wes

0 Kudos
15 Replies
Wolfgang
Authority

@WesEvernden having two firewalls between source and destination with Smart Connection Reuse enabled can be very tricky to handle. In the past we had such a problem with communication between two web proxies. Changing the SYN to an ACK on gatewayA was followed by a drop on gatewayB with "out of state". This is normal behaviour; it is how it works.

The connection never timed out on gatewayA because the source proxy tried too fast to re-establish a connection on the same source and destination ports.

We checked more than once the timeout values on both gateways for the TCP connection states (start timeout, end timeout, session timeout, etc., as well as the timeouts of the service objects used in the matching rules). They must be equal on both gateways. Additionally, we disabled Smart Connection Reuse for port TCP/8080, which was our communication port for the proxy<=>proxy connection.

Be aware that you have to set these values twice, in the kernel and in SIM. Follow the instructions in the article titled: "Smart Connection Reuse" feature modifies some SYN packets

WesEvernden
Participant

Thanks for the feedback Wolfgang. 

0 Kudos
Alexander_Wilke
Advisor

Hi,

the problem here is not that the client is trying to re-establish the connection too fast. The problem is that the firewall FW_C is resetting the idle timeout back to 3600/3600 after every SYN from the client. This is not correct.

 

If the Smart Connection Reuse feature modifies a SYN to an ACK, it does so because it wants to check whether this SYN is valid or not. But this only works if the server responds. Depending on the server response, the firewall can make a decision:

a) delete the existing connection and allow a new connection based on the same src.port

b) keep the existing session and do not allow a new connection to be established

This is totally fine and secure, and it works, BUT ONLY if the server responds.

 

If the server is not able to respond because some device in between is dropping the packets, then the firewall cannot make any decision and must, for security reasons, keep the existing session.

Now here is the problem.

The SYN to ACK conversion resets the idle timeout on the existing session, and this is wrong. The SYN to ACK conversion feature exists to check with the server whether the connection is valid and whether the client is valid. If there is no response, the firewall MUST NOT touch anything of the existing connection.

 

Unfortunately, every time a SYN is converted to an ACK, the firewall resets the idle timer back to the configured default idle timeout of 3600/3600. The firewall must ignore the "SYN to ACK" packet for idle timeout resets. The timeout must stay as it is until there is a valid, verified packet for the existing connection.

 

As a result, the session in the firewall will time out in the regular way once idle reaches 0/3600. Then the client can re-establish the connection.

 

Whatever the client or an attacker is doing, and whatever the server or something in between is doing, the firewall MUST NOT touch an existing connection if the packet it receives is not fully verified and valid for this existing session.

 

TL;DR

If the Smart Connection Reuse feature converts a SYN to an ACK packet to verify whether the SYN is valid or not, the Smart Connection Reuse feature MUST NOT modify the idle timeout of the existing state table entry.
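
The decision logic argued for in this post can be written down as a small Python sketch. This is one reading of the post, not how the gateway is actually coded: the `Probe` outcomes, the `Entry` class and `handle_probe_result` are hypothetical names, and the choice to refresh the timer when the server confirms the old connection is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Probe(Enum):
    """Hypothetical outcomes of the SYN-to-ACK probe sent towards the server."""
    STALE = auto()        # server's reply shows the old connection is gone
    ALIVE = auto()        # server's reply shows the old connection is still valid
    NO_RESPONSE = auto()  # probe was dropped somewhere on the path


@dataclass
class Entry:
    idle_remaining: int      # seconds left before the entry ages out
    idle_timeout: int = 3600


def handle_probe_result(entry, result):
    """Return the entry to keep, or None when the entry should be deleted."""
    if result is Probe.STALE:
        return None                                # case a): delete, allow the new connection
    if result is Probe.ALIVE:
        entry.idle_remaining = entry.idle_timeout  # assumption: verified traffic may refresh
        return entry                               # case b): keep, refuse the new connection
    return entry                                   # no response: keep the entry untouched


entry = Entry(idle_remaining=1200)
kept = handle_probe_result(entry, Probe.NO_RESPONSE)
print(kept.idle_remaining)  # 1200, the unanswered probe did not reset the timer
```

With the third branch leaving the timer alone, an entry whose probes are never answered keeps counting down and eventually expires.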

0 Kudos
WesEvernden
Participant

Thanks Alexander. I agree completely. The idle timeout should not be reset.

0 Kudos
AaronCP
Advisor

Hi Wes,

 

Have you made any progress with this issue? We are experiencing the exact same issue and have had no luck resolving it. Our topology is: NFS Client | Azure Network Gateway | Check Point Cluster | NFS Server.

 

If we have an interruption to a connection whilst the NFS is mounted, the mount just "hangs" and will not clear unless the Azure VM is rebooted. We can see from packet captures that the NFS client is trying to re-establish the connection on the same source port, yet it fails.

 

We have disabled the Smart Connection Reuse feature, but this has not resolved the issue for us. We have also written an INSPECT function as per sk11088, but this has not helped either.

 

Would be fab if you could share any progress you've made with this!

 

Thanks,

 

Aaron.

0 Kudos
WesEvernden
Participant

Hi Aaron,

No progress. I did get a ticket opened with CP, thinking they might be able to add some clarity, but it went nowhere.

-Wes

0 Kudos
gynemeth
Explorer

Try sk174323, it helped us. Of course the dport has to be modified to 2049

0 Kudos
Alexander_Wilke
Advisor

Hi @gynemeth, can you provide an example of that solution?
As far as I understand from CheckPoint (Diamond Support), this does not work with R80.20SP in 64k environments.

@AaronCP 
CheckPoint always points to the application and does not really show interest in solving the issue on their side.
They said they developed a fix which prevents the firewall from refreshing the idle timeout when there is a modified SYN to ACK packet, BUT they said that this will cause performance issues.

Unfortunately they do not describe when we would notice these performance issues, or whether they would affect all connections and the complete gateway, only this specific connection, or only specific situations. I sent these questions to support and did not receive any useful feedback.

So from my perspective (R80.20SP + 64k) we cannot use sk174323 (I don't know why), and the provided fix will cause performance issues we cannot calculate, because no one describes why and what will happen.

0 Kudos
gynemeth
Explorer

Sorry,  our environment is different. We use R81.10 on 28000 HW.

0 Kudos
PhoneBoy
Admin

I suspect, based on the nature of the fix, the "performance issue" is that specific connection will not be templated by SecureXL.
This mostly impacts the amount of time it takes to fully establish the connection (e.g. complete the three way TCP handshake).
Unless you have a lot of these connections, I would not expect the performance impact of this to be substantial.
The connection itself (once established) should still be fully accelerated.

The above is my assessment and may be completely wrong.
One other thing to note: if your management is on a different version than your gateways, you need to modify the user.def file in the relevant backward compatibility directory. 

AaronCP
Advisor

Hey @Alexander_Wilke

 

What fix were you provided with?

0 Kudos
Alexander_Wilke
Advisor

@AaronCP 

They just told me they have a fix (no more details right now), but they did not explain to me why and when performance issues might appear. So I did not install it and did not request it.

 

But after the possible explanation from @PhoneBoy a few posts earlier, I re-requested the fix, along with details of the performance impact, the fix number, etc.

I also added our Success Manager and our personal Diamond Engineer to the e-mail to make sure we get all the requested information.

 

I also passed on the hint from @WesEvernden that other situations, such as a connection in DST_FIN state, could cause the same issue.

 

Regards

0 Kudos
AaronCP
Advisor

Hi @WesEvernden,

 

I have tried SK174323 but it's not helped. The syntax I'm using in the user.def.FW1 file on the Management Server is as follows:

 

deffunc allow_syn_estab_count_rst() { ((dst = a.b.c.d) and (dport = 2049)) };

 

Did you disable the Smart Connection Reuse feature? We currently have it disabled in our environment.

 

Thanks,

 

Aaron.

0 Kudos
WesEvernden
Participant

I took a look at sk174323. Looking more closely at the first paragraph of the Cause section, the first sentence is:

"When the Security Gateway encounters a TCP [SYN] packet which belongs to an already established connection, the TCP [SYN] packet refreshes the established connection's expiration time."

This doesn't match our experience: our connections are not in ESTABLISHED state. They are in DST_FIN state, and the packet still refreshes the connection expiration time.
 
-Wes
 
 
Cause
When the Security Gateway encounters a TCP [SYN] packet which belongs to an already established connection, the TCP [SYN] packet refreshes the established connection's expiration time. As a result of the connection state (established), the Security Gateway drops the packet with "SYN on established connection", without replying to the Client.

 

 

0 Kudos
Alexander_Wilke
Advisor

Hello,

 

we have the same issue and have discussed it with CheckPoint for months, if not years.

Fortunately, a fix has now been announced to us (not tested yet) that solves a bug in Smart Connection Reuse.

 

Example

client --> FW-A --> FW-B --> Server

1.) Client and server communicate over NFS and have an established session

2.) FW-B loses the connection state entry (failover and/or another reason)

3.) FW-B drops packets in both directions, server to client and client to server

4.) After some minutes the client tries to establish a new NFS session with a new 3-way handshake but the same source port.

5.) FW-A does "SmartConnReuse" and modifies "SYN to ACK" because FW-A has the state entry "established". FW-B drops the modified "ACK" with "First packet isn't SYN"
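
Steps 4 and 5 can be traced with a tiny Python model of the packet path. This is only an illustration of the sequence described above: the functions `fw_a`, `fw_b` and `send` are hypothetical, and each firewall is reduced to a single "has a state entry" flag.

```python
def fw_a(packet, has_entry):
    """FW-A: with an existing entry for the 5-tuple, a SYN is rewritten to an ACK."""
    if packet == "SYN" and has_entry:
        return "ACK"
    return packet


def fw_b(packet, has_entry):
    """FW-B: without a state entry, anything other than a SYN is dropped."""
    if not has_entry and packet != "SYN":
        return None, "First packet isn't SYN"
    return packet, "accepted"


def send(packet, a_has_entry, b_has_entry):
    """Pass one client packet through FW-A and then FW-B."""
    return fw_b(fw_a(packet, a_has_entry), b_has_entry)


# FW-A still has the old entry, FW-B lost it: the rewritten ACK is dropped.
print(send("SYN", a_has_entry=True, b_has_entry=False))   # (None, "First packet isn't SYN")
# Once FW-A's entry is gone as well, the SYN passes unchanged.
print(send("SYN", a_has_entry=False, b_has_entry=False))  # ('SYN', 'accepted')
```

In this model the new handshake can only get through once FW-A no longer holds the old entry.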

 

Problem:

Smart Connection Reuse on FW-A resets the TCP idle timeout for the established session every time there is a new SYN.

That's wrong. It must keep the idle timeout as it is unless there is a response to the modified ACK. To work around this, you have to stop the NFS client from sending new SYNs for as long as FW-A has the "established" state entry, or you have to delete the entry manually.

 

The fix introduces a new kernel parameter so that the idle timeout on an established session is NOT refreshed when a new SYN arrives.

This makes sure that an established session can time out if needed and does not block new connections on the same source port forever, only until the idle timeout runs out.

 

Fix IDs

R80.20SP - PRHF-26499
fw1_wrapper_HOTFIX_R80_20SP_T334_887_MAIN_GA_FULL

R80.40 - PRHF-26491
fw1_wrapper_HOTFIX_R80_40_JHF_T173_399_MAIN_GA_FULL

R81.10 - PRHF-26493
fw1_wrapper_HOTFIX_R81_10_JHF_T66_957_MAIN_GA_FULL

 

Parameter

syn_refresh_conn

 

PS:
However, we do not know why NFS connections break after a failover. We have "keep all connections open" enabled globally and on the services, but it still does not work reliably.

0 Kudos
