Muazzam
Contributor

Connection failure on firewall failover - Out of state packets

Hardware: 23500
OS: GAIA R81.10 Take 94
Active / Standby Setup
Throughput (typical): 250 Mbps


Many applications do not survive a cluster failover. They do not recover; the only solution is to restart the application.

At the time of failover, we see hundreds of out-of-state packets, and the logs show "first packet isn't SYN" with PSH-ACK flags.

When we fail over from member A to B, we do not see any traffic passing through member B unless the app is restarted.
We checked the number of connections in the connections table, and for some IP addresses there is a big difference, for example 800 on the active member and 600 on the standby member. All traffic is TCP-based, with no UDP component.

I am not sure, but I believe this started after we changed the clustering method from VRRP to ClusterXL. I may be wrong here.

Questions:
Is the difference in the number of connections in the connections table acceptable?
Could this be the issue explained in SK180253?
Is there a command to check whether the two firewalls are out of sync?


Accepted Solutions
HeikoAnkenbrand
Champion

 

Hi @Muazzam,

@Timothy_Hall has already described the important points.

From a performance point of view, it makes more sense not to synchronize connections immediately in a cluster environment.

For example, with http/https I often set “start synchronizing 3 sec. after connection initiation” to a value higher than the “tcp start timer” in the “global properties”. This means that TCP sessions are only synchronized in the cluster once the SYN/SYN-ACK phase has been completed. The advantage is that, in the event of a DDoS attack, the sessions are not synchronized immediately.
This reduces CPU load in the cluster during such an attack.

For example, in my opinion it makes no sense to synchronize DNS queries, as they are repeated after 2 or 4 seconds.

So, to your question: it is partly by design that sessions are not synchronized immediately.
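
To make the effect of the delay concrete, here is a toy model in Python. It is not Check Point code: the `Conn` class, the `state_at_failover` function and the sample timings are all invented for illustration. It assumes a connection is handed to the standby member only once it has been open for the full delay, and uses the 3-second default mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conn:
    """Hypothetical connection record; times are seconds on one shared clock."""
    name: str
    start: float
    end: Optional[float] = None  # None means the connection is still open


def state_at_failover(conn: Conn, failover_at: float, sync_delay: float = 3.0) -> str:
    """Classify a connection at the moment of failover (simplified model)."""
    if conn.start > failover_at:
        return "not started"
    if conn.end is not None and conn.end <= failover_at:
        return "already closed"
    # Assumption: the standby learns about a connection only after the delay.
    if conn.start + sync_delay <= failover_at:
        return "synced"
    return "not synced"


if __name__ == "__main__":
    failover_at = 10.0
    samples = [
        Conn("long-lived app session", start=0.0),
        Conn("short lookup", start=8.0, end=8.2),
        Conn("session opened just before failover", start=9.0),
    ]
    for conn in samples:
        print(f"{conn.name}: {state_at_failover(conn, failover_at)}")
```

In this model only the third sample is at risk: it is still open at failover but younger than the delay, so the new active member would see its next packet as out of state.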

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips


8 Replies
the_rock
Legend

It could be the SK, but I can't be 100% sure. To check the sync state, you can run the cluster commands below.

Andy

****************

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat

 

JozkoMrkvicka
Authority

Blaming the firewall is logical, since you see many drops. On the other hand, if the application was not designed according to the networking RFCs, the blame is on the application. If the application didn't get a reply within a couple of seconds/packets, why is it not trying to re-establish the connection using a new 3-way handshake?

If you open a TAC case and mention that you are running T94, released in March 2023, they will most probably instruct you to update to the latest Take.

Kind regards,
Jozko Mrkvicka
the_rock
Legend

Or at least Take 169, which is the recommended one.

Andy

Chris_Atkinson
Employee

After troubleshooting the issue sufficiently (to eliminate potential causes), you may wish to reject rather than drop out-of-state traffic, so that the applications can better understand that they should initiate a new connection. However, this may not be viable in all scenarios.

For more information on how to control the behavior of the gateway in this regard please see: sk60768
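
From the application side, a reject typically shows up as a connection reset on the next send, which a client can treat as a cue to reconnect. Below is a minimal Python sketch of that idea. It is not taken from sk60768 or from anyone's actual setup: `send_or_reconnect`, the `connect` factory and the `FakeSocket` used in the demo are hypothetical names.

```python
import socket

# Errors treated here as "this TCP connection is gone; open a new one".
_DEAD_CONNECTION = (ConnectionResetError, BrokenPipeError, socket.timeout)


def send_or_reconnect(sock, connect, payload, max_reconnects=2):
    """Send payload; on a reset or timeout, reconnect via connect() and retry.

    Returns (socket that carried the payload, number of reconnects needed).
    """
    for reconnects in range(max_reconnects + 1):
        try:
            sock.sendall(payload)
            return sock, reconnects
        except _DEAD_CONNECTION:
            sock.close()
            if reconnects == max_reconnects:
                raise  # give up and surface the last error to the caller
            sock = connect()  # new socket means a new 3-way handshake
    raise ValueError("max_reconnects must be >= 0")


class FakeSocket:
    """Stand-in for a real socket so the sketch runs without a network."""

    def __init__(self, stale):
        self.stale = stale
        self.sent = []
        self.closed = False

    def sendall(self, data):
        if self.stale:
            raise ConnectionResetError("connection reset by peer")
        self.sent.append(data)

    def close(self):
        self.closed = True


if __name__ == "__main__":
    old = FakeSocket(stale=True)  # pretend this session predates the failover
    sock, reconnects = send_or_reconnect(old, lambda: FakeSocket(stale=False), b"ping")
    print("reconnects needed:", reconnects)
```

With a silent drop there is no reset, so a client usually only notices through a timeout; that is why a socket timeout is also treated as a dead connection in this sketch.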

CCSM R77/R80/ELITE
Timothy_Hall
Legend

1) Find a log for a connection that was dropped as "out of state" upon failover and determine the service object that matched that particular connection in your rulebase.  Open that service for editing and on the Advanced screen make sure that this box has not been unchecked for a selective synchronization setup (it is set by default on all services):

selective sync.png

 

2) Next if you are using the IPS blade, check this setting on your gateway/cluster object and ensure it has not been changed from the default of "prefer connectivity".  If a connection is undergoing streaming inspection in the Medium or Firewall/F2F paths, it will be killed "out of state" upon failover if "prefer security" is set:

prefer.png

 

3) If you have a lot of rapid-fire, short-lived connections that don't exist for more than 3 seconds, they will be killed "out of state" upon failover with this default setting.  If this is indeed the case, try disabling it and see if that helps, although this will substantially increase the amount of sync traffic between the cluster members:

delayed.png

 

4) Make sure your sync network is healthy and not struggling: look at the error counters for the Sync interface in the outputs of netstat -ni and cphaprob syncstat.

5) Beyond those you'll need to run commands like fw tab -t connections -u -f and fw ctl conntab on both the active and standby to determine which specific connections are not getting sync'ed which will hopefully lead to why.
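
Once both tables are exported, the comparison in step 5 amounts to a set difference. Here is a minimal Python sketch, assuming you have already reduced each member's output to (source, source port, destination, destination port, protocol) tuples. Parsing the raw command output is deliberately left out, and the `ConnKey` type, the function names and the sample addresses are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class ConnKey(NamedTuple):
    """Hypothetical 5-tuple identifying one connection."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: str


def missing_on_standby(active: Iterable[ConnKey], standby: Iterable[ConnKey]) -> Set[ConnKey]:
    """Connections the active member knows about but the standby does not."""
    return set(active) - set(standby)


def missing_by_source(active: Iterable[ConnKey], standby: Iterable[ConnKey]) -> Counter:
    """Count unsynchronized connections per source IP to spot a pattern."""
    return Counter(conn.src for conn in missing_on_standby(active, standby))


if __name__ == "__main__":
    # Invented sample data, not real firewall output.
    active = [
        ConnKey("10.0.0.5", 40001, "10.1.0.9", 443, "tcp"),
        ConnKey("10.0.0.5", 40002, "10.1.0.9", 443, "tcp"),
        ConnKey("10.0.0.7", 51000, "10.1.0.9", 1521, "tcp"),
    ]
    standby = [
        ConnKey("10.0.0.5", 40001, "10.1.0.9", 443, "tcp"),
    ]
    for src, count in missing_by_source(active, standby).most_common():
        print(f"{src}: {count} connection(s) missing on standby")
```

Grouping by destination port instead of source IP is a one-line change and may point more directly at the service object to check in step 1.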

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
AkosBakos
Leader

Hi @Muazzam 

I have 2 lines in my local KB 🙂

I usually use this for scheduled failovers, e.g. maintenance.

They are the following:

#fw ctl set int fw_reject_non_syn 1

https://support.checkpoint.com/results/sk/sk60768

Have you tried this command?

And @Timothy_Hall's explanation is the best! 🙂

Akos

 

----------------
\m/_(>_<)_\m/
Timothy_Hall
Legend

Fully agreed @HeikoAnkenbrand, the 3-second sync delay has solved most of the sync network health issues encountered in the past and is a good default setting performance-wise.  Sync network bandwidth jumping 10X from 100Mbps to 1Gbps certainly helped too.
