Connection failure on firewall failover - Out of state packets
Hardware: 23500
OS: GAIA R81.10 Take 94
Active / Standby Setup
Throughput (typical): 250 Mbps
Many applications do not survive a cluster failover. They do not recover; the only solution is to restart the application.
At the time of failover, we see hundreds of out-of-state packets, with logs showing "first packet isn't SYN" for packets carrying PUSH-ACK flags.
When we fail over from member A to member B, we do not see any traffic passing through member B until the app is restarted.
We checked the number of connections in the connections table, and for some IP addresses there is a big difference: for example, 800 on the active member and 600 on the standby member. All traffic is TCP-based, with no UDP component.
I am not sure, but I believe this started after we changed the clustering method from VRRP to ClusterXL. I may be wrong here.
Questions:
Is the difference in the number of connections in the connections table acceptable?
Could this be the issue explained in SK180253?
Is there a command to check whether the two firewalls are out of sync?
Accepted Solutions
Hi @Muazzam,
@Timothy_Hall has already described the important points.
From a performance point of view, it makes more sense not to synchronize connections immediately in a cluster environment.
For example, for http/https I often set the “start synchronizing 3 sec. after connection initiation” value higher than the “tcp start timer” in the “global properties”. This means that TCP sessions are only synchronized in the cluster once the SYN/SYN-ACK phase has been completed. The advantage is that sessions are not immediately synchronized in the event of a DDoS attack.
This reduces the CPU load in a cluster environment in the event of a DDoS attack.
Similarly, in my opinion it makes no sense to synchronize DNS queries, as they are repeated after 2 or 4 seconds.
So, to your question: it is partly by design that sessions are not synchronized immediately.
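To make the trade-off concrete, here is a toy model of a delayed-sync policy. It is not Check Point code: it simply assumes a connection becomes known to the standby member only once it has been open for at least `sync_delay` seconds. The `Conn` type, the function name, and the sample timings are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conn:
    name: str
    opened_at: float  # seconds, when the connection started
    closed_at: float  # seconds, when the connection ended


def known_to_standby(conns, failover_at, sync_delay=3.0):
    """Split the connections open at failover time into (synced, lost)."""
    synced, lost = [], []
    for c in conns:
        if not (c.opened_at <= failover_at < c.closed_at):
            continue  # not open at the moment of failover
        age = failover_at - c.opened_at
        (synced if age >= sync_delay else lost).append(c.name)
    return synced, lost


# Hypothetical sample traffic, failover at t = 100 s.
conns = [
    Conn("long-lived app session", 0.0, 600.0),
    Conn("short HTTP request", 99.0, 101.0),
    Conn("fresh session", 98.5, 400.0),
]
synced, lost = known_to_standby(conns, failover_at=100.0)
print("synced:", synced)
print("lost:", lost)
```

In this simplified model only connections younger than the delay are at risk at failover. If long-lived application sessions are the ones dying, the delay alone is unlikely to be the whole explanation, and the other causes discussed in this thread are worth checking.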
It could be the SK, but I can't be 100% sure. To check the sync state, you can run the cluster commands below.
Andy
****************
cphaprob state
cphaprob -a if
cphaprob -i list
cphaprob -l list
cphaprob syncstat
Blaming the firewall is logical, since you see many drops. On the other hand, if the application was not designed according to the networking RFCs, the blame is on the application. If the application does not get a reply within a couple of seconds or packets, why is it not trying to re-establish the connection with a new 3-way handshake?
If you open a TAC case and mention that you are running Take 94, released in March 2023, they will most probably instruct you to update to the latest Take.
Jozko Mrkvicka
Or at least Take 169, which is the recommended one.
Andy
After troubleshooting the issue sufficiently (to eliminate potential causes), you may wish to reject rather than drop out-of-state traffic, so that the applications can better understand that they should initiate a new connection. This may not be viable in all scenarios, however.
For more information on how to control the behavior of the gateway in this regard, please see: sk60768
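A rough sketch of why dropping and rejecting can feel so different to an application. When out-of-state packets are silently dropped, the sender's TCP stack keeps retransmitting with exponential backoff before it gives up, whereas a reject is seen right away. The numbers below are assumptions that do not come from this thread: they follow the documented Linux defaults (minimum RTO 200 ms, maximum RTO 120 s, `tcp_retries2` = 15). Other operating systems, tuned hosts, and applications with their own timeouts will behave differently.

```python
def give_up_after(retries=15, rto_min=0.2, rto_max=120.0):
    """Seconds until a sender abandons a connection whose packets are all
    silently dropped: one timeout for the original packet plus one per
    retransmission, doubling each time and capped at rto_max."""
    total, rto = 0.0, rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


silent_drop = give_up_after()
print(f"silent drop: about {silent_drop:.1f} s before the stack gives up")
# With the assumed Linux defaults this prints about 924.6 s (roughly 15 minutes).
```

Under these assumptions an application that relies purely on the TCP stack could hang for many minutes after a failover, which matches the "only a restart helps" symptom. A reject gives it a chance to reconnect immediately, provided the application handles a reset connection gracefully.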
1) Find a log for a connection that was dropped as "out of state" upon failover and determine which service object matched that particular connection in your rulebase. Open that service for editing and, on the Advanced screen, make sure that the connection synchronization box has not been unchecked for a selective synchronization setup (it is set by default on all services).
2) Next, if you are using the IPS blade, check the connectivity/security preference on your gateway/cluster object and ensure it has not been changed from the default of "prefer connectivity". If a connection is undergoing streaming inspection in the Medium or Firewall/F2F paths, it will be killed "out of state" upon failover if "prefer security" is set.
3) If you have a lot of rapid-fire, short-lived connections that don't exist for more than 3 seconds, they will be killed "out of state" upon failover with the default delayed-synchronization setting (the "start synchronizing 3 sec. after connection initiation" option mentioned elsewhere in this thread). If this is indeed the case, try disabling it and see if that helps, although this will substantially increase the amount of sync traffic between the cluster members.
4) Make sure your sync network is healthy and not struggling: look at the error counters for the Sync interface in the outputs of netstat -ni and cphaprob syncstat.
5) Beyond those, you'll need to run commands like fw tab -t connections -u -f and fw ctl conntab on both the active and standby members to determine which specific connections are not getting synced, which will hopefully lead to why.
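For step 5, one way to narrow things down is to reduce both dumps to plain 5-tuples and diff them. The sketch below assumes you have already converted each member's output into lines of the form `src,sport,dst,dport,proto`. That intermediate format, the function names, and the sample addresses are hypothetical, and parsing the real `fw tab` or `fw ctl conntab` output is deliberately left out.

```python
from collections import Counter


def parse(lines):
    """Turn 'src,sport,dst,dport,proto' lines into a set of 5-tuples."""
    conns = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = tuple(f.strip() for f in line.split(","))
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got: {line!r}")
        conns.add(fields)
    return conns


def missing_on_standby(active_lines, standby_lines):
    """Connections present on the active member but absent on the standby."""
    return parse(active_lines) - parse(standby_lines)


def by_service(conns):
    """Count connections per (dport, proto) to spot the affected service."""
    return Counter((c[3], c[4]) for c in conns)


# Hypothetical sample data, one list per cluster member.
active = [
    "10.0.0.5,40001,10.1.1.10,443,tcp",
    "10.0.0.5,40002,10.1.1.10,443,tcp",
    "10.0.0.6,51000,10.1.1.20,1521,tcp",
]
standby = ["10.0.0.5,40001,10.1.1.10,443,tcp"]

missing = missing_on_standby(active, standby)
for (dport, proto), n in by_service(missing).most_common():
    print(f"{proto}/{dport}: {n} connection(s) not on standby")
```

Grouping the missing connections by destination port points back to the service object to inspect in step 1. Keep in mind that very young connections may legitimately be absent from the standby because of the sync delay described in step 3.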
CET (Europe) Timezone Course Scheduled for July 1-2
Hi @Muazzam,
I have two lines on this in my local KB 🙂
I usually use this for scheduled failovers, e.g. maintenance.
They are the following:
#fw ctl set int fw_reject_non_syn 1
https://support.checkpoint.com/results/sk/sk60768
Have you tried this command?
And @Timothy_Hall's explanation is the best! 🙂
Akos
\m/_(>_<)_\m/
Fully agreed @HeikoAnkenbrand, the 3-second sync delay has solved most of the sync network health issues encountered in the past and is a good default setting performance-wise. Sync network bandwidth jumping 10X from 100Mbps to 1Gbps certainly helped too.
