Solved: Connection failure on firewall failover - Out of s...

Muazzam · ‎2024-10-31

Hardware: 23500
OS: GAIA R81.10 Take 94
Active / Standby Setup
ThroughPut - Typical: 250Mbps

Many applications do not survive a cluster failover. They do not recover on their own; the only fix is to restart the application.

At the time of failover, we see hundreds of out-of-state packets, and the logs show "first packet isn't SYN" with PUSH-ACK flags.

When we fail over from member A to B, we do not see any traffic passing through member B until the app is restarted.
I checked the number of connections in the connections table, and for some IP addresses there is a big difference, for example 800 on the active member and 600 on the standby member. All traffic is TCP-based with no UDP component.

I am not sure, but I believe this started after we changed the clustering method from VRRP to ClusterXL. I may be wrong here.

Questions:
Is the difference in the number of connections in the connections table acceptable?
Could this be the issue explained in SK180253?
Is there a command to check whether the two firewalls are out of sync?

HeikoAnkenbrand · ‎2024-11-02

Hi @Muazzam,

@Timothy_Hall had already described the important points.

From a performance point of view, it makes more sense not to synchronize connections immediately in a cluster environment.

For example, with http/https I often set the value “start synchronizing 3 sec. after connection initiation” to a higher value than the “tcp start timer” in the “global properties”. This means that TCP sessions are only synchronized in the cluster once the SYN/SYN-ACK phase has been completed. The advantage is that sessions are not synchronized immediately in the event of a DDoS attack, which reduces the CPU load in a cluster environment during such an attack.

For example, in my opinion it makes no sense to synchronize DNS queries, as they are repeated after 2 or 4 seconds.

So, to your question: it is partly by design that the sessions are not synchronized immediately.
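
To make the effect of delayed synchronization concrete, here is a minimal Python sketch. It is a simplified model and not Check Point's actual sync logic: it assumes a connection reaches the standby member only once it has been open for `sync_delay` seconds. The names `Conn` and `synced_at_failover` and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Conn:
    name: str
    start: float  # seconds, when the connection was opened
    end: float    # seconds, when the connection was closed


def synced_at_failover(conns: List[Conn], failover_at: float,
                       sync_delay: float = 3.0) -> Dict[str, List[str]]:
    """Classify the connections that are open at failover time.

    Assumption (simplified model): a connection is known to the standby
    member only after it has been open for at least `sync_delay` seconds.
    """
    result: Dict[str, List[str]] = {"survives": [], "lost": []}
    for c in conns:
        if not (c.start <= failover_at < c.end):
            continue  # not open at failover time, so nothing to lose
        age = failover_at - c.start
        key = "survives" if age >= sync_delay else "lost"
        result[key].append(c.name)
    return result


# Hypothetical sample data: one long session, one short call, one old one.
conns = [
    Conn("long-lived session", start=0.0, end=600.0),
    Conn("short API call", start=99.0, end=101.0),
    Conn("already closed", start=10.0, end=20.0),
]
print(synced_at_failover(conns, failover_at=100.0))
```

In this model only connections younger than the delay are lost at failover, which matches the idea that long-lived sessions should normally survive while very short ones may not.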

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips


the_rock · ‎2024-10-31

It could be the SK, but I can't be 100% sure. To check the sync state, you can run the cluster commands below.

Andy

****************

cphaprob state

cphaprob -a if

cphaprob -i list

cphaprob -l list

cphaprob syncstat

JozkoMrkvicka · ‎2024-11-01

Blaming the firewall is logical since you see many drops. On the other hand, if the application was not designed according to the networking RFCs, the blame is on the application. If the application does not get a reply within a couple of seconds or packets, why is it not trying to re-establish the connection with a new 3-way handshake?

If you open a TAC case and mention that you are running Take 94, released in March 2023, they will most probably instruct you to update to the latest Take.

Kind regards,
Jozko Mrkvicka

the_rock · ‎2024-11-01

Or at least take 169, which is recommended.

Andy

Chris_Atkinson · ‎2024-11-01

After troubleshooting the issue sufficiently (to eliminate potential causes), you may wish to reject rather than drop out-of-state traffic, so that the applications can better understand that they should initiate a new connection. This may not be viable in all scenarios, however.

For more information on how to control the behavior of the gateway in this regard please see: sk60768
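
On the application side, the difference between a drop and a reject can be sketched as follows. This is a minimal, hypothetical Python client (the class name `ReconnectingClient` and the example host are assumptions, not taken from any application in this thread). A rejected packet arrives as a TCP RST and surfaces almost immediately as `ConnectionResetError`; a silently dropped packet only surfaces once the socket timeout expires. In both cases the client discards the stale socket and opens a new one, which triggers a fresh 3-way handshake.

```python
import socket
from typing import Callable, Optional


class ReconnectingClient:
    """Keeps one TCP connection and rebuilds it when it goes stale."""

    def __init__(self, connect: Callable[[], socket.socket],
                 timeout: float = 5.0) -> None:
        self._connect = connect
        self._timeout = timeout
        self._sock: Optional[socket.socket] = None

    def _discard(self) -> None:
        """Close and forget the current socket, if any."""
        if self._sock is not None:
            try:
                self._sock.close()
            finally:
                self._sock = None

    def request(self, payload: bytes, attempts: int = 2) -> bytes:
        """Send `payload` and return the reply, reconnecting on failure."""
        last_error: Optional[Exception] = None
        for _ in range(attempts):
            try:
                if self._sock is None:
                    self._sock = self._connect()  # new 3-way handshake
                    self._sock.settimeout(self._timeout)
                self._sock.sendall(payload)
                reply = self._sock.recv(4096)
                if reply:
                    return reply
                raise ConnectionResetError("peer closed the connection")
            except (ConnectionResetError, BrokenPipeError, socket.timeout) as exc:
                last_error = exc  # RST (reject) or timeout (drop)
                self._discard()
        raise ConnectionError(f"no reply after {attempts} attempts") from last_error


# Example wiring (hypothetical host and port, not executed here):
# client = ReconnectingClient(
#     lambda: socket.create_connection(("app.example.internal", 8443), timeout=5.0))
# client.request(b"ping")
```

With a reject the retry happens right away; with a drop the application waits for the full timeout first, which is one reason the reject behavior can make recovery faster.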

CCSM R77/R80/ELITE

Timothy_Hall · ‎2024-11-01

1) Find a log for a connection that was dropped as "out of state" upon failover and determine the service object that matched that particular connection in your rulebase. Open that service for editing and on the Advanced screen make sure that this box has not been unchecked for a selective synchronization setup (it is set by default on all services):

[Screenshot: selective sync.png]

2) Next if you are using the IPS blade, check this setting on your gateway/cluster object and ensure it has not been changed from the default of "prefer connectivity". If a connection is undergoing streaming inspection in the Medium or Firewall/F2F paths, it will be killed "out of state" upon failover if "prefer security" is set:

3) If you have a lot of rapid-fire, short-lived connections that don't exist for more than 3 seconds, they will be killed "out of state" upon failover when the "start synchronizing 3 seconds after connection initiation" setting is enabled on the service. If this is indeed the case, try disabling it and see if that helps, although this will increase the amount of sync traffic between the cluster members substantially.

4) Make sure your sync network is healthy and not struggling; look at the error counters for the Sync interface in the outputs of netstat -ni and cphaprob syncstat.

5) Beyond those, you'll need to run commands like fw tab -t connections -u -f and fw ctl conntab on both the active and standby members to determine which specific connections are not getting synced, which will hopefully lead to why.
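
The comparison in step 5 can be sketched in Python. This is only an illustration under stated assumptions: it presumes you have already extracted the connections from each member into 5-tuples yourself (the parsing of the command output is deliberately left out, since its exact format is not shown here). The names `Conn`, `missing_on_standby`, and `missing_by_service` are hypothetical, and the addresses are documentation ranges.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Conn(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    proto: str


def missing_on_standby(active: Iterable[Conn],
                       standby: Iterable[Conn]) -> Set[Conn]:
    """Connections present on the active member but not on the standby."""
    return set(active) - set(standby)


def missing_by_service(missing: Iterable[Conn]) -> Counter:
    """Count unsynced connections per (destination, port, protocol)."""
    return Counter((c.dst, c.dport, c.proto) for c in missing)


# Hypothetical sample data standing in for the two connection tables.
active = [
    Conn("192.0.2.10", 40001, "198.51.100.5", 1521, "tcp"),
    Conn("192.0.2.10", 40002, "198.51.100.5", 1521, "tcp"),
    Conn("192.0.2.11", 40003, "198.51.100.6", 443, "tcp"),
]
standby = [active[2]]

missing = missing_on_standby(active, standby)
for service, count in missing_by_service(missing).most_common():
    print(service, count)
```

Grouping the missing connections by destination and port points at the service object to inspect in step 1, since a single service with synchronization disabled or delayed would show up as one dominant entry.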

Attend my Gateway Performance Optimization R81.20 course
CET (Europe) Timezone Course Scheduled for July 1-2

AkosBakos · ‎2024-11-01

Hi @Muazzam

I have 2 lines in my local KB 🙂

I usually use this for a scheduled failover, e.g. maintenance.

They are the following:

#fw ctl set int fw_reject_non_syn 1

https://support.checkpoint.com/results/sk/sk60768

Have you tried this command?

And @Timothy_Hall 's explanation is the best! 🙂

Akos

----------------
\m/_(>_<)_\m/
