Lots of First packet isn't SYN

Teddy_Brewski · ‎2025-02-04

Hello,

R81.20 Take 92 running on open servers.

We're experiencing a lot of First packet isn't SYN drops which also seem to affect legitimate traffic (HTTPS and RDP).

The TCP flag is almost always PUSH-ACK with occasional ACK.

It's not happening all the time, but only during peak hours, hence it seems to be linked to the load. Perhaps it's worth mentioning but it's also happening to the devices that act as proxies (for example Sophos ZTNA). Direct RDP connections from a server/workstation A to a server/workstation B work flawlessly, but connections from ZTNA gateways get occasionally interrupted (with First packet isn't SYN PUSH-ACK logged) and users experience disconnections and must reconnect.

From the application perspective, we see the following in the logs:

Reading from WebSocket failed. websocket: close 1006 (abnormal closure): unexpected EOF","time

The gateway has plenty of RAM (64GB) and CPU load is around 35-45%. The number of concurrent connections is around 25-30K during peak hours, with 55K configured as a maximum value.

What we tried so far:

- increase the timeout for RDP and HTTPS;
- disable "Smart Connection Reuse" feature;
- install the latest Take;
- fail-over to the standby and reboot.

Thank you in advance for any tips and hints!

Lloyd_Braun · ‎2025-02-04

The fixed 55k connection capacity cap makes me suspicious of aggressive aging. Are you seeing any indications of aggressive aging getting triggered, in the logs? The symptoms you describe might be indicative of shortened idle timeouts. 'fw ctl pstat' output might be helpful to see if you are maxing out connections occasionally, or check peak on 'fw tab -t connections -s'

Teddy_Brewski · ‎2025-02-04

Thank you @Lloyd_Braun

Note that I increased the maximum value yesterday to 55k, before it was 35k, so I can imagine aggressive aging was indeed happening. Right now it says it's enabled, but not active:

# fw ctl pstat

System Capacity Summary:
Memory used: 3% (1597 MB out of 47603 MB) - below watermark
Concurrent Connections: 52% (28625 out of 54900) - below watermark
Aggressive Aging is enabled, not active

Hash kernel memory (hmem) statistics:
Total memory allocated: 4991221760 bytes in 1218560 (4096 bytes) blocks using 1 pool
Total memory bytes used: 0 unused: 4991221760 (100.00%) peak: 80998672
Total memory blocks used: 0 unused: 1218560 (100%) peak: 24134
Allocations: 4238300187 alloc, 0 failed alloc, 4237687164 free

System kernel memory (smem) statistics:
Total memory bytes used: 6002284284 peak: 6070898176
Total memory bytes wasted: 16634909
Blocking memory bytes used: 16458184 peak: 19762280
Non-Blocking memory bytes used: 5985826100 peak: 6051135896
Allocations: 29909386 alloc, 0 failed alloc, 29901303 free, 0 failed free
vmalloc bytes used: 5979740264 expensive: no

Kernel memory (kmem) statistics:
Total memory bytes used: 1079372560 peak: 1144346904
Allocations: 4268204704 alloc, 0 failed alloc
4267585260 free, 0 failed free
External Allocations:
Packets: 69568, SXL: 48943407, Reorder: 0
Zeco: 0, SHMEM: 2120, Resctrl: 0
ADPDRV: 0, PPK_CI: 14033552, PPK_CORR: 0

Cookies:
1560053731 total, 0 alloc, 0 free,
604382 dup, 2000217892 get, 67806603 put,
2737444945 len, 845540128 cached len, 0 chain alloc,
0 chain free

Connections:
126859219 total, 54315325 TCP, 46211064 UDP, 26324361 ICMP,
8469 other, 432 anticipated, 124399 recovered, 28625 concurrent,
33910 peak concurrent

Fragments:
2568 fragments, 1214 packets, 0 expired, 0 short,
0 large, 0 duplicates, 0 failures

NAT:
33276078/0 forw, 29235265/0 bckw, 57748477 tcpudp,
4785754 icmp, 18047884-7404210 alloc

# fw tab -t connections -s
HOST NAME ID #VALS #PEAK #SLINKS
localhost connections 8158 30267 33736 90599

the_rock · ‎2025-02-04

Hey Teddy,

I would check values as per my screenshot below. Whenever I help people with such issues, this is literally the first thing I verify and I ALWAYS make sure bottom option is checked, I find it important.

Andy

Teddy_Brewski · ‎2025-02-04

Many thanks @the_rock -- I'll give it a try.

I always thought about the fixed value as a precaution measure. What happens if it gets saturated? I think we have enough RAM, but will it grow continuously until the gateway crashes?

the_rock · ‎2025-02-04

I had escalation engineer explained that to me while back and made total sense...he told me the way it works when its automatic is that gateway "decides" how the consumption (if you will) is utilized, so you definitely should not see saturation.

Hope that helps.

Andy

Chris_Atkinson · ‎2025-02-04

Have you manually set the connection limit or is it auto.

How many cores are assigned/licensed?

CCSM R77/R80/ELITE

Teddy_Brewski · ‎2025-02-04

Hi @Chris_Atkinson

It's set manually to 55k (I increased it yesterday from 35k).

The server is HP ProLiant DL360 Gen10, Xeon Gold 5122 @ 3.60GHz (4 cores), licensed to 4 cores.

Chris_Atkinson · ‎2025-02-04

Agree with the aggressive aging analysis here with the fixed value.

Have you had memory utilization issues in the past that has lead you to constrain the connection table?

CCSM R77/R80/ELITE

Teddy_Brewski · ‎2025-02-04

Thank you -- no issues with RAM, I always thought it was a precaution measure. In our case it is a standard gateway, but with VSX you can't even set it to automatic.

Chris_Atkinson · ‎2025-02-04

You should be good to increase it more in that case, if you choose to keep it manually set you will need to continue to periodically manage it to stay ahead of growth / demand to avoid issues such as this.

You are correct regarding VSX note the default there is now 50,000 as of R82.

CCSM R77/R80/ELITE

the_rock · ‎2025-02-04

@Teddy_Brewski

To add to all that @Chris_Atkinson told you, I can remember at least 6 customers who I helped with in the past for this problem and they all had number of connection manually set. As soon as we changed it to auto and set drop optimization option and installed the policy, it fixed the issue.

I cant guarantee same would happen in your case, but Im confident it would help.

Andy

Teddy_Brewski · ‎2025-02-04

I appreciate your help @the_rock and @Chris_Atkinson !

Changed to Automatic with Drop Optimization checked -- let's hope that was it!

the_rock · ‎2025-02-04

Great! I really hope it helps.

Btw, maybe in 1-2 hours time, run watch -d fw ctl pstat command and see what you get. Example in my R82 lab.

Andy

Every 2.0s: fw ctl pstat Tue Feb 4 11:01:50 2025

System Capacity Summary:
Memory used: 8% (1554 MB out of 17429 MB) - below watermark
Concurrent Connections: 17 (Unlimited)
Aggressive Aging is enabled, not active

Hash kernel memory (hmem) statistics:
Total memory allocated: 1824522240 bytes in 445440 (4096 bytes) blocks using
1 pool
Total memory bytes used: 0 unused: 1824522240 (100.00%) peak: 38
9180804
Total memory blocks used: 0 unused: 445440 (100%) peak: 9731
1
Allocations: 4111958150 alloc, 0 failed alloc, 4110732366 free

System kernel memory (smem) statistics:
Total memory bytes used: 3012513700 peak: 3271060328
Total memory bytes wasted: 30887124
Blocking memory bytes used: 29797016 peak: 30149436
Non-Blocking memory bytes used: 2982716684 peak: 3240910892
Allocations: 855986 alloc, 0 failed alloc, 843799 free, 0 failed free

Teddy_Brewski · ‎2025-02-05

Hi @the_rock

Here is how it looks in the morning:

# fw ctl pstat

System Capacity Summary:
Memory used: 4% (2242 MB out of 47603 MB) - below watermark
Concurrent Connections: 24928 (Unlimited)
Aggressive Aging is enabled, not active

Hash kernel memory (hmem) statistics:
Total memory allocated: 4991221760 bytes in 1218560 (4096 bytes) blocks using 1 pool
Total memory bytes used: 0 unused: 4991221760 (100.00%) peak: 139892588
Total memory blocks used: 0 unused: 1218560 (100%) peak: 40232
Allocations: 1704338159 alloc, 0 failed alloc, 1703788411 free

Not sure whether this is a side effect or a coincidence, but we've experienced a failover during the night due to the high CPU observed on a primary node:

[Wed Feb 5 03:00:29 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (85%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:00:51 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 22 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:00:52 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (84%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:01:12 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 20 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:01:13 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:01:23 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:01:24 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:01:35 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:02:03 2025] [fw4_1];check_other_machine_activity: Update state of member id 2 to DEAD, didn't hear from it since 476171.1 and now 476174.4
[Wed Feb 5 03:02:03 2025] [fw4_1];CLUS-216400-1: Remote member 2 (state STANDBY -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
[Wed Feb 5 03:02:05 2025] [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:02:08 2025] [fw4_1];CLUS-120207-1: Local probing has started on interface: bond2
[Wed Feb 5 03:02:08 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-120207-1: Local probing has started on interface: bond3
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-214802-1: Remote member 2 (state LOST -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-120207-1: Local probing has stopped on interface: bond3
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
[Wed Feb 5 03:02:14 2025] [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:02:18 2025] [fw4_1];CLUS-120207-1: Local probing has started on interface: bond3
[Wed Feb 5 03:02:18 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
[Wed Feb 5 03:02:26 2025] [fw4_1];CLUS-120207-1: Local probing has stopped on interface: bond3
[Wed Feb 5 03:02:26 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
[Wed Feb 5 03:02:26 2025] [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
[Wed Feb 5 03:02:32 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (98%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:02:43 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:24:29 2025] [fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (81%) on the local member increased above the configured threshold (80%).
[Wed Feb 5 03:25:43 2025] sched: RT throttling activated
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
[Wed Feb 5 03:26:39 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 45 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

Tried to find out exact numbers with 'cpview -t', but the closest one I got to was 03:00:36 with the CPU reaching 56%.

Interesting that it reported bond4 as down, although I couldn't find anything in the logs of the corresponding switch.

the_rock · ‎2025-02-05

What does cpu show from cpview?

Andy

Teddy_Brewski · ‎2025-02-05

Right now it's 35-40%, at 3am it was around 56%, but with 'cpview -t' I can only move +- 1 minute, and I don't see high CPU the minute before/after.

the_rock · ‎2025-02-05

You can do cpview -t, then press t and it will give an option to enter date/time.

Andy

Teddy_Brewski · ‎2025-02-05

Yep, exactly that. And when I enter 03:01:24 I see 56% CPU load, minute after/before it's back to 25-30%. I think it happened within seconds and couldn't be replayed.

the_rock · ‎2025-02-05

Got it. Is the same fw master as before or its flipped over now?

Andy

Teddy_Brewski · ‎2025-02-05

The same.

the_rock · ‎2025-02-05

K, perfect...so, how is the situation currently? Any better than before?

Andy

Teddy_Brewski · ‎2025-02-05

Right now it's fine, no failover is happening, however, all changes that we've done yesterday, unfortunately didn't help. We still experience RDP disconnections with 'First packet isn't SYN -- PUSH-ACK' messages logged, so it must be something else...

the_rock · ‎2025-02-05

Did you verify routing is correct? If so, maybe TAC case would be needed...

Andy

Teddy_Brewski · ‎2025-02-05

I did. Can't see anything wrong. All our routing is static, there is no packet loss between interfaces. The disconnections are happening occasionally and clearly linked to the load or something that triggers them, cause during the weekends or out of office hours it's not happening.

the_rock · ‎2025-02-05

Not sure if this setting might be related, but worth looking into..

Andy

Teddy_Brewski · ‎2025-02-05

It was set to Rematch connections. I will change it, although it's definitely happening without pushing the policy.

the_rock · ‎2025-02-05

K, thats fair. In such case, I doubt it would make any difference.

Andy

Chris_Atkinson · ‎2025-02-04

No worries, if you want to optimize connection table size things like tweaking the default DNS service timeout can be effective e.g.

R80.x Performance Tuning Tip - Connection Table - Check Point CheckMates

CCSM R77/R80/ELITE

spottex · ‎2025-02-04

Is the traffic traversing two firewalls to its destination by any chance?
It maybe Smart Connection Reuse feature keeping a closed session open on one of the firewalls - probably client side which then converts the SYN to an ACK which the second FW will drop as the session does not exist.

Are you a member of CheckMates?

Lots of First packet isn't SYN