Teddy_Brewski
Collaborator

Lots of "First packet isn't SYN" drops

Hello,

R81.20 Take 92 running on open servers.

We're experiencing a lot of "First packet isn't SYN" drops, which also seem to affect legitimate traffic (HTTPS and RDP).

The TCP flag is almost always PUSH-ACK with occasional ACK.

It's not happening all the time, only during peak hours, so it seems to be linked to the load. Perhaps it's worth mentioning that it also affects devices acting as proxies (for example, Sophos ZTNA). Direct RDP connections from a server/workstation A to a server/workstation B work flawlessly, but connections from the ZTNA gateways occasionally get interrupted (with "First packet isn't SYN" PUSH-ACK logged) and users experience disconnections and must reconnect.

From the application perspective, we see the following in the logs:

Reading from WebSocket failed. websocket: close 1006 (abnormal closure): unexpected EOF","time

The gateway has plenty of RAM (64GB) and CPU load is around 35-45%. The number of concurrent connections is around 25-30K during peak hours, with 55K configured as a maximum value.

What we tried so far:

- increase the timeouts for RDP and HTTPS;
- disable the "Smart Connection Reuse" feature;
- install the latest Take;
- fail over to the standby member and reboot.

Thank you in advance for any tips and hints!

Lloyd_Braun
Collaborator

The fixed 55k connection capacity cap makes me suspicious of aggressive aging. Are you seeing any indications in the logs of aggressive aging being triggered? The symptoms you describe could point to shortened idle timeouts. 'fw ctl pstat' output might help show whether you are occasionally maxing out connections, or check the peak in 'fw tab -t connections -s'.
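
For example, something like this (nothing exotic, adjust to your environment):

# fw ctl pstat | grep -A 3 "System Capacity Summary"
# fw tab -t connections -s

The first shows memory/connection usage against the configured limit and whether Aggressive Aging is currently active; the second shows the current (#VALS) and peak (#PEAK) entries in the connections table. If you want to catch the drops live, 'fw ctl zdebug + drop' piped through grep for one of the affected IPs will show the "First packet isn't SYN" reason as well, but mind the overhead during peak hours.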

Teddy_Brewski
Collaborator

Thank you @Lloyd_Braun 

Note that I increased the maximum value yesterday to 55k (before it was 35k), so I can imagine aggressive aging was indeed happening. Right now it says it's enabled, but not active:

# fw ctl pstat

System Capacity Summary:
Memory used: 3% (1597 MB out of 47603 MB) - below watermark
Concurrent Connections: 52% (28625 out of 54900) - below watermark
Aggressive Aging is enabled, not active

Hash kernel memory (hmem) statistics:
Total memory allocated: 4991221760 bytes in 1218560 (4096 bytes) blocks using 1 pool
Total memory bytes used: 0 unused: 4991221760 (100.00%) peak: 80998672
Total memory blocks used: 0 unused: 1218560 (100%) peak: 24134
Allocations: 4238300187 alloc, 0 failed alloc, 4237687164 free

System kernel memory (smem) statistics:
Total memory bytes used: 6002284284 peak: 6070898176
Total memory bytes wasted: 16634909
Blocking memory bytes used: 16458184 peak: 19762280
Non-Blocking memory bytes used: 5985826100 peak: 6051135896
Allocations: 29909386 alloc, 0 failed alloc, 29901303 free, 0 failed free
vmalloc bytes used: 5979740264 expensive: no

Kernel memory (kmem) statistics:
Total memory bytes used: 1079372560 peak: 1144346904
Allocations: 4268204704 alloc, 0 failed alloc
4267585260 free, 0 failed free
External Allocations:
Packets: 69568, SXL: 48943407, Reorder: 0
Zeco: 0, SHMEM: 2120, Resctrl: 0
ADPDRV: 0, PPK_CI: 14033552, PPK_CORR: 0

Cookies:
1560053731 total, 0 alloc, 0 free,
604382 dup, 2000217892 get, 67806603 put,
2737444945 len, 845540128 cached len, 0 chain alloc,
0 chain free

Connections:
126859219 total, 54315325 TCP, 46211064 UDP, 26324361 ICMP,
8469 other, 432 anticipated, 124399 recovered, 28625 concurrent,
33910 peak concurrent

Fragments:
2568 fragments, 1214 packets, 0 expired, 0 short,
0 large, 0 duplicates, 0 failures

NAT:
33276078/0 forw, 29235265/0 bckw, 57748477 tcpudp,
4785754 icmp, 18047884-7404210 alloc

 

# fw tab -t connections -s
HOST NAME ID #VALS #PEAK #SLINKS
localhost connections 8158 30267 33736 90599

the_rock
Legend

Hey Teddy,

I would check the values shown in my screenshot below. Whenever I help people with issues like this, it's literally the first thing I verify, and I ALWAYS make sure the bottom option is checked; I find it important.

Andy

Screenshot_1.png

Teddy_Brewski
Collaborator

Many thanks @the_rock -- I'll give it a try.

I always thought of the fixed value as a precautionary measure. What happens if it gets saturated? I think we have enough RAM, but will it grow continuously until the gateway crashes?

the_rock
Legend

I had an escalation engineer explain that to me a while back and it made total sense... he told me that when it's set to automatic, the gateway "decides" how the capacity (if you will) is utilized, so you definitely should not see saturation.

Hope that helps.

Andy

Chris_Atkinson
Employee

Have you manually set the connection limit, or is it automatic?

How many cores are assigned/licensed?

CCSM R77/R80/ELITE
Teddy_Brewski
Collaborator

Hi @Chris_Atkinson 

It's set manually to 55k (I increased it yesterday from 35k).

The server is an HP ProLiant DL360 Gen10 with a Xeon Gold 5122 @ 3.60GHz (4 cores), licensed for 4 cores.

Chris_Atkinson
Employee

Agree with the aggressive aging analysis here, given the fixed value.

Have you had memory utilization issues in the past that have led you to constrain the connection table?

 

CCSM R77/R80/ELITE
Teddy_Brewski
Collaborator

Thank you -- no issues with RAM; I always thought of it as a precautionary measure. In our case it is a standard gateway, but with VSX you can't even set it to automatic.

Chris_Atkinson
Employee

You should be good to increase it further in that case. If you choose to keep it manually set, you will need to continue to manage it periodically to stay ahead of growth/demand and avoid issues such as this.
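
If it helps, a rough way to trend it over time is a hypothetical one-liner like this (the log path is arbitrary):

# echo "$(date '+%F %T') $(fw tab -t connections -s | awk 'NR==2 {print "vals="$4" peak="$5}')" >> /var/log/conn_trend.log

Run from cron every few minutes, that gives you a simple history of current vs. peak connection counts to size the limit against.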

You are correct regarding VSX; note that the default there is now 50,000 as of R82.

CCSM R77/R80/ELITE
the_rock
Legend

@Teddy_Brewski 

To add to what @Chris_Atkinson told you, I can remember at least 6 customers I helped with this problem in the past, and they all had the number of connections set manually. As soon as we changed it to automatic, enabled the drop optimization option, and installed the policy, the issue was fixed.

I can't guarantee the same would happen in your case, but I'm confident it would help.

Andy

Teddy_Brewski
Collaborator

I appreciate your help @the_rock and @Chris_Atkinson !

Changed to Automatic with Drop Optimization checked -- let's hope that was it!

the_rock
Legend

Great! I really hope it helps.

Btw, maybe in 1-2 hours' time, run 'watch -d fw ctl pstat' and see what you get. Example from my R82 lab:

Andy

Every 2.0s: fw ctl pstat Tue Feb 4 11:01:50 2025


System Capacity Summary:
Memory used: 8% (1554 MB out of 17429 MB) - below watermark
Concurrent Connections: 17 (Unlimited)
Aggressive Aging is enabled, not active

Hash kernel memory (hmem) statistics:
Total memory allocated: 1824522240 bytes in 445440 (4096 bytes) blocks using 1 pool
Total memory bytes used: 0 unused: 1824522240 (100.00%) peak: 389180804
Total memory blocks used: 0 unused: 445440 (100%) peak: 97311
Allocations: 4111958150 alloc, 0 failed alloc, 4110732366 free

System kernel memory (smem) statistics:
Total memory bytes used: 3012513700 peak: 3271060328
Total memory bytes wasted: 30887124
Blocking memory bytes used: 29797016 peak: 30149436
Non-Blocking memory bytes used: 2982716684 peak: 3240910892
Allocations: 855986 alloc, 0 failed alloc, 843799 free, 0 failed free

Teddy_Brewski
Collaborator

Hi @the_rock 

Here is how it looks in the morning:

# fw ctl pstat

System Capacity Summary:
Memory used: 4% (2242 MB out of 47603 MB) - below watermark
Concurrent Connections: 24928 (Unlimited)
Aggressive Aging is enabled, not active

Hash kernel memory (hmem) statistics:
Total memory allocated: 4991221760 bytes in 1218560 (4096 bytes) blocks using 1 pool
Total memory bytes used: 0 unused: 4991221760 (100.00%) peak: 139892588
Total memory blocks used: 0 unused: 1218560 (100%) peak: 40232
Allocations: 1704338159 alloc, 0 failed alloc, 1703788411 free

Not sure whether this is a side effect or a coincidence, but we experienced a failover during the night due to high CPU observed on the primary node:

[Wed Feb 5 03:00:29 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (85%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:00:51 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 22 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:00:52 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (84%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:01:12 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 20 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:01:13 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:01:23 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:01:24 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (87%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:01:35 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 11 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:02:03 2025] [fw4_1];check_other_machine_activity: Update state of member id 2 to DEAD, didn't hear from it since 476171.1 and now 476174.4
[Wed Feb 5 03:02:03 2025] [fw4_1];CLUS-216400-1: Remote member 2 (state STANDBY -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
[Wed Feb 5 03:02:05 2025] [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:02:08 2025] [fw4_1];CLUS-120207-1: Local probing has started on interface: bond2
[Wed Feb 5 03:02:08 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-120207-1: Local probing has started on interface: bond3
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-214802-1: Remote member 2 (state LOST -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-120207-1: Local probing has stopped on interface: bond3
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
[Wed Feb 5 03:02:09 2025] [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
[Wed Feb 5 03:02:14 2025] [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond4 is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:02:18 2025] [fw4_1];CLUS-120207-1: Local probing has started on interface: bond3
[Wed Feb 5 03:02:18 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE ON
[Wed Feb 5 03:02:26 2025] [fw4_1];CLUS-120207-1: Local probing has stopped on interface: bond3
[Wed Feb 5 03:02:26 2025] [fw4_1];CLUS-120207-1: Local Probing PNOTE OFF
[Wed Feb 5 03:02:26 2025] [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
[Wed Feb 5 03:02:32 2025] [fw4_1];CLUS-220201-1: Starting CUL mode because CPU usage (98%) on the remote member 2 increased above the configured threshold (80%).
[Wed Feb 5 03:02:43 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
[Wed Feb 5 03:24:29 2025] [fw4_1];CLUS-120200-1: Starting CUL mode because CPU-02 usage (81%) on the local member increased above the configured threshold (80%).
[Wed Feb 5 03:25:43 2025] sched: RT throttling activated
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
[Wed Feb 5 03:26:19 2025] [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
[Wed Feb 5 03:26:39 2025] [fw4_1];CLUS-120202-1: Stopping CUL mode after 45 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

I tried to find the exact numbers with 'cpview -t', but the closest I got was 03:00:36, with the CPU reaching 56%.

It's interesting that it reported bond4 as down, although I couldn't find anything in the logs of the corresponding switch.
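
(For completeness, I'll also check the CCP/bond side from the gateway itself, e.g. with:

# cphaprob -a if
# cat /proc/net/bonding/bond4

to see whether the physical slave links actually flapped or only the Cluster Control Protocol packets were missed.)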

 

the_rock
Legend

What does the CPU show in cpview?

Andy

Teddy_Brewski
Collaborator

Right now it's 35-40%; at 3 AM it was around 56%, but with 'cpview -t' I can only move +/- 1 minute, and I don't see high CPU the minute before/after.

the_rock
Legend

You can run 'cpview -t', then press 't' and it will give you an option to enter a date/time.

Andy

Teddy_Brewski
Collaborator

Yep, exactly that. When I enter 03:01:24 I see 56% CPU load; a minute before/after it's back to 25-30%. I think it happened within seconds and can't be replayed.
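
To catch it next time, I'll probably log per-second CPU overnight with plain sar, something like (hypothetical file path):

# nohup sar -o /var/log/cpu_sar.out 1 86400 > /dev/null 2>&1 &

and read the window back afterwards with 'sar -u -f /var/log/cpu_sar.out -s 02:55:00 -e 03:10:00', which should show spikes too short for the one-minute cpview history.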

the_rock
Legend

Got it. Is it the same fw master as before, or has it flipped over now?

Andy

Teddy_Brewski
Collaborator

The same.

the_rock
Legend

K, perfect...so, how is the situation currently? Any better than before?

Andy

Chris_Atkinson
Employee

No worries. If you want to optimize the connection table size, things like tweaking the default DNS service timeout can be effective, e.g.:

R80.x Performance Tuning Tip - Connection Table - Check Point CheckMates
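
As a rough sanity check of how much of the table is short-lived DNS, you can count raw entries containing UDP port 53 (ports appear as 8-digit hex words in the raw table dump, so 53 = 00000035; this is approximate and can overcount if the same hex word shows up in another field):

# fw tab -t connections -u | grep -c "00000035"

If that number is a large fraction of your concurrent connections, shortening the domain-udp service timeout tends to pay off.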

CCSM R77/R80/ELITE
spottex
Collaborator

Is the traffic traversing two firewalls on its way to the destination, by any chance?
It may be the Smart Connection Reuse feature keeping a closed session open on one of the firewalls - probably the client-side one - which then converts the SYN to an ACK that the second firewall drops because the session does not exist.
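
One way to confirm would be to capture the handshake on both firewalls for an affected flow and see whether the second firewall ever receives the SYN, e.g. (hypothetical client IP and RDP port):

# tcpdump -nni any "host 10.1.1.10 and port 3389 and tcp[tcpflags] & tcp-syn != 0"

If the client-side firewall sees the SYN but the second one only ever sees ACK/PUSH-ACK for that tuple, the reuse theory would fit.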

Teddy_Brewski
Collaborator

Hello @spottex 

It does. It hits the external VSX first, then the internal (regular) firewall.

We did try disabling the Connection Reuse feature on the internal firewall; however, it didn't help.

To reduce the number of variables, we replicated the setup through the internal firewall only and still experienced the same disconnections, so we decided to focus on the internal gateway for the moment.
