Solved: Re: Saturation of concurrent connections

Teddy_Brewski · ‎2023-03-20

Hello,

VSX cluster running Check Point R80.40 (Take 154) on open servers (HP) with two VSs.

On a random basis, but always outside office hours and on weekends, we experience 1-2 minutes of intermittent access caused by spikes in concurrent connections.
CPU load goes to 100%, the concurrent connections table is saturated and the firewall starts dropping packets. It only lasts 2-3 minutes and happens anywhere from once or twice a week to once or twice a month. It looks like a DDoS, but it only lasts a few minutes.

I keep raising the concurrent connections value, initially from 25000 to 35000, and then to 55000, but it doesn't seem to help. We have plenty of RAM and potentially can go higher, but I'm not sure it's the right way.
I can't spot anything unusual from the SmartTracker logs during those minutes -- just regular port scans from various networks.

Any ideas/tips/hints would be greatly appreciated.

PS: Just to add that we've experienced the same with R77.30, so I don't think it's linked to the version.

PPS: The specs are: ProLiant DL360 Gen10 (Intel Xeon Gold 6144 3.50GHz (8 cores), 64GB RAM).

Thank you in advance.

Chris_Atkinson · ‎2023-03-20

What size internet link are the gateways connected to and how many users do they protect?

Do you have a number of public IPs routed towards the firewall that aren't necessarily in use?

Is there a router under your control/management located north of the firewall?

CCSM R77/R80/ELITE

Teddy_Brewski · ‎2023-03-20

Thank you @Chris_Atkinson

The Internet link is 1Gb. On the firewall side, the Internet-facing part is a bond of 2x10G dot1q interfaces.

There is a /20 range behind it, with not everything routed towards the firewall. The affected VS handles eight /24 public networks.

I do have a router (Arista DCS-7020SR) under my control in front.

Thank you.

Chris_Atkinson · ‎2023-03-20

How are routes anchored in the environment to prevent traffic looping?

Some generic advice that might help minimize noise and prevent the traffic from getting to the Firewall in the first instance:

- Null route any of your unused public networks/subnets at the Router level

- Bogon & Martian filters / infrastructure ACLs at the Router level

Your ISP may already partially implement something like the second point on your behalf, although some are better at doing so than others.
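
To make the bogon/martian idea concrete, here is a minimal Python sketch that classifies a source address against a handful of well-known special-purpose IPv4 ranges. The list is deliberately short and not a complete bogon feed, and the `is_bogon` helper is a hypothetical name used only for illustration. The actual filtering would be done with ACLs or null routes on the router, not in a script.

```python
import ipaddress

# A few well-known special-purpose IPv4 ranges that should never show up
# as a source address on an Internet-facing link. This is an illustrative
# subset, not an exhaustive or authoritative bogon list.
BOGON_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "0.0.0.0/8",        # "this" network
        "10.0.0.0/8",       # private
        "100.64.0.0/10",    # shared address space (CGN)
        "127.0.0.0/8",      # loopback
        "169.254.0.0/16",   # link-local
        "172.16.0.0/12",    # private
        "192.0.2.0/24",     # documentation (TEST-NET-1)
        "192.168.0.0/16",   # private
        "198.18.0.0/15",    # benchmarking
        "198.51.100.0/24",  # documentation (TEST-NET-2)
        "203.0.113.0/24",   # documentation (TEST-NET-3)
        "224.0.0.0/4",      # multicast
        "240.0.0.0/4",      # reserved
    )
]


def is_bogon(address: str) -> bool:
    """Return True if the address falls inside one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BOGON_NETWORKS)
```

Running the source addresses from dropped-traffic logs through a check like this gives a rough idea of how much of the noise could be discarded before it ever reaches the firewall.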

Refer also: sk112454 - How to configure Rate Limiting rules for DoS Mitigation (R80.20 and newer)

CCSM R77/R80/ELITE

the_rock · ‎2023-03-20

Do you have this option in the cluster properties? If so, I would change it to automatic. That way the gateway automatically calculates CPU/memory redistribution based on the number of connections, rather than you setting it manually.

Andy

Best,
Andy

Chris_Atkinson · ‎2023-03-20

VSX / Virtual Systems don't have this option unfortunately

But with 64GB RAM you should feel comfortable in increasing the manual values considerably from defaults where needed.

CCSM R77/R80/ELITE

the_rock · ‎2023-03-20

Ah kk, thanks Chris, good to know. Last time I worked with VSX was in R77.30, so could not recall if that option was there 🙂

Best,
Andy

Teddy_Brewski · ‎2023-03-20

Our local CP support proposed to keep increasing the number, so I jumped to 55000, which was still saturated.

Sal_Previtera · ‎2023-03-20

Check DNS traffic. A lot of users may switch to an external DNS or try to hide behind encrypted DNS in order to bypass your inspection. Block external DNS for user traffic, if you can.

Teddy_Brewski · ‎2023-03-20

Thank you for the suggestion @Sal_Previtera

DNS traffic from internal users to external DNS servers is blocked. Also, it always happens outside of working hours (for example Friday, 11pm), with nobody present/connected in the office.

It looks like an external attempt (massive port scan?), but I can't gather enough evidence during an incident that lasts only a few minutes.

Chris_Atkinson · ‎2023-03-20

Admittedly less relevant here it seems but some also find it helpful to lower the default timeout for the DNS service.

CCSM R77/R80/ELITE

the_rock · ‎2023-03-20

Not to sound ironic, but that would be the same as a car mechanic telling you to keep adding oil even though you know it's leaking. That won't fix the problem permanently; it's not even a good workaround. We need to find out WHY it's happening, so it can be fixed once and for all.

Any clue when this started? Were there any changes you can recall that could have caused such behavior?

Best,
Andy

Sal_Previtera · ‎2023-03-20

My response was based on my personal experience with internally originated DNS traffic saturating connections on the free Internet service we supply to our customers. Maybe I should have stated that earlier.

Most users were trying to bypass our inspection with external or encrypted DNS, which some browsers now use.

In your case, if it is external traffic reaching your firewalls, you may need to use some anti-DDoS or suspicious activity rules.

You may also want to log the external traffic being dropped, if it is not logged already, until you find the source or sources.

https://sc1.checkpoint.com/documents/R81/WebAdminGuides/EN/CP_R81_LoggingAndMonitoring_AdminGuide/To...

the_rock · ‎2023-03-20

Hey Sal,

No no, your response was excellent. I was commenting on the suggestion Teddy got from the local CP office to keep increasing the connections limit. Personally, I don't think that's even a good workaround. What you gave makes total sense.

Best,
Andy

Teddy_Brewski · ‎2023-03-20

I fully agree. 😀

I've seen it with R77.30 before (a year ago). Different hardware though (also slightly oversized when it comes to RAM) and no VSX. It was the same behavior, and if I'm not wrong, it was fixed (or hidden?) by choosing Automatic Capacity Calculation, so it's quite possible it was always there. What changed now is the frequency.

the_rock · ‎2023-03-20

Can you send us output of below please Teddy?

fw ctl pstat

fw ctl multik print_heavy_conn

Andy

Best,
Andy

Teddy_Brewski · ‎2023-03-20

Thank you Andy. Here it is:

# fw ctl pstat

Virtual System Capacity Summary:
  Physical memory used:   4% (2666 MB out of 53949 MB) - below watermark
  Kernel   memory used:   1% (708 MB out of 53949 MB) - below watermark
  Virtual  memory used:   2% (1405 MB out of 62200 MB) - below watermark
     Used: 237 MB by FW, 1152 MB by zeco
  Concurrent Connections: 28% (15713 out of 54900) - below watermark
  Aggressive Aging is enabled, not active

Kernel memory (kmem) statistics:
  Total memory  bytes  used: 102148474   peak: 145904923
  Allocations: 0 alloc, 0 failed alloc
               0 free, 0 failed free

Cookies:
        2273050826 total, 0 alloc, 0 free,
        841513 dup, 4160716724 get, 8779887 put,
        360874770 len, 1787451782 cached len, 0 chain alloc,
        0 chain free

Connections:
        168007326 total, 107162621 TCP, 47762740 UDP, 13044116 ICMP,
        37849 other, 1769045 anticipated, 29370 recovered, 15713 concurrent,
        54900 peak concurrent

Fragments:
        108 fragments, 24 packets, 0 expired, 0 short,
        0 large, 0 duplicates, 0 failures

NAT:
        2645093/0 forw, 2036238/0 bckw, 2773631 tcpudp,
        2115370 icmp, 1388763-712045 alloc

Sync: Run "cphaprob syncstat" for cluster sync statistics.

'fw ctl multik print_heavy_conn' returned nothing.

~15k connections is what our normal evening looks like.
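
For anyone who wants to track this figure over time, here is a minimal Python sketch that pulls the concurrent-connection numbers out of text shaped like the "Concurrent Connections" line above. The regular expression is based only on the output pasted in this post. Other versions may format the line differently, and the `parse_concurrent` helper is a hypothetical name.

```python
import re

# Matches a line such as:
#   Concurrent Connections: 28% (15713 out of 54900) - below watermark
# The pattern is an assumption derived from the output shown in this thread.
PSTAT_LINE = re.compile(
    r"Concurrent Connections:\s*(\d+)%\s*\((\d+) out of (\d+)\)"
)


def parse_concurrent(pstat_output: str):
    """Return (current, limit, percent), or None if the line is not found."""
    match = PSTAT_LINE.search(pstat_output)
    if match is None:
        return None
    percent, current, limit = (int(value) for value in match.groups())
    return current, limit, percent


sample = "  Concurrent Connections: 28% (15713 out of 54900) - below watermark"
print(parse_concurrent(sample))
```

In the output above, "peak concurrent" (54900) equals the configured limit (54900), which confirms the table has been completely full at some point since the counters were last reset.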

the_rock · ‎2023-03-20

Ok, that looks good, I mean, 28% is nothing, wayyy below limit. By the way, can you also run cpview and tab between different fields to see if there is anything of interest there?

Andy

Best,
Andy

Teddy_Brewski · ‎2023-03-20

This is what CPU and concurrent connections look like at the time of the incident. I extracted it from 'cpview -t'.

2-3 minutes before and after were perfectly normal values.

the_rock · ‎2023-03-20

Yeah, that looks fine to me as well. It may be worth having TAC investigate, but then again, if the issue only shows up on weekends, I'm not sure how much they can do either. Is it possible to have someone available when this occurs on the weekend, or is that random too and not always at the same time?

Andy

Best,
Andy

the_rock · ‎2023-03-20

Hey @Teddy_Brewski, something to maybe verify. I found an email from a few years ago, when I worked with a customer who had a similar issue (though this was Fortinet). They discovered it was caused by a scanning machine in their network that was scheduled to scan a large portion of the network on weekends.

Not saying that's the case with you, but wanted to confirm.

Best,
Andy

Teddy_Brewski · ‎2023-03-20

I thought about that too, including some backups happening in the network, but unfortunately I can't spot a pattern yet: it's not on a fixed schedule. It's always in the evenings or on weekends, but on random dates. It's too quick to catch in real time -- by the time I log in it's already back to normal.

the_rock · ‎2023-03-20

Hm, I wonder if there is a way to set up some sort of monitoring on it to catch more info that would be useful.

Best,
Andy

Timothy_Hall · ‎2023-03-20

Short answer: You need to download and fire up @HeikoAnkenbrand's awesome econn tool (Easy Tool - R81.20 Real time connection table analysis v5.0) while the overflow is happening. It will slice and dice the connections table multiple ways and allow you to identify the "top talkers" or whatever traffic is causing the connections table overflow. I have never tried it on VSX, but as long as you are in the proper VS I assume it would work.

Long answer: This exact scenario is covered extensively in my Gateway Performance Optimization R81.20 class and is actually experienced and corrected in a lab exercise as well. Here are the relevant pages:

Sorin_Gogean · ‎2023-03-21

Hello @Teddy_Brewski ,

We were facing high connection spikes in our environment and were blind, as in not able to see the traffic at the moment it was happening, so we created a script that runs in the background and tracks the current connection count over time.

From those collected values we calculate an average for the last hour. If the current number of connections is higher than the 1-hour average plus 25% (or whatever increase you consider significant), we trigger the data gathering.
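
The trigger logic described above can be sketched in a few lines of Python. This is not the actual script, only an illustration under assumptions: one sample per minute (so 60 samples make up the hour), a 25% threshold, and a hypothetical `SpikeDetector` class name. How the samples are collected on the gateway is left out.

```python
from collections import deque


class SpikeDetector:
    """Flag a sample that exceeds the rolling average by a given factor."""

    def __init__(self, window: int = 60, threshold: float = 1.25):
        # window=60 assumes one sample per minute, i.e. a one-hour history.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, count: int) -> bool:
        """Record a sample and return True if it counts as a spike."""
        spike = False
        if self.samples:
            average = sum(self.samples) / len(self.samples)
            spike = count > average * self.threshold
        # Simplification: spike samples are kept in the history as well,
        # so a long-lasting spike gradually raises the baseline.
        self.samples.append(count)
        return spike


detector = SpikeDetector()
for current in (15000, 15200, 14900, 54900):
    if detector.observe(current):
        print(f"spike detected at {current} connections")
```

A spike that returns True here is the point where the connection table dump would be triggered.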

Data gathering means dumping all of "$FW ctl conntab " into a file. We then parse that file and report the top 5 source IPs and the top 5 destination IPs and, for each of those, the top 10 IPs that have a high number of connections with it.

In other words, if you have a public DNS server behind the Check Point and it has a high number of connections, we will show the top 10 source IPs connecting to that DNS server.
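
The reporting step can be sketched as follows in Python. It assumes the connection table dump has already been parsed into `(source, destination)` pairs. The parsing itself is omitted because the dump format is not shown in this thread, and the `top_talkers` function is a hypothetical name, not part of the script being discussed.

```python
from collections import Counter, defaultdict


def top_talkers(connections, top_n=5, peers_n=10):
    """Summarise (src, dst) pairs into top sources/destinations and their peers."""
    connections = list(connections)
    by_src = Counter(src for src, _ in connections)
    by_dst = Counter(dst for _, dst in connections)

    # For every IP, count which peers it has connections with.
    peers_of_src = defaultdict(Counter)
    peers_of_dst = defaultdict(Counter)
    for src, dst in connections:
        peers_of_src[src][dst] += 1
        peers_of_dst[dst][src] += 1

    return {
        "sources": [
            (ip, count, peers_of_src[ip].most_common(peers_n))
            for ip, count in by_src.most_common(top_n)
        ],
        "destinations": [
            (ip, count, peers_of_dst[ip].most_common(peers_n))
            for ip, count in by_dst.most_common(top_n)
        ],
    }


# Made-up sample data using documentation address ranges.
sample_connections = (
    [("198.51.100.7", "203.0.113.53")] * 3
    + [("198.51.100.8", "203.0.113.53")]
    + [("198.51.100.7", "203.0.113.80")]
)
report = top_talkers(sample_connections)
for ip, count, peers in report["destinations"]:
    print(ip, count, peers)
```

For the DNS example above, the busy DNS server would show up first under "destinations", together with the sources that open the most connections to it.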

This script helped us see what the spikes we encountered actually were - like 1 - 1.5 million connections, which we have now lowered to 200-300K connections 😁.

All of this is done in the background and the report gets emailed. You can also email the FWL connection export (still, it's a HUGE file - like 40 - 50 MB).

So, if this would help, let me know and we can discuss this week, share the script here and walk through it.

Thank you,

PS: we intend to share that script here on CheckMates, but there are still some parts in work ....

Teddy_Brewski · ‎2023-03-21

Thank you @Sorin_Gogean -- greatly appreciated! I'd be happy to try your script -- we're on R80.40. If it can be only shared privately for the time being, I can contact you offline.

Sorin_Gogean · ‎2023-03-21

@Teddy_Brewski , I'll share the script with you, no worries.
Let me just adapt it to your needs, and then we can meet for a while so I can guide you through it - it's quite easy.

Like I said, we built it for our own needs and I would really like to test it in other environments.

Ty,

_Val_ · ‎2023-03-21

Stressing the point of sharing the script in the community, @Sorin_Gogean 🙂

Sorin_Gogean · ‎2023-03-21

@_Val_, we'll do that for sure. If you have some guidelines I could look at on how the code should be written, that would be great.

the_rock · ‎2023-03-21

Happy to try it in my lab if you are willing to share 🙂

Best,
Andy
