lraaicfdb
Participant

High CPU on Security Gateway caused by RAD service / Slow Internet

Hello,

Since Friday, October 18, 2024, 8:57 a.m. (GMT+2, Germany), we have had slow internet access, and the following log message appears on the Security Gateway: RAD request exceeded maximum handling time

In addition, the CPU load of the Security Gateway is very high at irregular intervals (normally 20 percent, rising to 60 percent). In cpview you can see that the active flows of the RAD service increase at these times. I found older posts in the forum in which users mentioned this problem being on Check Point's side; TAC confirmed it at that time.

We have the IPS, Threat Emulation and Anti-Bot blades enabled.
Does anyone else have the problem, for example in Munich, Germany?

10 Replies
_Val_
Admin

Please look into sk180800

lraaicfdb
Participant

Thank you for the tip, which we will test. However, I don't understand why the problem has suddenly appeared when there hasn't been a problem for years. We have been using the Security Gateway for a long time.

Lesley
Leader

I mean the applications around it get changed, and so does the traffic load. Systems get updated or replaced.

All this traffic goes through the firewall, not to mention the changes on the firewall itself from updates etc.

So the statement "nothing has changed" is never true. Otherwise we would be out of work quite soon.

Now, back on topic:

Please share the contents of $FWDIR/conf/rad_conf.C and the output of:

cpinfo -y all

-------
If you like this post please give a thumbs up(kudo)! 🙂
Timothy_Hall
Legend

RAD is heavily dependent on voluminous DNS lookups and speedy DNS responses. Check the DNS server settings in the Gaia OS of the gateway, and then use nslookup/dig to run some lookups against each individual server to ensure they are working and responding quickly. If the primary DNS server is not working or is responding slowly, the impact will be noticeable.
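For anyone who wants to compare the servers side by side, here is a minimal sketch in Python that sends one plain A-record query over UDP to each DNS server and reports how long the answer took. It is not a Check Point tool: the server addresses are placeholders, the function names are made up for illustration, and it assumes you run it from a host that uses the same DNS servers as the gateway. nslookup/dig remain the simplest way to do the same check on the gateway itself.

```python
import socket
import struct
import time


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


def time_lookup(server: str, hostname: str, timeout: float = 2.0):
    """Return the response time in milliseconds, or None if no usable reply."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return None
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Only accept a reply that echoes our query ID.
    if len(reply) < 2 or reply[:2] != query[:2]:
        return None
    return elapsed_ms


if __name__ == "__main__":
    # Placeholder addresses: replace with the DNS servers configured in Gaia.
    servers = ["192.0.2.53", "198.51.100.53"]
    for server in servers:
        ms = time_lookup(server, "cws.checkpoint.com")
        print(server, "no answer" if ms is None else f"{ms:.1f} ms")
```

Running it a few times during a slow period and again during a normal period shows whether one of the servers is consistently slower or dropping queries.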

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
lraaicfdb
Participant

Hello everyone,
Thank you for the tips that I have tested over the last few days.


I changed the DNS servers for external queries, but without success. I also checked whether the Security Gateway can reach the Check Point servers (using curl_cli http://cws.checkpoint.com/Malware/SystemStatus/type/short). This worked without problems and the response came immediately.
I also increased the value "amws_service_check_second" from 1800 to 7200 and the value "max_flows" from 1000 to 1500 in rad_conf.C, as mentioned in some SKs.
Another idea was to check the malware cache and the URLF cache. However, the values returned by the commands "fw tab -t malware_cache_tbl -s" and "fw ctl multik print_bl urlf_cache_tbl -s" seem to be OK and remain relatively stable. There are no big jumps visible.
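To make "no big jumps" less of a judgment call, the entry counts from those commands can be noted down at regular intervals and checked mechanically. The following Python sketch assumes you have already collected the counts by hand; the numbers and the 1.5x threshold are hypothetical, and it does not parse any command output.

```python
def find_jumps(samples, max_ratio=1.5):
    """Return the indexes of samples that grew by more than max_ratio
    compared with the sample directly before them."""
    jumps = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev == 0:
            # Any growth from an empty table counts as a jump.
            if cur > 0:
                jumps.append(i)
        elif cur / prev > max_ratio:
            jumps.append(i)
    return jumps


# Hypothetical entry counts, one per sampling interval.
counts = [1200, 1250, 1230, 2600, 2550]
print(find_jumps(counts))  # [3]
```

An empty result over a period in which the slowdown occurred would support the conclusion that the caches are not the problem.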

Unfortunately, nothing has worked. Do you have any other tips that I can check?

Timothy_Hall
Legend

High CPU utilization by the RAD daemon may just be the symptom, not the cause. 

Are you sure that any routers/switches your organization manages upstream of the firewall have not been replaced? Upgraded?  Any new DoS defenses or web content caching?  Mirror and decrypt to store HTTPS traffic for compliance/regulatory purposes?

Next step is to have a frank discussion with your ISP, and I don't mean with their sales guy.  Have they recently introduced some kind of "helpful" DoS protections or Intrusion Detection/Prevention?  Started shaping/limiting traffic due to their network being oversubscribed? (doubt they'll admit to this without some harsh conversations)  Changed or reduced their peering with the rest of the Internet?  Are you located in a country that might be interfering with Internet traffic in any way due to geopolitical concerns or having their enemies doing the same?  I had a student attending a class from Russia last month and they kept getting disconnected, and had to keep routing through different VPNs to be able to continue attending my class hosted from the USA.

If you have any historical metrics you can use to assess your Internet bandwidth I'd strongly recommend taking a good long look at them; my guess is you will find something has changed upstream.
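If those metrics can be exported as timestamped throughput samples, a before/after comparison around the time the problem started is a quick first look. This is only a sketch: the sample values below are invented, the function name is hypothetical, and the cutoff is the start time reported in the first post.

```python
from datetime import datetime
from statistics import median


def compare_medians(samples, cutoff):
    """Split (timestamp, mbps) samples at cutoff and return
    (median_before, median_after)."""
    before = [mbps for ts, mbps in samples if ts < cutoff]
    after = [mbps for ts, mbps in samples if ts >= cutoff]
    if not before or not after:
        raise ValueError("need samples on both sides of the cutoff")
    return median(before), median(after)


# Start of the problem as reported in the first post (local time).
cutoff = datetime(2024, 10, 18, 8, 57)

# Invented measurements in Mbit/s, for illustration only.
samples = [
    (datetime(2024, 10, 16, 10, 0), 94.0),
    (datetime(2024, 10, 17, 10, 0), 91.0),
    (datetime(2024, 10, 17, 15, 0), 96.0),
    (datetime(2024, 10, 18, 10, 0), 41.0),
    (datetime(2024, 10, 19, 10, 0), 38.0),
    (datetime(2024, 10, 21, 10, 0), 45.0),
]

before, after = compare_medians(samples, cutoff)
change = (after - before) / before
print(f"median before: {before} Mbit/s, after: {after} Mbit/s, change: {change:.0%}")
```

The median is used rather than the mean so that a single unusually fast or slow measurement does not dominate the comparison.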

Edit: See this response: https://community.checkpoint.com/t5/General-Topics/RAD-issues-Timeout-and-quot-RAD-request-exceeded-...

 

Chris_Atkinson
Employee

Are you seeing any of the log entries related to sk182494?

CCSM R77/R80/ELITE
lraaicfdb
Participant

No, our error messages are:
Flow Error=RAD request exceeded maximum handling time
But thanks for the tip.

It definitely has something to do with the Anti-Bot Blade. Devices where HTTPS Inspection, and therefore Anti-Bot, has been disabled have a fast internet speed (even if the CPU is high). It looks as if the Anti-Bot Blade or the RAD service sometimes cannot process the requests quickly enough. I am currently running various tests to find out where the error is.

Timothy_Hall
Legend

In the meantime you can set Anti-Bot to Background while leaving Anti-Virus set to Hold. Anti-Bot is a post-infection blade; if it identifies an infected host, we don't have to stop the traffic 100% with Hold (along with the possible performance hit you are seeing), as the host is already compromised anyway. Zero Phishing and especially Anti-Virus, however, should most definitely be set to Hold since they are pre-infection/disclosure blades, and they typically have far less traffic to inspect than Anti-Bot. This is actually a recommendation made in the new official Check Point Threat Prevention Specialist (CTPS) course:

abot_background.png
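As a rough illustration of why the mode matters for perceived speed, here is a toy model in Python. It is a deliberate simplification and not how the gateway is implemented: it only assumes that in Hold the connection waits for the classification verdict, while in Background the connection proceeds and the verdict arrives later. The lookup times are hypothetical.

```python
def added_latency_ms(verdict_ms, mode):
    """Latency that classification adds to one connection in this toy model."""
    if mode == "hold":
        return verdict_ms  # the connection waits for the verdict
    if mode == "background":
        return 0.0  # the connection proceeds; the verdict arrives later
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical reputation lookup times in milliseconds; one of them is slow.
verdict_times = [20.0, 35.0, 900.0, 25.0]

for mode in ("hold", "background"):
    total = sum(added_latency_ms(v, mode) for v in verdict_times)
    print(mode, total)
```

In this model a single slow lookup is felt directly by the user in Hold, but not in Background, which matches the trade-off described above: Background gives up blocking the first packets of a connection from a host that is already compromised.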

 

lraaicfdb
Participant

Thank you for the tip, I could definitely implement that.

At the moment it looks like I have found the error. I increased the value "rad_max_concurrent_requests" using GuiDBedit. However, I still have to verify that this is not just a temporary improvement, and I will report back after the weekend.
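A back-of-the-envelope way to see why a concurrency limit could produce this symptom is Little's law (requests in flight = arrival rate x time per request). The sketch below assumes that "rad_max_concurrent_requests" caps the number of requests in flight, which the name suggests but which is not confirmed in this thread; all numbers are hypothetical and are not the actual default or the value I set.

```python
def max_request_rate(max_concurrent, avg_response_s):
    """Highest sustainable request rate (requests per second) for a given
    cap on in-flight requests, by Little's law."""
    return max_concurrent / avg_response_s


def is_saturated(arrival_rate, max_concurrent, avg_response_s):
    """True if requests arrive faster than the cap allows them to complete."""
    return arrival_rate > max_request_rate(max_concurrent, avg_response_s)


# Hypothetical figures: 300 requests/s arriving, 0.5 s average response time.
arrival_rate = 300
avg_response_s = 0.5

for limit in (100, 200):
    rate = max_request_rate(limit, avg_response_s)
    saturated = is_saturated(arrival_rate, limit, avg_response_s)
    print(f"limit={limit}: up to {rate:.0f} req/s, saturated={saturated}")
```

Under these assumptions, requests beyond the sustainable rate would queue and eventually exceed the maximum handling time, so raising the limit helps only as long as response times stay the same. If response times also rise, the improvement would be temporary.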
