Solved: Re: High CPU on Security Gateway caused by RAD ser...

lraaicfdb · ‎2024-10-24

Hello,

Since Friday, October 18, 2024, 8:57 a.m. (GMT+2, Germany), we have had slow internet access, and the following log appears on the Security Gateway: RAD request exceeded maximum handling time

In addition, CPU usage on the Security Gateway spikes at irregular intervals (normally 20 percent, rising to 60 percent). In cpview you can see that the active flows of the RAD service increase at these times. I found older posts in the forum in which users reported the problem to Check Point. It was confirmed by TAC at that time.

We have activated the blades IPS, Threat Emulation and Anti-Bot.
Does anyone else have the problem, for example in Munich, Germany?

_Val_ · ‎2024-10-24

Please look into sk180800

lraaicfdb · ‎2024-10-25

Thank you for the tip, which we will test. However, I don't understand why the problem has suddenly appeared when there hasn't been a problem for years. We have been using the Security Gateway for a long time.

Lesley · ‎2024-10-25

I mean, the applications around it change, and so does the traffic load. Systems get updated or replaced.

All this traffic goes via the fw. Not to mention the changes on the firewall itself from updates etc.

So the statement "nothing has changed" is never true. Otherwise we would be out of work quite soon.

Now, back on topic:

Please share the contents of $FWDIR/conf/rad_conf.C and the output of:

cpinfo -y all

Timothy_Hall · ‎2024-10-26

RAD is heavily dependent on voluminous DNS lookups and speedy DNS responses. Check the DNS server settings in the Gaia OS of the gateway, and then use nslookup/dig to run some lookups against each individual server to ensure they are working and responding quickly. If the primary DNS server is not working or responding slowly the impact will be noticeable.
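For anyone who wants to script the per-server check described above instead of running nslookup/dig by hand, here is a minimal Python 3 sketch. It is not a Check Point tool: it builds a plain DNS A-record query with the standard library and times the UDP round trip to each server you pass on the command line. The script name, the function names, and the 2-second timeout are my own assumptions.

```python
import socket
import struct
import sys
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (recursion desired)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def time_lookup(server, name, timeout=2.0):
    """Return the round-trip time in milliseconds, or None on timeout/error."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts, unreachable servers and bad addresses.
            return None
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Only accept a reply that echoes our query ID.
    if len(reply) < 2 or reply[:2] != query[:2]:
        return None
    return elapsed_ms


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: dns_timing.py NAME SERVER_IP [SERVER_IP ...]")
    else:
        name = sys.argv[1]
        for server in sys.argv[2:]:
            ms = time_lookup(server, name)
            result = "no reply" if ms is None else "%.1f ms" % ms
            print("%-15s %s" % (server, result))
```

Run it a few times with each DNS server configured in Gaia as an argument; a server that regularly shows "no reply" or a much higher time than the others is the one to look at first.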

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

lraaicfdb · ‎2024-10-29

Hello everyone,
Thank you for the tips that I have tested over the last few days.

I changed the DNS servers for external queries, but without success. I also checked whether the Security Gateway can reach the Check Point servers (using curl_cli http://cws.checkpoint.com/Malware/SystemStatus/type/short). This also worked without problems, and the response came immediately.
I also increased the value "amws_service_check_second" from 1800 to 7200 and the value "max_flows" from 1000 to 1500 in rad_conf.C. This was mentioned in some SKs.
Another idea was to check the malware cache and the URLF cache. However, the values that the commands "fw tab -t malware_cache_tbl -s" and "fw ctl multik print_bl urlf_cache_tbl -s" gave me seem to be OK and remain relatively stable. There are no big jumps visible here.
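If you note down the entry counts from those commands at regular intervals, the "no big jumps" check can be made a bit more objective. The sketch below is only an illustration: it takes a list of counts you collected yourself and flags any sample that grew by more than a chosen factor over the previous one. The function name, the 1.5 threshold, and the sample numbers are made up, not values from the gateway.

```python
def find_jumps(samples, max_ratio=1.5):
    """Return (index, previous, current) for each sample that grew by more than max_ratio."""
    jumps = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        # A previous value of 0 is skipped to avoid dividing by zero.
        if prev > 0 and cur / prev > max_ratio:
            jumps.append((i, prev, cur))
    return jumps


# Hypothetical counts, written down by hand every few minutes.
stable = [1000, 1040, 990, 1020]
spiky = [1000, 1040, 2600, 2500]

print(find_jumps(stable))  # []
print(find_jumps(spiky))   # [(2, 1040, 2600)]
```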

Unfortunately, nothing has worked. Do you have any other tips that I can check?

Timothy_Hall · ‎2024-10-29

High CPU utilization by the RAD daemon may just be the symptom, not the cause.

Are you sure that any routers/switches your organization manages upstream of the firewall have not been replaced? Upgraded? Any new DoS defenses or web content caching? Mirror and decrypt to store HTTPS traffic for compliance/regulatory purposes?

Next step is to have a frank discussion with your ISP, and I don't mean with their sales guy. Have they recently introduced some kind of "helpful" DoS protections or Intrusion Detection/Prevention? Started shaping/limiting traffic due to their network being oversubscribed? (doubt they'll admit to this without some harsh conversations) Changed or reduced their peering with the rest of the Internet? Are you located in a country that might be interfering with Internet traffic in any way due to geopolitical concerns or having their enemies doing the same? I had a student attending a class from Russia last month and they kept getting disconnected, and had to keep routing through different VPNs to be able to continue attending my class hosted from the USA.

If you have any historical metrics you can use to assess your Internet bandwidth I'd strongly recommend taking a good long look at them; my guess is you will find something has changed upstream.

Edit: See this response: https://community.checkpoint.com/t5/General-Topics/RAD-issues-Timeout-and-quot-RAD-request-exceeded-...

Chris_Atkinson · ‎2024-10-30

Are you seeing any of the log entries related to sk182494?

lraaicfdb · ‎2024-10-31

No, our error messages are:
Flow Error=RAD request exceeded maximum handling time
But thanks for the tip.

It definitely has something to do with the Anti-Bot blade. Devices where HTTPS Inspection, and therefore Anti-Bot, has been disabled have a fast internet speed (even if the CPU is high). It looks as if the Anti-Bot blade or the RAD service sometimes cannot process the requests quickly enough. I am currently running various tests to find out where the error is.

Timothy_Hall · ‎2024-10-31

In the meantime you can set Anti-Bot to Background while leaving Anti-Virus set to Hold. Anti-Bot is a post-infection blade; if it identifies an infected host, we don't have to stop the traffic 100% with Hold (along with the possible performance hit you are seeing), as the host is already compromised anyway. But Zero Phishing and especially Anti-Virus should most definitely be set to Hold since they are pre-infection/disclosure blades; they also typically have far less traffic to inspect than Anti-Bot. This is actually a recommendation made in the new official Check Point Threat Prevention Specialist (CTPS) course.

lraaicfdb · ‎2024-10-31

Thank you for the tip, I could definitely implement that.

At the moment it looks like I have found the error. I increased the value "rad_max_concurrent_requests" using GuiDBedit. But I still have to verify whether this is more than a temporary improvement, and I will report back after the weekend.

lraaicfdb · ‎2024-11-04

I think that was actually the solution. The RAD service probably had too many requests to process. Using the Check Point Database Tool (GuiDBedit.exe, found in the PROGRAM folder of the SmartConsole installation folder), I increased the value "rad_max_concurrent_requests" from 2500 to 5000. It can be found under Other → rad_services → rad_settings.
Since then, the CPU usage has returned to the normal 20 percent.
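A rough way to picture why a higher limit can help is Little's law from queueing theory: the average number of requests in flight equals the arrival rate times the time each request takes. The sketch below is only that general rule in Python; it is not how RAD counts requests internally, and the request rate and handling time are invented for illustration. Only the 2500 and 5000 limits come from this thread.

```python
def required_concurrency(requests_per_second, seconds_per_request):
    """Little's law: average requests in flight = arrival rate * time per request."""
    return requests_per_second * seconds_per_request


def is_saturated(requests_per_second, seconds_per_request, max_concurrent):
    """True if the average load needs more slots than the limit allows."""
    return required_concurrency(requests_per_second, seconds_per_request) > max_concurrent


# Hypothetical load: 4000 requests per second, 0.75 seconds per request.
in_flight = required_concurrency(4000, 0.75)
print(in_flight)                          # 3000.0
print(is_saturated(4000, 0.75, 2500))     # True: above the old limit
print(is_saturated(4000, 0.75, 5000))     # False: fits under the new limit
```

The same rule also shows why slow responses (for example from DNS) matter: if the time per request doubles, the number of requests in flight doubles as well, even with unchanged traffic.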

maad-pul · ‎2025-05-27

Hi All,

Is SK182859 still relevant after patching to R81.20 TAKE99 (which was done yesterday)?

I still see some problems in the Anti-Bot & Anti-Virus blades, but far fewer than before patching.

Should I still implement the "Solution" from this thread, i.e. change "rad_max_concurrent_requests" from 2500 to 5000?
https://community.checkpoint.com/t5/Security-Gateways/High-CPU-on-Security-Gateway-caused-by-RAD-ser...

PRJ-58090, PMTR-109845 (Security Gateway): When the autodebug feature is enabled, the RAD service may consume high CPU and trigger "RAD service not available" alert logs.

/Mattias

High CPU on Security Gateway caused by RAD service / Slow Internet