High CPU on Security Gateway caused by RAD service / Slow Internet
Hello,
Since Friday, October 18, 2024, 8:57 a.m. (GMT+2, Germany), we have had slow internet access, and the following log entries appear on the Security Gateway: RAD request exceeded maximum handling time
In addition, the CPU usage of the Security Gateway spikes at irregular intervals (normally 20 percent, rising to 60 percent). In cpview you can see that the active flows of the RAD service increase at these times. I found older posts in the forum in which users reported this problem with Check Point, and TAC confirmed it at the time.
We have the IPS, Threat Emulation and Anti-Bot blades enabled.
Does anyone else have the problem, for example in Munich, Germany?
Please look into sk180800
Thank you for the tip, which we will test. However, I don't understand why the problem has suddenly appeared when there hasn't been a problem for years. We have been using the Security Gateway for a long time.
I mean, the applications around it change, and so does the traffic load. Systems get updated or replaced.
All this traffic goes through the firewall, not to mention the changes on the firewall itself from updates etc.
So the statement "nothing has changed" is never true. Otherwise we would be out of work quite soon.
Back on topic:
Please share the contents of $FWDIR/conf/rad_conf.C and the output of:
cpinfo -y all
RAD is heavily dependent on voluminous DNS lookups and speedy DNS responses. Check the DNS server settings in the Gaia OS of the gateway, and then use nslookup/dig to run some lookups against each individual server to ensure they are working and responding quickly. If the primary DNS server is not working or is responding slowly, the impact will be noticeable.
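To make "responding quickly" measurable, here is a minimal Python sketch that times a handful of lookups. It is only an illustration: the function name `time_lookups`, the 0.5 second threshold and the result layout are my own assumptions, and it uses the system resolver of whatever machine it runs on, so it does not target one specific DNS server. For per-server checks, `dig @<server> <name>` remains the right tool.

```python
import socket
import time
from typing import Callable, Iterable


def time_lookups(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = lambda name: socket.getaddrinfo(name, None),
    slow_threshold_s: float = 0.5,  # hypothetical cutoff, tune to your environment
) -> list[dict]:
    """Resolve each hostname once and record how long the lookup took."""
    results = []
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolve(name)
            error = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            error = str(exc)
        elapsed = time.perf_counter() - start
        results.append({
            "host": name,
            "seconds": elapsed,
            "ok": error is None,
            "slow": elapsed > slow_threshold_s,
            "error": error,
        })
    return results
```

Calling `time_lookups(["cws.checkpoint.com"])` from a host that uses the same DNS servers as the gateway gives one row per name; rows with `ok` false or `slow` true are the ones worth a closer look.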
CET (Europe) Timezone Course Scheduled for July 1-2
Hello everyone,
Thank you for the tips that I have tested over the last few days.
I changed the DNS servers for external queries, but without success. I also checked whether the Security Gateway can reach the Check Point servers (using curl_cli http://cws.checkpoint.com/Malware/SystemStatus/type/short). This also worked without problems and the response came immediately.
I also increased the value "amws_service_check_second" from 1800 to 7200 and the value "max_flows" from 1000 to 1500 in rad_conf.C. This was mentioned in some SKs.
Another idea was to check the malware cache and the URLF cache. However, the values that the commands "fw tab -t malware_cache_tbl -s" and "fw ctl multik print_bl urlf_cache_tbl -s" gave me seem to be OK and remain relatively stable. There are no big jumps visible here.
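For anyone who wants to make "no big jumps" less of a gut feeling, a minimal sketch: note the entry count reported by the commands above at regular intervals and flag any sample that grows sharply compared with the previous one. The function name `find_jumps`, both thresholds and the sample numbers are made up for illustration and are not taken from my gateway.

```python
from typing import Sequence


def find_jumps(
    samples: Sequence[int],
    max_ratio: float = 1.5,   # hypothetical: flag growth of more than 50 percent
    min_delta: int = 100,     # hypothetical: ignore small absolute changes
) -> list[int]:
    """Return the indices of samples that jumped relative to the previous sample."""
    jumps = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if cur - prev >= min_delta and cur > prev * max_ratio:
            jumps.append(i)
    return jumps


# Invented example values, one sample per interval.
print(find_jumps([1200, 1250, 1230, 4100, 4150]))  # prints [3]
```

A stable series returns an empty list, which is what I saw by eye.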
Unfortunately, nothing has worked. Do you have any other tips that I can check?
High CPU utilization by the RAD daemon may just be the symptom, not the cause.
Are you sure that any routers/switches your organization manages upstream of the firewall have not been replaced? Upgraded? Any new DoS defenses or web content caching? Mirror and decrypt to store HTTPS traffic for compliance/regulatory purposes?
Next step is to have a frank discussion with your ISP, and I don't mean with their sales guy. Have they recently introduced some kind of "helpful" DoS protections or Intrusion Detection/Prevention? Started shaping/limiting traffic due to their network being oversubscribed? (I doubt they'll admit to this without some harsh conversations.) Changed or reduced their peering with the rest of the Internet? Are you located in a country that might be interfering with Internet traffic in any way due to geopolitical concerns, or whose adversaries might be doing the same? I had a student attending a class from Russia last month who kept getting disconnected and had to keep routing through different VPNs to be able to continue attending my class hosted from the USA.
If you have any historical metrics you can use to assess your Internet bandwidth I'd strongly recommend taking a good long look at them; my guess is you will find something has changed upstream.
Edit: See this response: https://community.checkpoint.com/t5/General-Topics/RAD-issues-Timeout-and-quot-RAD-request-exceeded-...
Are you seeing any of the log entries related to sk182494?
No, our error messages are:
Flow Error=RAD request exceeded maximum handling time
But thanks for the tip.
It definitely has something to do with the Anti-Bot Blade. Devices where HTTPS Inspection and therefore Anti-Bot have been disabled have a fast internet speed (even if the CPU is high). It looks as if the Anti-Bot Blade or the RAD service sometimes cannot process the requests quickly enough. I am currently running various tests to find out where the error is.
In the meantime you can set Anti-Bot to Background while leaving Anti-Virus set to Hold. Anti-Bot is a post-infection blade: if it identifies an infected host, we don't have to stop the traffic 100% with Hold (and take the possible performance hit you are seeing), as the host is already compromised anyway. Zero Phishing and especially Anti-Virus should most definitely stay on Hold since they are pre-infection/disclosure blades; however, they typically have far less traffic to inspect than Anti-Bot. This is actually a recommendation made in the new official Check Point Threat Prevention Specialist (CTPS) course.
Thank you for the tip, I could definitely implement that.
At the moment it looks like I have found the error. I increased the value "rad_max_concurrent_requests" using GuiDBedit. But I still have to verify that this is more than a temporary improvement and will report back after the weekend.
I think that was actually the solution. The RAD service probably had too many requests to process. Using the Check Point Database Tool (GuiDBedit.exe, found in the PROGRAM folder of the SmartConsole installation folder), I increased the value "rad_max_concurrent_requests" from 2500 to 5000. It can be found under Other → rad_services → rad_settings.
Since then, the CPU usage has returned to the normal 20 percent.
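My rough mental model for why the limit could matter, as a sketch only: Little's law says the average number of requests in flight equals the arrival rate multiplied by the average handling time. The request rate and handling times below are invented, the helper names are hypothetical, and I am assuming (not claiming from any documentation) that RAD runs into trouble once the in-flight count passes "rad_max_concurrent_requests".

```python
def in_flight(arrival_rate_per_s: float, handling_time_s: float) -> float:
    """Little's law: mean concurrent requests = arrival rate * mean handling time."""
    return arrival_rate_per_s * handling_time_s


def exceeds_limit(arrival_rate_per_s: float, handling_time_s: float, limit: int) -> bool:
    """True if the average number of in-flight requests is above the configured limit."""
    return in_flight(arrival_rate_per_s, handling_time_s) > limit


# Invented numbers: 4000 requests per second, handling time rising from 0.5 s to 0.8 s.
print(in_flight(4000, 0.5))            # 2000.0, below a limit of 2500
print(exceeds_limit(4000, 0.8, 2500))  # True, about 3200 in flight
print(exceeds_limit(4000, 0.8, 5000))  # False, the higher limit has headroom
```

In this picture a modest increase in handling time is enough to push past 2500, while 5000 leaves room. That matches what I observed, but it does not prove the mechanism.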
