Neil_ZInk
Collaborator

RAD kernel errors

Platform: R80.10 Take 42 (and 56)

I am seeing thousands of kernel errors in /var/log/messages (several hundred per minute):

Jan 12 10:44:05 2018 xxxx kernel: [fw4_3];[ERROR]: appi_rad_uf_cmi_handler_match_cb: appi_rad_uf_cmi_handler_server_response() failed
Jan 12 10:44:05 2018 xxxx kernel: [fw4_20];[ERROR]: appi_rad_uf_cmi_handler_server_response: no hello done, failed
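
To put a number on "several hundred per minute", the errors can be counted per minute. This is only a rough sketch: the function name is made up for illustration, and it assumes the timestamp layout of the two lines above (month, day, time, year).

```shell
#!/bin/bash
# Count appi_rad_uf_cmi_handler errors per minute in a syslog-style file.
# Assumes the layout shown above: "Mon DD HH:MM:SS YYYY host kernel: ...".
# rad_errors_per_minute is a hypothetical helper name, not a Check Point tool.
rad_errors_per_minute() {
    local logfile="${1:-/var/log/messages}"
    grep 'appi_rad_uf_cmi_handler' "$logfile" |
        awk '{ count[$1 " " $2 " " substr($3, 1, 5)]++ }
             END { for (minute in count) print minute, count[minute] }' |
        sort
}
```

Running `rad_errors_per_minute /var/log/messages` prints one line per minute with the number of matching errors, which makes it easier to compare before and after a change.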

I have had a ticket open with TAC since we installed R80.10 (Nov 1), with no resolution.

Has anyone else seen this?

Thanks in advance.

Accepted Solutions
Ewane_Junior
Participant

Neil,

We faced a similar issue some time ago and worked onsite with a PS team, with whom we increased the RAD connections:

ckp_regedit -a SOFTWARE\\CheckPoint\\FW1\\$(cpprod_util CPPROD_GetCurrentVersion FW1) RAD_QUERIES_NUMBER_PER_CONNECTION 400

Also check sk90422: How to modify URL Filtering cache size?


17 Replies
Timothy_Hall
Legend

Please provide output of enabled_blades command run on firewall. 

How busy is the rad daemon as shown in top when these messages are happening?

Are you using full-fledged HTTPS Inspection, the "Categorize HTTPS Sites" checkbox, or no HTTPS Inspection at all?

Based on the messages you have provided, can't really tell if there is some kind of problem between the firewall kernel and the rad daemon, or the rad daemon and the Check Point ThreatCloud.  If you suspect the latter, you could try turning on rad daemon statistics with rad_admin stats on appi then head to the "Advanced...RAD" screen of cpview to see if certain error counters increment in sync with the error messages getting spewed into the syslog.  Don't forget to turn off statistics with rad_admin stats off appi when done looking!
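
Since it is easy to forget to switch the statistics off again, the two rad_admin calls can be wrapped so that "off" always runs. This is a sketch only: the rad_admin arguments are the ones quoted above, while the wrapper function, its name and the default duration are assumptions for illustration.

```shell
#!/bin/bash
# Switch RAD statistics on for a limited time, then switch them off again.
# The rad_admin invocations are copied from the reply above; the wrapper
# (with_rad_stats) and its 300-second default are illustrative assumptions.
with_rad_stats() (
    seconds="${1:-300}"
    rad_admin stats on appi || exit 1
    # From here on, statistics are switched off whenever this subshell exits,
    # including after Ctrl-C.
    trap 'rad_admin stats off appi' EXIT
    trap 'exit 130' INT TERM
    sleep "$seconds"
    exit 0
)
```

While `with_rad_stats 600` is sleeping, cpview can be opened in a second session to watch the RAD counters.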

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Neil_ZInk
Collaborator

Tim, thanks for the reply.

blades: fw urlf av appi ips dlp identityServer anti_bot ThreatEmulation

HTTPS Inspection is not enabled on the Check Points; we are doing HTTPS inspection with another vendor in a sandwich model. "Categorize HTTPS Sites" is checked.

I did not see any errors in cpview for appi or urlf.

I sometimes see a spike in the rad process when the messages appear.

normal:

Hardware: 15600s with enhancement pack.

Timothy_Hall
Legend

Hmm how many filtered users behind the firewall? 

Please provide output of following commands, run every 10 minutes or so:

fw tab -t urlf_cache_tbl -s

fw tab -t appi_cache_tbl -s
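
If running these by hand every 10 minutes is a nuisance, a small loop can do the sampling. A minimal sketch, assuming bash on the gateway; the function name, interval and sample count are made up for illustration, and only the two fw tab commands come from this post.

```shell
#!/bin/bash
# Print the summary of both cache tables with a timestamp, repeatedly.
# The fw tab commands come from the post above; sample_cache_tables, the
# 600-second interval and the sample count are assumptions for illustration.
sample_cache_tables() {
    local interval="${1:-600}" samples="${2:-6}" i
    for ((i = 1; i <= samples; i++)); do
        date '+%Y-%m-%d %H:%M:%S'
        fw tab -t urlf_cache_tbl -s
        fw tab -t appi_cache_tbl -s
        # Sleep between samples, but not after the last one.
        if ((i < samples)); then sleep "$interval"; fi
    done
}
```

For example, `sample_cache_tables 600 6 > /tmp/cache_samples.txt` would collect about an hour of readings.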

Am wondering if it is this overflow situation which is described in my book: sk90422: How to modify URL Filtering cache size?

If so might be fun to manually clear the APPI and/or URLF cache tables and see if it causes the messages to spew as described here (may cause a slowdown/outage though):  sk64280: How to clear URL Filtering kernel cache?

Also bit of a long shot, but have a look in $FWDIR/log/rad.elg to see if anything interesting is getting logged into it...

Neil_ZInk
Collaborator

Thanks again for the quick reply.

$FWDIR/log/rad.elg is showing the following intermittent error:

*** RAD CONNECTION ERROR [Fri-12/0/2108-15:41:39] ***
Request: GET /URLF/urlf/1.0?resource=bXNuLmNvbQ==&key=123456 HTTP/1.1
Connection: Keep-Alive
User-Agent: RAD_CLIENT
Host: cws.checkpoint.com:80
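
The resource value in that request looks like plain base64, so it can be decoded to see which site the lookup was for. A small sketch, with a helper name that is purely illustrative; it assumes GNU base64 is available on the box.

```shell
#!/bin/bash
# Decode the base64 value of the "resource" query parameter from a RAD request line.
# decode_rad_resource is a hypothetical helper; that the parameter is base64
# is inferred from the sample above, not from Check Point documentation.
decode_rad_resource() {
    local request="$1" encoded
    # Keep only what follows "resource=" up to the next "&" or space.
    encoded=$(printf '%s\n' "$request" | sed -n 's/.*[?&]resource=\([^& ]*\).*/\1/p')
    [ -n "$encoded" ] || return 1
    printf '%s' "$encoded" | base64 -d
    echo
}
```

For the request shown above, the decoded value is msn.com, so the failing query appears to be an ordinary categorization lookup rather than anything unusual.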

TAC has seen this error and had me increase the connection limit (it did not help):

ckp_regedit -a SOFTWARE\\CheckPoint\\FW1\\$(cpprod_util CPPROD_GetCurrentVersion FW1) RAD_QUERIES_NUMBER_PER_CONNECTION 40

As for clearing the cache: we see the errors right away after a reboot, so I don't think clearing the cache has any effect.


# fw tab -t urlf_cache_tbl -s

HOST NAME ID #VALS #PEAK #SLINKS

localhost urlf_cache_tbl 88 14707 0 0
# fw tab -t urlf_cache_tbl -s

HOST NAME ID #VALS #PEAK #SLINKS

localhost urlf_cache_tbl 88 15390 0 0
# fw tab -t urlf_cache_tbl -s

HOST NAME ID #VALS #PEAK #SLINKS

localhost urlf_cache_tbl 88 15966 0 0

# fw tab -t appi_cache_tbl -s

HOST NAME ID #VALS #PEAK #SLINKS

localhost appi_cache_tbl 95 0 0 0

# fw tab -t appi_cache_tbl -s

HOST NAME ID #VALS #PEAK #SLINKS

localhost appi_cache_tbl 95 0 0 0
# fw tab -t appi_cache_tbl -s

HOST NAME ID #VALS #PEAK #SLINKS

localhost appi_cache_tbl 95 0 0 0

Neil_ZInk
Collaborator

We have 30k+ users behind the gateway. We did not have this issue on R77.30 with the 13800s chassis.

Timothy_Hall
Legend

Check out sk90422: How to modify URL Filtering cache size? The URL filtering cache has a fixed size of 20k. When that limit is reached, the entire cache is *cleared*. This causes a big flurry of URL categorization lookups by rad (and high CPU load) as it repopulates the cache. If this keeps happening over and over again it can cause various issues. The urlf_cache_tbl table size you provided shows you are close to that 20k; that really isn't a problem unless you are constantly banging against the limit.

Might be worth trying to increase the cache size as stated in the SK; according to the SK the default 20k size is suitable for 1,000 filtered users and you have way way more than that.
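
To see how close the table is to the limit at any moment, the #VALS column can be compared against the cache size. A rough sketch: the function name is hypothetical, the column order is taken from the output posted in this thread, and 20000 is assumed as the exact value behind "20k".

```shell
#!/bin/bash
# Report #VALS and the fill level of a cache table from "fw tab -t <table> -s" output.
# Assumes the column order shown in this thread: HOST NAME ID #VALS #PEAK #SLINKS.
# cache_fill is a hypothetical helper; the 20000 default is an assumption for "20k".
cache_fill() {
    local table="$1" limit="${2:-20000}"
    awk -v table="$table" -v limit="$limit" \
        '$2 == table { printf "%s %d of %d (%.1f%%)\n", table, $4, limit, 100 * $4 / limit }'
}
```

Usage would be `fw tab -t urlf_cache_tbl -s | cache_fill urlf_cache_tbl 20000`. With the last sample posted earlier (15966 entries) this reports 79.8% of the default size.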

Also are you using the new Mellanox cards that are faster than 10Gbps?  Is hyperthreading turned on?

Neil_ZInk
Collaborator

Tim, thanks again for your thoughts.

We are not using the Mellanox cards. Hyperthreading is turned on.

Changed the cache from 20k to 80k. Same errors in the messages file.

New errors in the rad.elg file:

[rad 23961]@xx [15 Jan 8:06:07] Warning:cp_timed_blocker_handler: A handler [0xf7abf6e0] blocked for 14 seconds.
[rad 23961]@xx [15 Jan 8:06:07] Warning:cp_timed_blocker_handler: Handler info: Library [/opt/CPshrd-R80/lib/libComUtils.so], Function offset [0x156e0].
[rad 23961]@xx[15 Jan 8:11:45] Warning:cp_timed_blocker_handler: A handler [0xf75b1d80] blocked for 5 seconds.
[rad 23961]@xx[15 Jan 8:11:45] Warning:cp_timed_blocker_handler: Handler info: Library [/opt/CPshrd-R80/lib/libkiss_apps.so], Function offset [0xf4d80].
[rad 23961]@xx[15 Jan 8:15:04] Warning:cp_timed_blocker_handler: A handler [0xf7dafa30] blocked for 12 seconds.
[rad 23961]@xx [15 Jan 8:15:04] Warning:cp_timed_blocker_handler: Handler info: Library [/opt/CPsuite-R80/fw1/lib/librad.so], Function offset [0x66a30].
[rad 23961]@xx [15 Jan 8:15:04] Warning:cp_timed_blocker_handler: Handler info: Nearest symbol name [_ZN10CRadFwConn18static_handle_dataEP7_FwConnPvj], offset

Alex_Rozhko
Employee

Tim,

I like Check Point's flexibility in manipulating buffers and/or caches (I have used it quite a bit), but "something's gotta give" when those resources are taken. Can you elaborate a bit more on cache increases? What is the limit/threshold, if any? What should be taken into consideration when default values for caches/buffers are changed?

Timothy_Hall
Legend

sk90422: How to modify URL Filtering cache size? describes the situation pretty well, which is high CPU consumption by the rad daemon and noticeable user delays when first trying to connect to websites when categorization is set to "Hold".  There are no cache size limits I'm aware of, but cranking any variable to an arbitrarily large number can have very bad effects in some cases.  Doubling it, trying again, then doubling it again if needed is generally safe.  If the original value is a power of 2 keeping it a power of 2 is recommended.  The cache value should NOT be changed though unless you have conclusively verified that the situation described in the SK is occurring.

Obviously increasing the cache size uses more memory (which is generally plentiful on the latest Check Point firewall appliances), but depending on how the cache is searched (sequential vs. hashed/indexed - not sure which) allowing it to grow bigger may cause slightly more CPU overhead for every single cache lookup.  Guess that would have to be weighed against the impact of the cache overflowing/clearing and filtered web traffic coming to a screeching halt as the cache is repopulated.

Timothy_Hall
Legend

Getting into the weeds now, please determine the process identifier (PID) for rad, assume it is 4964 for our example.  Please provide output of following:

cat /proc/4964/limits

cat /proc/4964/status

cat /proc/4964/net/netstat
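
To avoid hard-coding the PID, the three files can be dumped for whatever PID rad currently has. A minimal sketch: the helper name is made up, and it assumes the daemon's process name is simply "rad" so that pidof can find it.

```shell
#!/bin/bash
# Dump the /proc entries requested above for one process.
# Unreadable or missing files are reported instead of aborting.
# dump_proc_info is a hypothetical helper name.
dump_proc_info() {
    local pid="$1" f
    for f in limits status net/netstat; do
        echo "== /proc/$pid/$f =="
        cat "/proc/$pid/$f" 2>/dev/null || echo "(not readable)"
    done
}
```

Something like `dump_proc_info "$(pidof -s rad)"` should then produce all three outputs in one go (the `-s` asks pidof for a single PID in case more than one matches).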

Wondering if it is some kind of resource limit issue...

Neil_ZInk
Collaborator

Update:

Adding the following command helped for a little while, then the messages started flooding back in.

ckp_regedit -a SOFTWARE\\CheckPoint\\FW1\\$(cpprod_util CPPROD_GetCurrentVersion FW1) RAD_QUERIES_NUMBER_PER_CONNECTION 400

I could have been running into a NAT exhaustion issue. I changed my hide NAT from a single IP to a range of IPs. The error messages have subsided to only one or two every 5 minutes.

Thanks, everyone, for the feedback.

Timothy_Hall
Legend

Thanks for the update Neil, I assume your firewall is using ClusterXL?  In that case I believe connections initiated by RAD will be hide NATTed behind the cluster VIP by default.  For anyone else looking to do a "many to fewer" hide NAT, it is covered in my book and my reply in this thread:

R80.10 - Hide behind many question 

Neil_ZInk
Collaborator

I am using ClusterXL.

Ewane_Junior
Participant

Hello Neil,

Please do also note that the 400 (RAD_QUERIES_NUMBER_PER_CONNECTION 400) can be increased.

genisis__
Leader

While troubleshooting an issue, I noted a huge overall CPU drop when disabling the AV/ABOT blades; additionally, rad process utilisation dropped from around 350% to around 30%.

Further investigation will be done by CP TAC.
