Re: UserCheck Block Page Times Out

Trey_Havener · ‎2019-05-01

We just cut over to our 5400 cluster, and during testing the block page displayed fine. Today during the cutover, however, the block page keeps timing out. We aren't doing much on the block page beyond telling users why they were blocked and to contact us if they feel it's in error. If I open an incognito tab it sometimes works, but most of the time it times out as well. I have a ticket open but wanted to see if anyone else has had this problem. We aren't doing any HTTPS inspection...not ready for that nightmare. Just URL filtering.

Vladimir · ‎2019-05-01

Are you sure it times out or are you presented with the empty page on HTTPS resources and with the Block page on HTTP?

Trey_Havener · ‎2019-05-02

Like I said, if I use an incognito tab I am sometimes presented with a block page, whether the site is HTTP or HTTPS, and Chrome appears to get the page more often than Firefox and IE... This morning it seems to be more hit than miss. Not sure if the block page issue is performance related, but our firewalls aren't really getting hit all that much. They are sized appropriately.

Maarten_Sjouw · ‎2019-05-02

I take it you have the UserCheck page pointed at the proper IP / URL, that it resolves properly, and that you have an access rule above the stealth rule that allows the traffic to the gateway.

Regards, Maarten

Trey_Havener · ‎2019-05-02

It does resolve, and there is a rule above the stealth rule to allow that traffic. Like I said, it works sometimes, not all the time. I feel like if either of those things weren't set it would never work. Also, it never works in Firefox.

Steve_Payne · ‎2019-05-02

Have you tried restarting UserCheck?

[Expert@HostName]# mpclient stop UserCheck
[Expert@HostName]# mpclient start UserCheck

You could also look at sk85040. You may need to increase the number of HTTP sessions.

Timothy_Hall · ‎2019-05-02

In some cases involving HTTPS connections when HTTPS Inspection is not enabled, certain browsers will refuse to show the UserCheck page because they think there is a man-in-the-middle attack occurring against the connection, which technically there is by virtue of the firewall trying to stuff an alternate web page into the connection. So try to establish under what specific circumstances the UserCheck page is not appearing, where the variables are the browser being used, the website being visited, and the firewall ingress interface for the client. If you can establish that different browsers exhibit different (but consistent) behavior with regard to the UserCheck page appearing for a certain site, that is to some degree expected and there is not much you can do about it short of enabling HTTPS Inspection. If clients coming in on a certain firewall interface are consistently not getting UserChecks, that indicates that the IP address in the UserCheck URL is not reachable coming in on that specific interface.

However if there is no consistent pattern and it seems truly "random", check the stability of the fwucd and usrchkd daemons on the firewall and make sure they are not crashing or having other issues. Might be enlightening to check log files $FWDIR/log/usrchkd.elg and $FWDIR/log/fwucd.elg to see if any interesting error messages are being barfed into them.
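For a first pass over those two logs, counting the lines that look like errors can show quickly whether either daemon is complaining. A minimal sketch: the function name `count_suspect_lines` and the keyword pattern (error, fail, crash) are assumptions for illustration, not anything taken from Check Point documentation, so tune the pattern to what your .elg files actually contain.

```shell
# Count lines in a log file that look like errors.
# The keyword pattern is an assumption; adjust it to your own logs.
count_suspect_lines() {
    file=$1
    if [ ! -r "$file" ]; then
        # Unreadable or missing file: report zero and signal failure.
        echo "0"
        return 1
    fi
    # -c counts matching lines, -i ignores case, -E enables alternation.
    grep -ciE 'error|fail|crash' "$file"
}

# Example use on the gateway (paths as given in the post above):
# for f in "$FWDIR/log/usrchkd.elg" "$FWDIR/log/fwucd.elg"; do
#     printf '%s: %s suspect lines\n' "$f" "$(count_suspect_lines "$f")"
# done
```

A count that keeps climbing between two runs is a hint to open the file and read the actual messages.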

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

G_W_Albrecht · ‎2019-05-03

This is found in SMB appliances and SMP, but you speak of a 5400 cluster - so which is it?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Trey_Havener · ‎2019-05-03

Not up on my Check Point product info. It is a 5400 cluster.

Okay, IE and Chrome work, but the page sometimes takes forever to load or the UserCheck block page does not fully load. Firefox will not load it, period. I get this: Secure Connection Failed
The connection to the server was reset while the page was loading. The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.
Our previous firewall never had this problem, and we didn't use HTTPS inspection on it. I thought HTTPS inspection would cause more man-in-the-middle bugs than not having it enabled.

PhoneBoy · ‎2019-05-03

Most likely, you just need to configure Firefox to trust the certificate the gateway is using to serve up the UserCheck portal.
Firefox uses a different certificate store than IE and Chrome on Windows.

PhoneBoy · ‎2019-05-03

Not anymore 😬

Trey_Havener · ‎2019-05-08

Could this be part of my problem?

Time: 2019-05-08T12:20:44Z
Id: ac1f6e54-0100-00c0-5cd2-c99c00000011
Sequencenum: 60
Protection Name: Non Compliant HTTP
Severity: Critical
Confidence Level: Medium
Protection ID: BlockHttpNonProtocolCompliant
Performance Impact: Low
Protection Type: Protocol Anomaly HTTP
Policy Rule UID: 8b7e6663-2382-4d20-98ae-d7425eece7f3
Sub Policy Name: Network
Sub Policy Uid: 688c78ce-c61c-4799-8101-73e9256dd7f8
Reason: Connection queue exceeded max size
Client Type: Other: Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Name: Block HTTP Non Compliant
Source: 172.31.108.39
Source Port: 57170
Destination: 172.31.110.81
Destination Port: 80
IP Protocol: 6
Proxied Source IP: 172.31.108.39
Source Machine Name: hostname@domain.local
Session ID: 0
Action: Reject
Type: Log
Policy Name: Standard
Policy Management: cp-smartappliance
Db Tag: {64DC84C1-EE9B-F649-B404-3092383FFF3B}
Policy Date: 2019-05-07T22:18:50Z
Blade: Firewall
Origin: cp-gateway1
Service: TCP/80
Product Family: Access
Logid: 65537
Resource: http://172.31.110.81/UserCheck/PortalMain?IID=1DE7C584-961B-C9FB-BAFE-F1F5AA48CC3E&origUrl=aHR0cDovL3d3dy5nb29nbGV0YWdzZXJ2aWNlcy5jb20vdGFnL2pzL2dwdC5qcw
Marker: @A@@B@1557291602@C@1472301
Log Server Origin: 172.31.110.240
Orig Log Server Ip: 172.31.110.240
Index Time: 2019-05-08T12:20:45Z
Inspection Settings Log:true
Layer Uuid Rule Uuid: _8b7e6663-2382-4d20-98ae-d7425eece7f3
Access Rule Number: 4
Access Rule Name: Mgmt
Lastupdatetime: 1557318044000
Lastupdateseqnum: 60
Rounded Sent Bytes: 0
Rounded Bytes: 0
Stored: true
Rounded Received Bytes: 0
Description: http Traffic Rejected from User2, User2 (user2)(172.31.108.39) to 172.31.110.81
User: User1, User1 (user1), User2, User2 (user2)
Source User Name: User1, User1 (user1), User2, User2 (user2)
Src User Dn: XXXXXXXXXXWould be src dn...yada
Profile: Go to profile
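
The origUrl parameter in the Resource field looks like base64 with the padding stripped, so decoding it shows which blocked request sent the browser to the portal. A minimal sketch: `decode_orig_url` is a made-up helper name, and the handling of the URL-safe alphabet is an assumption in case the portal ever emits `-` or `_`.

```shell
# Pull the origUrl parameter out of a UserCheck portal URL and decode it.
# Assumption: the value is base64 with the trailing '=' padding stripped,
# possibly in the URL-safe alphabet ('-' and '_' instead of '+' and '/').
decode_orig_url() {
    url=$1
    # Keep only the value of origUrl (everything up to the next '&').
    enc=$(printf '%s\n' "$url" | sed -n 's/.*[?&]origUrl=\([^&]*\).*/\1/p')
    [ -n "$enc" ] || return 1
    # Map the URL-safe alphabet back to the standard one (no-op otherwise).
    enc=$(printf '%s' "$enc" | tr '_-' '/+')
    # Restore padding so the length is a multiple of 4.
    while [ $(( ${#enc} % 4 )) -ne 0 ]; do
        enc="${enc}="
    done
    printf '%s' "$enc" | base64 -d
    echo
}
```

Run against the Resource value in this log entry, it prints http://www.googletagservices.com/tag/js/gpt.js, so this particular portal hit came from a blocked script embedded in a page rather than from a site the user typed in.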

Louis_Poulin · ‎2019-05-09

We have a similar problem where, under load, the UserCheck page doesn't load and times out.

We had this under an R80.20 VSX cluster and we are now running the firewall on an R80.20 cluster. 5000 users, HTTPS inspection enabled. 15600 appliances.

It used to work OK with R80.10, but since R80.20 it has never worked as well as before.

SK85040 was followed with both versions.

TAC hasn't been able to help us so far because no error messages are logged; the process seems to be working too hard to log… we are kinda stuck.

Trey_Havener · ‎2019-05-09

We tried HTTPS inspection for about 30 minutes and had to shut it back down. Stupid credit card terminals shut down...couldn't see any logging from them once it was turned on. The block page seemed to work fine while it was on. It's just turning it on...

Louis_Poulin · ‎2019-05-09

Inspired by

I increased MaxRequestWorkers and ServerLimit from 100 to 256 (which seems to be the default from Apache's point of view, rather than the 28 from Check Point's point of view; maybe it's not the same version?). Since there is 15 GB of RAM free, I considered it to be safe.

usrchkd was restarted.

The UserCheck page has been working for an hour now. We'll see if it lasts.

Hopefully it can help someone!

Louis_Poulin · ‎2019-05-10

After 1 day, the User Check page is still working!

Louis_Poulin · ‎2019-05-13

UserCheck block page keeps loading today…

Restarting the process resolved the issue (mpclient stop UserCheck; mpclient start UserCheck).

We are considering an automated restart of the service by putting those commands in the crontab 🙂

Gregory_Link · ‎2019-07-25

Seeing this same issue after migrating to R80.20.

abihsot__ · ‎2019-07-29

Hi there,

We also had to increase serverlimit and maxrequestworkers way more than it is mentioned in SK. Usercheck keeps complaining about "server reached maxrequestworkers", and we are not even close to 5k users... TAC is just making me laugh when suggesting to block traffic without usercheck interaction.

How many connections do you usually have during peak time?

netstat -anp |grep `mpclient getdata UserCheck |awk '{print $6}'` |wc -l

AlanTen · ‎2019-09-17

This might help you tackle this problem.

I'd spent quite a bit of time digging into this exact problem in our environment and found that, when serving large volumes of block pages to users, it was typically httpd that started failing by leaking connections. Throwing resources at the problem by upping the number of workers or adjusting the session time and garbage collection didn't help in high load conditions because eventually the available workers get saturated.

This can be seen by listing the number of connections hitting your block page:

`netstat -np | egrep '/httpd|-' | egrep 'x.x.x.x:[0-9]{4,5}' | awk '{print $6}' | sort | uniq -c`

If you see a fair bit more CLOSE_WAIT connections than the configured number of httpd workers, then httpd is likely failing to keep up with closing connections, which leaves the connections orphaned and taking too long to age out. A good article to read on this topic is:

https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
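
The CLOSE_WAIT comparison described above can be scripted so it is easy to repeat while tuning. A minimal sketch: `check_close_wait` is a hypothetical helper, and it assumes column 6 of each line is the TCP state, as in Linux `netstat -np` output.

```shell
# Read `netstat -np` style lines on stdin and compare the number of
# CLOSE_WAIT sockets against the configured httpd worker count.
# Assumption: column 6 is the TCP state (Linux net-tools layout).
check_close_wait() {
    workers=$1
    awk -v limit="$workers" '
        $6 == "CLOSE_WAIT" { n++ }
        END {
            n += 0   # force a number even when nothing matched
            if (n > limit) printf "WARN: %d CLOSE_WAIT > %d workers\n", n, limit
            else           printf "OK: %d CLOSE_WAIT <= %d workers\n", n, limit
        }'
}

# Example, reusing the filter from the command above (x.x.x.x is your portal IP):
# netstat -np | egrep '/httpd|-' | egrep 'x.x.x.x:[0-9]{4,5}' | check_close_wait 250
```

The worker count is passed in rather than read from the httpd configuration, since where that setting lives can differ between versions.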

By looking for the IPs with the highest number of blocks I was then able to see what the users were trying to access that is generating the majority of the blocks:

`netstat -np | egrep '/httpd|-' | egrep 'x.x.x.x:[0-9]{4,5}' | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -rn | head -10`

In SmartDashboard I then filtered for a TopX IP and action:redirect, then used the "Top Destinations" in the Filters Pane to narrow things down.

In my most recent case there was a particular advertising domain, dt[.]adsafeprotected[.]com, that was incorrectly categorised as an "Inactive Site" instead of "Web Advertising" (still waiting for the recategorisation), resulting in blocks that were typically not visible to the users. Since the domain is heavily used on a number of major sites, it accounted for around 75% of the block pages being served. Site categorisation overrides can be done using a custom site/app or an override object.

I wrote a quick and dirty script (attached) to aid investigations into how tuning the various known/suggested settings affected the overall performance and to monitor the state of the connections and Top10 users. I ran it through watch:

`watch -n 5 ./httpd_session_info.sh 2> /dev/null`

What became clear was that unless your /opt/CPUserCheckPortal/session directory has exorbitant numbers of files in it you should not have to adjust the php.ini file settings aside from adjusting the lifetime down to 1800 as mentioned in sk98773. I had set mine to 1200 (20 mins) with a garbage collection ratio of 2/100 and had pushed the workers up to 400 and still httpd lost the battle.

If, after working through the heavy hitters, you get to the point that the hits to your block page are valid and due to 'normal usage', then you can start upping the worker count to accommodate the volume. What you want to aim for is enough workers to allow the system to recover on its own; ideally the max number of workers should not be hit for too long.

I settled on 250 workers for about 1.5-2k users, which allowed enough headroom for the UserCheck portal to recover on its own. Long before I solved this problem for our clusters I'd written a Python script that monitors the MPClient portals' responsiveness: if a portal takes more than 8 seconds to respond it's considered down, and the script fires off an email notification to alert me, which thankfully has not triggered in the last week and a half 😄
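
For anyone without such a monitor, the same 8-second rule can be approximated in a few lines of shell. This is a rough sketch and not the Python script mentioned above: `portal_status` and the `CURL` override variable are hypothetical names, and it assumes a curl-compatible binary is available (on Gaia that is typically `curl_cli`, so set `CURL` accordingly).

```shell
# Report whether a portal URL answers within a time limit (default 8 s).
# Assumptions: a curl-compatible binary exists (override via CURL) and
# the URL you pass is your own UserCheck portal address.
portal_status() {
    url=$1
    limit=${2:-8}
    # -k: skip certificate verification; -s: quiet; -o: discard the body.
    # --max-time makes curl give up (non-zero exit) after $limit seconds.
    if "${CURL:-curl}" -k -s -o /dev/null --max-time "$limit" "$url"; then
        echo "up"
    else
        echo "down"
        return 1
    fi
}

# Example: portal_status "https://x.x.x.x/UserCheck/" 8 || echo "portal slow or down"
```

Because the function returns non-zero when the portal is down, it can be run from cron and chained with whatever alerting you already have, or with the mpclient stop/start pair mentioned earlier in the thread if you decide an automatic restart is acceptable.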

abihsot__ · ‎2020-01-22

Finally!!! Check Point did something with UserCheck, although there is nothing in the JHF release notes. Once we deployed R80.30 JHF111, the number of httpd processes went so low that I was afraid the monitoring script had stopped working 😄 Now we can slowly return to normal PHP session and MaxRequestWorkers values...

Timothy_Hall · ‎2020-01-22

Thanks for the follow-up, sounds like a lot of user-space processes on the gateway have been getting some love in the latest Jumbo HFAs, including the critical Resource Advisor Daemon (rad) which many blades rely upon for timely categorization responses. The rad daemon went multi-threaded in R80.30 Jumbo HFA Take 107+, see sk163793: How to scale up requests/responses RAD handling rates.

Raphael_Cote · ‎2020-01-22

It's really nice that the RAD fix is finally included now in the JHF, we worked on that for over a year to find a solution!!
