This might help you tackle this problem.
I spent quite a bit of time digging into this exact problem in our environment and found that, when serving large volumes of block pages to users, it was typically httpd that started failing by leaking connections. Throwing resources at the problem by upping the number of workers or adjusting the session lifetime and garbage collection didn't help under high load, because the available workers eventually became saturated.
You can see this by listing the connections hitting your block page, grouped by state (replace x.x.x.x with the IP serving the block page):
`netstat -np | egrep '/httpd|-' | egrep 'x.x.x.x:[0-9]{4,5}' | awk '{print $6}' | sort | uniq -c`
If you see noticeably more CLOSE_WAIT connections than the configured number of httpd workers, then httpd is likely failing to keep up with closing connections, which leaves connections orphaned until they eventually age out. A quick way to compare the two is sketched below the link. A good article to read on this topic is:
https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
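In case it helps, here is one way to do that comparison: count the CLOSE_WAIT connections from netstat and hold that against the worker limit in the portal's httpd config (the config path is a placeholder, as is x.x.x.x):

```
# Block page connections stuck in CLOSE_WAIT (x.x.x.x = portal IP)
netstat -np | egrep '/httpd|-' | egrep 'x.x.x.x:[0-9]{4,5}' | awk '$6=="CLOSE_WAIT"' | wc -l

# Configured worker limit - locate the portal's httpd.conf on your gateway first
grep -Ei 'ServerLimit|MaxClients' /path/to/usercheck/httpd.conf
```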
Looking for the IPs with the highest number of blocks then let me see what users were trying to access that was generating the majority of the blocks:
`netstat -np | egrep '/httpd|-' | egrep 'x.x.x.x:[0-9]{4,5}' | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -rn | head -10`
In SmartDashboard I then filtered on one of those top IPs plus action:redirect, and used "Top Destinations" in the Filters pane to narrow things down.
In my most recent case there was a particular advertising domain, dt[.]adsafeprotected[.]com, that is (still waiting for a recategorisation of the site) incorrectly categorised as an "Inactive Site" instead of "Web Advertising", resulting in blocks that were typically not visible to the users. Since the domain is heavily used by a number of major sites, it accounted for around 75% of the block pages being served. Site categorisation overrides can be done using a custom site/app or an override object.
I wrote a quick and dirty script (attached) to help investigate how tuning the various known/suggested settings affected overall performance, and to monitor the connection states and top 10 users. I ran it through watch:
`watch -n 5 ./httpd_session_info.sh 2> /dev/null`
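The attached script is the one I actually used; for anyone who just wants the gist, a stripped-down sketch of what it reports looks roughly like this (the portal IP is a placeholder and the session path is the one mentioned below):

```bash
#!/bin/bash
# httpd_session_info.sh - rough sketch only; the attached script is more complete.
# Replace PORTAL_IP with your block page / UserCheck portal IP.
PORTAL_IP="x.x.x.x"
SESSION_DIR="/opt/CPUserCheckPortal/session"

echo "== Connection states to ${PORTAL_IP} =="
netstat -np | egrep '/httpd|-' | egrep "${PORTAL_IP}:[0-9]{4,5}" \
    | awk '{print $6}' | sort | uniq -c | sort -rn

echo
echo "== Top 10 client IPs =="
netstat -np | egrep '/httpd|-' | egrep "${PORTAL_IP}:[0-9]{4,5}" \
    | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | sort -rn | head -10

echo
echo "== PHP session files in ${SESSION_DIR} =="
ls "${SESSION_DIR}" 2>/dev/null | wc -l

echo
echo "== httpd process count =="
ps -ef | grep -c '[h]ttpd'
```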
What became clear was that unless your /opt/CPUserCheckPortal/session directory has an exorbitant number of files in it, you should not have to adjust the php.ini settings aside from dropping the session lifetime down to 1800 seconds as mentioned in sk98773. I had set mine to 1200 (20 mins) with a garbage collection ratio of 2/100 and had pushed the workers up to 400, and httpd still lost the battle.
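For reference, the values being discussed map to standard php.ini directives; that section of the portal's php.ini would look something like this (values shown are the ones from sk98773 and my own test, not a recommendation):

```
; session lifetime in seconds - sk98773 suggests 1800; I had tried 1200
session.gc_maxlifetime = 1800

; garbage collection ratio of 2/100: gc runs on roughly 2% of requests
session.gc_probability = 2
session.gc_divisor = 100
```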
If, after working through the heavy hitters, you get to the point where the hits to your block page are valid and due to 'normal usage', then you can start upping the worker count to accommodate the volume. What you want to aim for is enough workers to let the system recover on its own; ideally the maximum worker count should not be hit for long. The relevant directives are sketched below.
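Assuming the portal's httpd honours the standard Apache prefork directives (treat the file location and exact directive set as assumptions for your version), raising the worker count looks along these lines:

```
# In the UserCheck portal's httpd.conf - raise both limits together
ServerLimit   250
MaxClients    250
```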
I settled on 250 workers for about 1.5-2k users, which gave enough headroom for the UserCheck portal to recover on its own. Long before I solved this problem for our clusters, I'd written a Python script that monitors the MPClient portals' responsiveness: if a portal takes more than 8 seconds to respond it's considered down, and the script fires off an email notification to alert me, which thankfully has not triggered in the last week and a half 😄 (a rough outline of the idea is sketched below).
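The actual monitor is a Python script; a bare-bones shell sketch of the same idea (the portal URL, timeout and mail command are all assumptions/placeholders) looks roughly like this:

```bash
#!/bin/bash
# Rough sketch of the portal responsiveness check - the real monitor is a Python script.
PORTAL_URL="https://x.x.x.x/UserCheck/"   # placeholder portal URL
TIMEOUT=8                                  # seconds; beyond this the portal is treated as down
ALERT_TO="admin@example.com"               # placeholder recipient

# -k: portal certificate is usually self-signed; -s: quiet; -o /dev/null: discard body
if ! curl -ks -o /dev/null --max-time "${TIMEOUT}" "${PORTAL_URL}"; then
    echo "UserCheck portal ${PORTAL_URL} did not respond within ${TIMEOUT}s" \
        | mail -s "UserCheck portal unresponsive" "${ALERT_TO}"
fi
```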