SmartLog becomes slow / unusable

Kaspars_Zibarts · ‎2018-04-05

This is a long shot, and the Check Point crew are not going to like me since very little detail or investigation has gone into it, but it's worth a try.

In a nutshell: our SmartLog searches occasionally become so slow that SmartLog is unusable.

We run an MLM (VM, 16 cores / 128GB / 6TB) with a handful of CLMs, R80.10 take 42, log rate averaging 10k/sec.

We had a case open, but it led nowhere (due to a late initial response from R&D and then a lack of time and resources on our part, as it was around Xmas).

Typically an MLM reboot fixes it for a while, and then it comes back.

We were waiting for R80.20, which promises a faster file system, but my colleague pointed out a very interesting fact: we used Tufin a lot to make firewall changes back in December, and we had a lot of SmartLog "slowness". Then we backed off from Tufin changes for a while in Feb/Mar, and log search became noticeably better.

Yesterday I did a batch of changes through Tufin SecureChange (9-10AM), and log search ground to a halt again; we were forced to reboot the MLM around 1PM.

As far as I know, Tufin uses the API to make changes in R80.10 management. Could it be that long / complex API queries / results somehow clog up available resources on the MDS/CMAs? We also noticed that every time I run some API scripts (non-Tufin), it has the same impact on SmartLog performance.

I know Tomer Sole, you might want to say something as you love the API.

And don't get me wrong: I love both SmartLog and the API, so I'm not whinging, I just want to hear from others. I will keep digging.

Timothy_Hall · ‎2018-04-05

The processes associated with log indexing/searching and the API that may be related to your problem are:

cpm
SOLR/java_solr
RFL/LogCore
INDEXER/log_indexer
SmartLog_Server

What I would suggest is baselining these key processes when search performance is good in regards to:

CPU & Memory Usage (Use ps and top commands)
Disk Usage - Unfortunately the current Gaia kernel does not support the use of the iotop command, so there is no direct way to view disk utilization per process; you will be stuck looking at system-wide disk I/O stats with iostat and having to infer what is going on.
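The baselining step can be sketched roughly as below. This is only an illustration, not a Check Point tool: it assumes Python 3.7+ is available wherever it runs, and the name patterns are simply the process names listed above, matched as whole tokens in the `ps` command line. How these processes actually appear in `ps` output on a given MLM is an assumption and may need tuning. The helper names (`take_snapshot`, `parse_snapshot`, `compare`) are made up for this example.

```python
import re
import subprocess

# Names taken from the process list above. Whether they appear verbatim in
# `ps` output on a given system is an ASSUMPTION; adjust as needed.
WATCHED = ("cpm", "java_solr", "LogCore", "log_indexer", "SmartLog_Server")


def take_snapshot():
    """Return raw `ps` output: PID, %CPU, resident memory (KiB), command line."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,rss,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_snapshot(text, watched=WATCHED):
    """Sum %CPU and RSS per watched name from `ps -eo pid,pcpu,rss,args` output."""
    totals = {name: {"cpu": 0.0, "rss_kib": 0, "count": 0} for name in watched}
    # Match each name only as a whole token (not inside a longer word).
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        for name in watched
    }
    for line in text.splitlines()[1:]:  # first line is the ps header
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        _pid, cpu, rss, args = parts
        try:
            cpu_val, rss_val = float(cpu), int(rss)
        except ValueError:
            continue  # skip lines that do not look like ps rows
        for name, pattern in patterns.items():
            if pattern.search(args):
                totals[name]["cpu"] += cpu_val
                totals[name]["rss_kib"] += rss_val
                totals[name]["count"] += 1
                break  # count each process once
    return totals


def compare(baseline, current):
    """Return the change in CPU and RSS per watched name (current - baseline)."""
    return {
        name: {
            "cpu": round(current[name]["cpu"] - baseline[name]["cpu"], 2),
            "rss_kib": current[name]["rss_kib"] - baseline[name]["rss_kib"],
        }
        for name in baseline
    }
```

One way to use it: save `parse_snapshot(take_snapshot())` while searches are fast, take another snapshot while they are slow, and look at `compare(good, bad)` for the name whose CPU or memory grew the most. Disk I/O is not covered here; as noted above, that still has to be inferred from system-wide `iostat` figures.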

Then wait for a period of terrible search performance (or induce it with a bunch of Tufin queries) then examine how these processes have changed. If you spot one that is chewing up lots of CPU/memory/disk, try GENTLY killing that process with kill -15 (not -9 unless necessary), give the killed process 60 seconds to respawn, wait another 60 seconds for it to fully initialize, and see if good search performance has returned.
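The "gentle kill" step could look something like the sketch below. It is a hedged illustration only: `terminate_gently` and `pid_exists` are hypothetical helper names, the 60-second grace period mirrors the suggestion above, and the injectable `send`, `alive` and `sleep` parameters exist purely so the logic can be tried without touching a real process. It sends SIGTERM (the equivalent of `kill -15`) and never escalates to SIGKILL; whether and how the process respawns afterwards is up to the system, and is not handled here.

```python
import os
import signal
import time


def pid_exists(pid):
    """True if a process with this PID exists (signal 0 only probes, it does not kill)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def terminate_gently(pid, grace=60, poll=1.0,
                     send=os.kill, alive=pid_exists, sleep=time.sleep):
    """Send SIGTERM and wait up to `grace` seconds for the PID to disappear.

    Returns True if the process exited within the grace period, False if it
    is still running. SIGKILL (-9) is deliberately never sent.
    """
    send(pid, signal.SIGTERM)
    waited = 0.0
    while waited < grace:
        if not alive(pid):
            return True
        sleep(poll)
        waited += poll
    return not alive(pid)
```

If this returns False, the decision to go further (for example `kill -9`) is best left to a human, per the "not -9 unless necessary" advice above.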

Once you have identified the process that may be going out to lunch on you, further debugging can be attempted.


Kaspars_Zibarts · ‎2018-04-05

Funnily enough, that's exactly what I wrote a script for: collecting stats before and after, plus checking for any zombie processes. But then I had no time to "execute" it. Tomorrow it is.
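For the zombie check, a minimal sketch might look like this. It is not the actual script mentioned above, just an illustration: `find_zombies` is a made-up name, and it expects the text output of `ps -eo pid,ppid,stat,comm`, where a state starting with `Z` marks a zombie (defunct) process.

```python
def find_zombies(ps_text):
    """Return (pid, ppid, command) for each process whose state starts with 'Z'.

    `ps_text` is expected to be the output of `ps -eo pid,ppid,stat,comm`.
    """
    zombies = []
    for line in ps_text.splitlines()[1:]:  # first line is the ps header
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[2].startswith("Z"):
            zombies.append((int(parts[0]), int(parts[1]), parts[3]))
    return zombies
```

The parent PID is included because a zombie can only be cleaned up by its parent, so that is usually the process worth looking at.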

Oren_Koren · ‎2018-04-05

Hey,

I want to take this up internally as well (with BizDev & Tufin).

Please send me a mail at orenkor@checkpoint.com on this topic (I can't find your mail address in your profile).

Thanks,

Oren

Kaspars_Zibarts · ‎2018-04-10

Thanks Oren Koren and Tomer Sole for such a quick response! You guys saved my day (without an SR!)

In a nutshell: the MDM/MLM upgrade to take 91 seems to have resolved our slow log issues when using Tufin or API scripts.

Some notes that I'm copying from Tomer's emails, which made the correlation:

Tufin uses an external database to model the policy based on Check Point logs, so I can see how the log server could hit some thresholds.

Did you move your Management to R80.10 Jumbo Hotfix Take 70 from Jan. 15? A performance issue with the Management API regarding very large groups was resolved in that update.

Tomer_Sole · ‎2018-04-10

Cool, also thanks for the badge!

I'm also happy to hear the via-Tufin slowness was resolved. We value Tufin as a great technology partner of Check Point.
