I am running into a strange issue.
I have an R80.10 manager, Take 91, on a Dell R730 server. Check Point is installed on a 1 TB SSD (Samsung 850 Pro), with an extended 4 TB RAID 5 /var/log partition built from the same model of drives. It manages 12 firewalls.
I have two blades active on the manager: logging and status, plus management, and that is it. I have log indexing enabled. I also have a separate SmartEvent server that uses this manager as the log source for its correlation unit. Nothing real fancy.
uptime 20:47:33 up 11:17,
cpstat -f indexer mg
Total Read Logs: 19304312
Total Updates and Logs Indexed: 19304272
Total Read Logs Errors: 0
Total Updates and Logs Indexed Errors: 9000
Updates and Logs Indexed Rate: 361
Read Logs Rate: 359
Updates and Logs Indexed Rate (10min): 321
Read Logs Rate (10min): 322
Updates and Logs Indexed Rate (60min): 313
Read Logs Rate (60min): 313
Updates and Logs Indexed Rate Peak: 11197
Read Logs Rate Peak: 10951
Read Logs Delay: 0
I am suffering from some bad performance issues. Looking at top, the load average can hit 10+ on a 12-core box, and the culprit is very high CPU I/O wait time.
I have no issues with RAM; I have 32 GB. I also have 2 CPUs with 6 cores each, so no issue that I can see there.
However, iostat shows a disk I/O bottleneck: write throughput is low (5-10 MB/s), yet the %util column peaks at 100% on the drives.
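For reference, this is the kind of quick check I've been using to spot the pattern. It's a minimal awk sketch against sample `iostat -x` output; the device names and numbers below are made up, and the column positions assume the sysstat layout that ends in %util (with wkB/s in column 7 and await in column 10), so verify against your own header line before trusting it:

```shell
# Flag devices that are nearly saturated (%util > 90) while writing
# under 10 MB/s -- i.e. latency-bound, not throughput-bound.
# Sample data below is fabricated for illustration.
iostat_sample='Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 12.00 1.00 250.00 4.00 8000.00 60.00 8.50 35.00 3.90 99.80
sdb 0.00 0.50 0.20 5.00 1.00 120.00 40.00 0.01 0.40 0.30 0.20'

# $7 = wkB/s, $10 = await, $12 = %util in this layout.
flagged=$(echo "$iostat_sample" | awk 'NR > 1 && $12 + 0 > 90 && $7 / 1024 < 10 {
    printf "%s: %.1f MB/s written but %s%% util, await=%sms\n", $1, $7 / 1024, $12, $10
}')
echo "$flagged"   # prints: sda: 7.8 MB/s written but 99.80% util, await=35.00ms
```

If await is high while throughput stays low, the array is choking on small random writes, not on raw bandwidth.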
Policy installs take at least 10 minutes, SmartConsole freezes, and it's just not very usable at this point. I think I am losing logs, and I'm also seeing weird issues on our firewalls as a result of these performance problems. None of my firewalls have high CPU load.
I had two boxes before, in VMware: a manager (with nothing else) on R80.10, and a dedicated SmartLog server on R77.30. I actually had no issues on the manager (policy installs took a couple of minutes or less), but we did have some weird issues with the log server. I suspect that could have been disk I/O bottlenecks from VMware, but I'm not 100% sure. I just wanted to move everything to one box, and I happened to come across this hardware, so I figured why not. I wanted to take advantage of the logging features in the R80.10 SmartConsole and save the SAN space.
We are a mid-range enterprise, and there is no way I should be coming anywhere close to taxing these drives. The issue seems to be that Check Point is trying to use the drives, but it is continuously waiting on the disk, because the system thinks the disk is busy for some reason.
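One thing I keep coming back to is the RAID 5 small-write penalty: each small random write costs roughly four back-end I/Os (read old data, read old parity, write new data, write new parity). The arithmetic below is only an illustration with assumed numbers (4 KB writes at the 5 MB/s I'm seeing), not measurements from this box, but it shows how a "tiny" write load can still hammer the array; it's probably also worth confirming the controller's write cache isn't stuck in write-through mode.

```shell
# Assumed figures for illustration: 5 MB/s of 4 KB random writes.
front_end_mb=5   # observed write throughput, MB/s
io_size_kb=4     # assumed average write size for log/index traffic

result=$(awk -v mb="$front_end_mb" -v kb="$io_size_kb" 'BEGIN {
    wps = mb * 1024 / kb   # front-end writes per second
    backend = wps * 4      # RAID 5 small-write penalty: ~4 back-end I/Os each
    printf "%d front-end w/s -> ~%d back-end IOPS", wps, backend
}')
echo "$result"   # prints: 1280 front-end w/s -> ~5120 back-end IOPS
```

So even single-digit MB/s of small writes can translate into thousands of back-end IOPS against the RAID 5 set.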
I am at a loss. This seems like something at the Linux level (more so than the Check Point level) that I just can't seem to wrap my head around. That, and a lot of the tools for this are not installed on Check Point.
I am starting to think this just is not meant to work. Would I see this behavior if these were not SSDs? It's as if I don't have the right file system, or the appropriate Linux kernel version, to fully take advantage of these drives. Maybe it's a firmware issue on my drives? But then again, the Check Point appliances ship with SSDs and run the same code, so I don't know why this wouldn't work on an open server that is on the HCL.

I did tweak FWASYNC_MAXBUFF up to 800 MB of memory. That helped performance a little, since the buffer the processes use to communicate was hardcoded and set too low. I even wondered whether I just had too much RAM and CPU: I started off with 132 GB of RAM and 24 CPUs, was told that might somehow be inefficient, and took it down to 12 cores and 32 GB of RAM, but there was no change.
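On the "is the kernel treating these drives right" question, a couple of things can at least be read straight out of sysfs without installing anything. The paths below are the standard Linux sysfs layout, and the loop just reports every block device; the RAID virtual disk's name will be whatever your controller presents. One caveat: behind a hardware RAID controller the OS only sees the virtual disk, so `rotational` may report 1 even though the members are SSDs.

```shell
# Report, per block device, whether the kernel treats it as rotational
# (0 = SSD-like) and which I/O scheduler is active (the bracketed entry).
disk_report=$(for q in /sys/block/*/queue; do
    [ -e "$q/rotational" ] || continue
    dev=$(basename "$(dirname "$q")")
    printf '%s rotational=%s scheduler=%s\n' \
        "$dev" \
        "$(cat "$q/rotational")" \
        "$(cat "$q/scheduler" 2>/dev/null || echo 'n/a')"
done)
echo "$disk_report"
```

If the active scheduler is cfq on an SSD-backed volume, deadline or noop is usually considered a better fit, though I'd raise that with TAC rather than change it blind.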
I have a ticket open with TAC now, but I am not really getting anywhere at this point. I just wanted to get some thoughts from the good old CP community, to see if there are any ideas that could point me in the right direction to troubleshoot this further.
Any help would be greatly appreciated.