I am running into a strange issue.
I have an R80.10 manager, Take 91, on a Dell R730 server. Check Point is installed on a 1 TB SSD (Samsung 850 Pro), with an extended 4 TB RAID 5 /var/log partition built from the same model of drives. It manages 12 firewalls.
I have two blades active on the manager: logging and status, plus management, and that is it. I have log indexing enabled. I also have a separate SmartEvent server that uses this manager as the log source for its correlation unit. Nothing real fancy.
uptime 20:47:33 up 11:17,
cpstat -f indexer mg
Total Read Logs: 19304312
Total Updates and Logs Indexed: 19304272
Total Read Logs Errors: 0
Total Updates and Logs Indexed Errors: 9000
Updates and Logs Indexed Rate: 361
Read Logs Rate: 359
Updates and Logs Indexed Rate (10min): 321
Read Logs Rate (10min): 322
Updates and Logs Indexed Rate (60min): 313
Read Logs Rate (60min): 313
Updates and Logs Indexed Rate Peak: 11197
Read Logs Rate Peak: 10951
Read Logs Delay: 0
I am suffering from some bad performance issues. Looking at top, the load average can hit 10+ on a 12-core box, and the culprit is very high CPU I/O wait time.
I have no issues with RAM; I have 32 GB. I also have 2 CPUs with 6 cores each, so no issue that I can see there.
However, iostat shows a disk I/O bottleneck: write throughput is low (5-10 MB/s), yet the %util column peaks at 100% on the drives.
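For reference, this is the kind of quick check I've been using to spot the pattern. It's a minimal awk sketch against sample `iostat -x` output; the device names and numbers below are made up, and the column positions assume the sysstat layout that ends in %util (with wkB/s in column 7 and await in column 10), so verify against your own header line before trusting it:

```shell
# Flag devices that are nearly saturated (%util > 90) while writing
# under 10 MB/s -- i.e. latency-bound, not throughput-bound.
# Sample data below is fabricated for illustration.
iostat_sample='Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 12.00 1.00 250.00 4.00 8000.00 60.00 8.50 35.00 3.90 99.80
sdb 0.00 0.50 0.20 5.00 1.00 120.00 40.00 0.01 0.40 0.30 0.20'

# $7 = wkB/s, $10 = await, $12 = %util in this layout.
flagged=$(echo "$iostat_sample" | awk 'NR > 1 && $12 + 0 > 90 && $7 / 1024 < 10 {
    printf "%s: %.1f MB/s written but %s%% util, await=%sms\n", $1, $7 / 1024, $12, $10
}')
echo "$flagged"   # prints: sda: 7.8 MB/s written but 99.80% util, await=35.00ms
```

If await is high while throughput stays low, the array is choking on small random writes, not on raw bandwidth.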
Policy installs take at least 10 minutes, SmartConsole freezes, and it's just not very usable at this point. I think I am losing logs, and I'm also seeing weird issues on our firewalls as a result of these performance problems. None of my firewalls have high CPU load.
I had two boxes before, in VMware: a manager (with nothing else) on R80.10, and a dedicated SmartLog server on R77.30. I actually had no issues on the manager (policy installs took a couple of minutes or less), but we did have some weird issues with the log server. I suspect that could have been disk I/O bottlenecks from VMware, but I'm not 100% sure. I just wanted to move everything to one box, and I happened to come across this hardware, so I figured why not. I wanted to take advantage of the logging features in the R80.10 SmartConsole and save the SAN space.
We are a mid-range enterprise, and there is no way I should be coming anywhere close to taxing these drives. The issue seems to be that Check Point is trying to use the drives, but it is continuously waiting on the disk, because the system thinks the disk is busy for some reason.
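One thing I keep coming back to is the RAID 5 small-write penalty: each small random write costs roughly four back-end I/Os (read old data, read old parity, write new data, write new parity). The arithmetic below is only an illustration with assumed numbers (4 KB writes at the 5 MB/s I'm seeing), not measurements from this box, but it shows how a "tiny" write load can still hammer the array; it's probably also worth confirming the controller's write cache isn't stuck in write-through mode.

```shell
# Assumed figures for illustration: 5 MB/s of 4 KB random writes.
front_end_mb=5   # observed write throughput, MB/s
io_size_kb=4     # assumed average write size for log/index traffic

result=$(awk -v mb="$front_end_mb" -v kb="$io_size_kb" 'BEGIN {
    wps = mb * 1024 / kb   # front-end writes per second
    backend = wps * 4      # RAID 5 small-write penalty: ~4 back-end I/Os each
    printf "%d front-end w/s -> ~%d back-end IOPS", wps, backend
}')
echo "$result"   # prints: 1280 front-end w/s -> ~5120 back-end IOPS
```

So even single-digit MB/s of small writes can translate into thousands of back-end IOPS against the RAID 5 set.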
I am at a loss. This seems like something at the Linux level (more so than the Check Point level) that I just can't seem to wrap my head around. That, and a lot of the tools for this are not installed on Check Point.
I am starting to think this just is not meant to work. Would I see this behavior if these were not SSDs? It's as if I don't have the right file system, or the appropriate Linux kernel version, to fully take advantage of these drives. Maybe it's a firmware issue on my drives? But then again, the Check Point appliances ship with SSDs and run the same code, so I don't know why this wouldn't work on an open server that is on the HCL.

I did tweak FWASYNC_MAXBUFF up to 800 MB of memory. That helped performance a little, since the buffer the processes use to communicate was hardcoded and set too low. I even wondered whether I just had too much RAM and CPU: I started off with 132 GB of RAM and 24 CPUs, was told that might somehow be inefficient, and took it down to 12 cores and 32 GB of RAM, but there was no change.
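On the "is the kernel treating these drives right" question, a couple of things can at least be read straight out of sysfs without installing anything. The paths below are the standard Linux sysfs layout, and the loop just reports every block device; the RAID virtual disk's name will be whatever your controller presents. One caveat: behind a hardware RAID controller the OS only sees the virtual disk, so `rotational` may report 1 even though the members are SSDs.

```shell
# Report, per block device, whether the kernel treats it as rotational
# (0 = SSD-like) and which I/O scheduler is active (the bracketed entry).
disk_report=$(for q in /sys/block/*/queue; do
    [ -e "$q/rotational" ] || continue
    dev=$(basename "$(dirname "$q")")
    printf '%s rotational=%s scheduler=%s\n' \
        "$dev" \
        "$(cat "$q/rotational")" \
        "$(cat "$q/scheduler" 2>/dev/null || echo 'n/a')"
done)
echo "$disk_report"
```

If the active scheduler is cfq on an SSD-backed volume, deadline or noop is usually considered a better fit, though I'd raise that with TAC rather than change it blind.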
I have a ticket open with TAC now, but I am not really getting anywhere at this point. I just wanted to get some thoughts from the good old CP community, to see if there are any ideas that could point me in the right direction to troubleshoot this further.
Any help would be greatly appreciated.