Solved: Log rate poll

Kaspars_Zibarts · ‎2021-09-21

I'm just curious how far you are able to push your log server / MLM when it comes to total events arriving per second?

Reference to the SK and a quick script to see the log rate here: Log rate calculator (I have to say that I used SK88681 - the numbers seem to differ a lot compared to SK120341!)

It seems like we have hit the "roof" going above 10,000 logs per second arriving on our log server. By hitting the roof I mean that we saw a noticeable delay in events showing up in SmartLog (more than 20 minutes), plus we could see that the FWD handling this particular CLM / log server ran at 100% CPU.

A quick workaround was to change the log target on some gateways and point them to the secondary MDS/CMA instead of the MLM/CLM, so the FWD load got distributed across two different servers. We returned to "normal" after indexing caught up with the backlog.

We are running log server / MLM on ESX VM: 16 cores, 128GB RAM

From sk173768 - Testing Disk Drive Read/Write Speeds I get 300MB/s read and 100MB/s write speeds (that's on running busy system!)

Just wondering if you have managed to push the incoming log rate any higher?
Any tips on how to get around a single-threaded FWD running at 100% CPU?
Share your numbers 🙂

Before you say to reduce logging - we love our logs, such a great troubleshooting tool! 🙂

Henrik_Noerr1 · ‎2021-09-22

We are struggling with this as well, with fwd running at 100% on the gateways.

But in general all cores on the MLM are 100% loaded - a 32-core server with 256GB RAM.

Running R80.30 Take 217.

I saw an SK (or was it a Jumbo?) saying that log_indexer is now multithreaded with 4 threads, but I cannot seem to find it again.

Log rate per second, measuring for 5s
----------------------------------
customer1-CLM-01: 0
customer2-clm-01: 21376
customer3-clm-01: 6452
customer4-clm-01: 7708
customer5-clm-01: 5820
customer6-vsxclm-01: 39
customer7-clm-01: 5864
customer8-clm-01: 15980
customer9-clm-01: 31
----------------------------------
Total rate: 63270
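For anyone comparing numbers, here is a minimal sketch that totals per-domain output like the above and flags domains above a chosen threshold. It is not the SK script itself: the function name `summarize_rates` and the default threshold of 10,000 logs/sec (the "roof" mentioned earlier in the thread) are assumptions made for illustration, and it only expects `name: rate` lines on stdin.

```shell
#!/bin/bash
# Hypothetical helper: sum "name: rate" lines read from stdin and flag
# any domain whose rate exceeds a threshold (logs/sec, default 10000).
summarize_rates() {
    local threshold="${1:-10000}"
    awk -F': *' -v t="$threshold" '
        # Skip headers, separator lines and any existing "Total rate" line.
        NF == 2 && $1 != "Total rate" && $2 ~ /^[0-9]+$/ {
            total += $2
            if ($2 > t) printf "above %d: %s (%d)\n", t, $1, $2
        }
        END { printf "Total rate: %d\n", total }'
}

# Usage, assuming the per-domain output was saved to rates.txt:
#   summarize_rates 10000 < rates.txt
```

Fed the nine domain lines above, it flags customer2-clm-01 and customer8-clm-01 and reports the same total of 63270.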

With that said - no, I don't really have any great tips. We are looking into getting two more MLMs and spreading the load.

It doesn't fix the 100% fwd load though.

Kaspars_Zibarts · ‎2021-09-22

Thanks for replying @Henrik_Noerr1 ! really appreciated

I would be keen to know the FWD load on customer8-clm-01 if possible? The one running at a 15k log rate.

I.e. get FWD PID: ps aux | grep customer8-clm-01 | grep fwd

And then use PID (for example 12345) in the top: top -b -d 3 -p 12345 | grep 12345

Plus could you check the CPU: cat /proc/cpuinfo | grep "model name" | head -1
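Since individual `top` samples jump around quite a bit, it can help to average them. A minimal sketch, assuming the default batch-mode column layout where %CPU is the 9th field; the function name `avg_cpu` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: average the %CPU column (field 9 in default
# batch-mode top output) over all lines belonging to one PID.
# Reads top output from stdin.
avg_cpu() {
    local pid="$1"
    awk -v pid="$pid" '
        $1 == pid { sum += $9; n++ }
        END {
            if (n) printf "%.1f\n", sum / n
            else   print "no samples"
        }'
}

# Usage: take 20 samples, 3 seconds apart, for PID 12345:
#   top -b -d 3 -n 20 -p 12345 | avg_cpu 12345
```

The `-n 20` makes `top` exit after 20 iterations, so the pipeline terminates on its own instead of needing Ctrl-C.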

I just did some tests pointing all my gateways to the busy CLM instead of splitting the logging, and it actually ran OK for a while at 10,000 logs/sec with CPU sitting at approx 70%. But after a while CPU just went through the roof even though the log rate was still at 10k, so I'm digging into it now.

Henrik_Noerr1 · ‎2021-09-23

domain X current log rate: 16500

[Expert@mlm-01:0]# top -b -d 3 -p 105453 | grep 105453
105453 admin 20 0 1196712 541184 12288 S 40.0 0.2 38088:10 0 fwd
105453 admin 20 0 1196712 544876 12288 S 22.3 0.2 38088:11 16 fwd
105453 admin 20 0 1196712 548168 12288 S 22.3 0.2 38088:11 0 fwd
105453 admin 20 0 1196712 551864 12288 S 35.5 0.2 38088:12 0 fwd
105453 admin 20 0 1196712 555596 12288 S 28.0 0.2 38088:13 0 fwd
105453 admin 20 0 1196712 559472 12288 R 38.5 0.2 38088:14 0 fwd
105453 admin 20 0 1196712 562884 12288 S 31.7 0.2 38088:15 0 fwd
105453 admin 20 0 1196712 566352 12288 S 27.6 0.2 38088:16 16 fwd

domain Y current log rate: 22000
[Expert@mlm-01:0]# top -b -d 3 -p 105649 | grep 105649
105649 admin 20 0 1370728 810560 46392 R 100.0 0.3 55546:15 23 fwd
105649 admin 20 0 1370728 814492 50324 R 98.7 0.3 55546:18 29 fwd
105649 admin 20 0 1370728 820696 56528 R 100.0 0.3 55546:21 29 fwd
105649 admin 20 0 1370728 826504 62336 R 98.7 0.3 55546:24 29 fwd
105649 admin 20 0 1370728 830728 66560 R 99.3 0.3 55546:27 29 fwd
105649 admin 20 0 1370764 800160 35992 R 99.0 0.3 55546:30 31 fwd
105649 admin 20 0 1370764 806496 42328 R 99.7 0.3 55546:33 31 fwd
105649 admin 20 0 1370764 810720 46552 R 97.0 0.3 55546:36 20 fwd
105649 admin 20 0 1370764 815972 51804 R 99.3 0.3 55546:39 4 fwd

[Expert@mlm-01:0]# cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz

Kaspars_Zibarts · ‎2021-09-23

Cool thanks! Really appreciated! Now I know that we need faster processors in VM!

yours is bigger than mine haha!

Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz

Henrik_Noerr1 · ‎2021-09-23

yeah - raw mhz will win in a singlethreaded application 🙂

Now if we could just have a multithreaded fwd, but that's surely 2024. We will have 5 MLMs at that time.

/Henrik

Kaspars_Zibarts · ‎2021-09-23

Indeed! Or create another domain... Luckily I can offload some logs to our standby MDS for now

JanVC · ‎2021-09-23

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

I'm guessing you meant this SK?

Tomer_Noy · ‎2021-09-23

I'll try to share some information that might be helpful.

The fwd process runs on both the gateway and the log server / CLM. On the gateway side it is responsible for formatting the logs and sending them over the network. At high log rates, this can be quite compute intensive, and since this process is single threaded we might reach a bottleneck. In light of growing demand for more logs and more powerful gateways, we are developing a project that will allow multi-process fwd. Instead of making it multi-threaded, we will run multiple instances and greatly increase the maximum log throughput. This project will probably be targeted for R81.30, but we might have something earlier if someone is interested in an alpha.

On the log server side, fwd is responsible for accepting the logs and writing them into log files. On this end, fwd is actually simpler and less compute intensive. Most performance issues that we see on the log server are related to indexing and querying, since those operations have to "crunch" a lot of data. R81 brought significant performance improvements to both indexing and queries with the updated Solr indexing engine. R81.10 brought another enhancement that lets you distribute logs from a single gateway to multiple log servers / CLMs. This greatly helps with the scenario of a very powerful gateway with a lot of logging that exceeds the indexing capacity of a single log server. It also handles redundancy in a much more elegant way.

In your case, it indeed looks like fwd is maxing out on CPU. Beyond a stronger CPU or splitting to multiple log servers, I can suggest a few more things to check:
1) Since you are running on a virtualized environment, verify that resources are reserved for your MLM and not shared with other VMs.
2) Check if IO is high.
3) Check if you have many logs of type "Alert". You can check the rulebase "Track" definition, or do a query on the logs. Handling alerts in fwd adds additional computation flows and in large numbers might increase the load on it.
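For points 1 and 2, a rough first look is possible from generic Linux counters. A minimal sketch, assuming the standard /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal); the function name `cpu_wait_shares` is made up for illustration. The values are cumulative since boot rather than current, and not every hypervisor reports steal time to the guest, so treat a low steal figure as inconclusive.

```shell
#!/bin/bash
# Hypothetical helper: print the iowait and steal shares of total CPU
# time since boot, from the aggregate "cpu" line of a /proc/stat-format
# file (defaults to /proc/stat).
cpu_wait_shares() {
    local stat_file="${1:-/proc/stat}"
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9; i++) total += $i   # user .. steal
        if (total == 0) { print "n/a"; exit }
        printf "iowait=%.1f%% steal=%.1f%%\n", 100 * $6 / total, 100 * $9 / total
        exit
    }' "$stat_file"
}

# Usage on the log server:
#   cpu_wait_shares
```

A persistently high iowait share would point towards storage, while a non-zero steal share suggests the VM is competing with others for physical CPU.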

Kaspars_Zibarts · ‎2021-09-23

Thanks @Tomer_Noy great to have full explanation! Really appreciated!

1) Yes, we are moving to a more isolated environment now with higher clock speed CPUs, so that should help us!

2) IO was actually quite low - it was my first suspicion, but it was far from max utilisation.

3) Good to know! But that was not the case for us.

Kaspars_Zibarts · ‎2021-09-24

@Tomer_Noy one additional question:

I noticed a massive difference in FWD load (on Mgmt) depending on whether log export is running in that CMA/CLM.

For example - same CMA, but logs have now been split:

Some gateways send logs to the secondary MDS/CMA - that generates 25k logs/sec and FWD sits at approx 50% (so max would be 50k/s). Log export is OFF.

Other gateways send logs to the MLM/CLM - that generates 5k logs/sec and FWD sits at 65% (so max would be less than 10k/s), but log export is ON. Clock speed is better on these CPUs BTW.

So it looks like log export has a major impact on FWD performance.
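The "max would be" figures above are a simple linear extrapolation, which can be sketched as follows. It assumes FWD CPU use scales linearly with log rate, which is only a rough approximation (the earlier test where CPU jumped at a steady 10k logs/sec shows it does not always hold); the function name `est_max_rate` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: estimate the log rate at which a single-threaded
# fwd would reach 100% CPU, assuming CPU use grows linearly with rate.
#   $1 = observed logs/sec, $2 = observed fwd CPU in percent
est_max_rate() {
    local rate="$1" cpu_pct="$2"
    awk -v r="$rate" -v c="$cpu_pct" 'BEGIN {
        if (c <= 0) { print "n/a"; exit }
        printf "%d\n", r * 100 / c
    }'
}

est_max_rate 25000 50   # secondary MDS/CMA, log export OFF
est_max_rate 5000 65    # MLM/CLM, log export ON
```

With the numbers from this post it prints 50000 and 7692, i.e. roughly a 6x difference in estimated headroom between the two servers.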

Tomer_Noy · ‎2021-09-25

Which type of log export are you using?

The new LogExporter, or the legacy LEA?

Kaspars_Zibarts · ‎2021-09-26

LogExporter Tomer, thanks for checking this!
