PEO
Participant

Skyline - capacity of Prometheus/Grafana server

Any minimum and general capacity requirements for a Skyline server?

The usual stuff: Storage, RAM, CPU etc.

Let's say it needs to support data from 50 gateways.

 

Regards, Poul Erik

7 Replies
PhoneBoy
Admin

Grafana and Prometheus are Open Source.

You can see the minimal requirements for Grafana here: https://grafana.com/docs/grafana/latest/setup-grafana/installation/ 
Prometheus doesn't provide any hard guidance, but they do provide some details here:

How this translates to 50 Check Point devices, I can't say.
@Tsahi_Etziony approximately what specs are we using to test this?

Arik_Ovtracht
Employee

The storage requirement for the Prometheus database for storing Skyline data is roughly 25MB per reporting device for the default retention period of 15 days. This default can be changed in the Prometheus server configuration, and you should change the calculation accordingly.

We don't have exact numbers for the CPU and RAM requirements, but I can tell you that we have tested a couple dozen devices reporting to a Prometheus and Grafana server installed on a small VM with 4 cores and 8GB of RAM. For larger environments, I would advise getting something a little more robust, but you can see that it doesn't have to be too big.
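
As a rough illustration of the storage figure above, here is a minimal Python sketch. It assumes disk usage scales linearly with both the number of reporting devices and the retention period, which the thread does not confirm; the function name `estimate_storage_mb` is made up for this example.

```python
# Rough Prometheus disk estimate for Skyline data.
# Based on the figure quoted above: ~25 MB per reporting device
# for the default 15-day retention period.
MB_PER_DEVICE_AT_DEFAULT = 25
DEFAULT_RETENTION_DAYS = 15


def estimate_storage_mb(devices: int, retention_days: int = DEFAULT_RETENTION_DAYS) -> float:
    """Estimate disk usage in MB.

    Assumption (not confirmed in the thread): usage grows linearly with
    both the device count and the retention period.
    """
    if devices < 0 or retention_days <= 0:
        raise ValueError("devices must be >= 0 and retention_days must be > 0")
    per_device_mb = MB_PER_DEVICE_AT_DEFAULT * retention_days / DEFAULT_RETENTION_DAYS
    return devices * per_device_mb


if __name__ == "__main__":
    for devices, days in [(50, 15), (50, 90), (700, 15)]:
        mb = estimate_storage_mb(devices, days)
        print(f"{devices} devices, {days} days: ~{mb:.0f} MB")
```

Under that assumption, 50 gateways at the default retention come to roughly 1.25 GB, so disk is unlikely to be the limiting factor at this scale.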

David_Evans
Contributor

I was looking for this same information. What I can tell you is that our POC is running with about 75 devices. The 4-CPU Linux box is pretty busy with just a handful of alerts running and no one actively hitting the web dashboards. I'm trying to work out how this is going to scale out to 700 devices. A single 40-CPU VM probably isn't a great solution for a linear 10x increase.

[Screenshots: top.png, machinecount.png]

David_Evans
Contributor

We have grown our deployment some more. One thing we noticed: Prometheus is not very efficient at pruning data when it's constrained by drive size. We had hit the limit set by "--storage.tsdb.retention.size" and it was spending a lot of CPU and I/O cycles making room for the new incoming data. In our environment, at least, setting "--storage.tsdb.retention.time", letting it do its cleanup that way, and manually managing the size of the data store seems a much better use of resources. Keep "--storage.tsdb.retention.size" set just to keep from filling up the drive accidentally.
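
A minimal Python sketch of that setup, building a Prometheus command line where time-based retention does the regular cleanup and the size limit acts only as a safety cap. The helper name `prometheus_args`, the file paths, and the example values are assumptions for illustration, not values from our environment.

```python
from typing import List


def prometheus_args(
    retention_time: str,
    size_cap: str,
    config_file: str = "/etc/prometheus/prometheus.yml",  # assumed path
    data_dir: str = "/var/lib/prometheus",                 # assumed path
) -> List[str]:
    """Build a Prometheus argument list.

    retention_time drives the normal cleanup (e.g. "15d").
    size_cap is a safety net only; set it well above the size you
    expect the data store to reach within retention_time (e.g. "80GB").
    """
    return [
        "prometheus",
        f"--config.file={config_file}",
        f"--storage.tsdb.path={data_dir}",
        f"--storage.tsdb.retention.time={retention_time}",
        f"--storage.tsdb.retention.size={size_cap}",
    ]


if __name__ == "__main__":
    # Example values only: 15 days of data, with a cap the data store
    # should never reach in normal operation.
    print(" ".join(prometheus_args("15d", "80GB")))
```

The point is that the size cap should sit comfortably above normal usage, so that Prometheus is not constantly pruning against it.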

Here are our current stats after fixing the storage. Before, we were running at 80%+ across the CPUs most of the day.

 

[Screenshots: Screenshot 2024-09-20 074358.png, Screenshot 2024-09-20 074425.png]



Elad_Chomsky
Employee

Hi @David_Evans ,

For much bigger environments I recommend also taking a look at VictoriaMetrics. We have had some good experience with it as well, and the API is almost identical to Prometheus: https://last9.io/blog/prometheus-vs-victoriametrics/ . From what I understand, it is more resource efficient and easier to scale than Prometheus. In general, it is also possible to split Prometheus into multiple instances and connect Grafana to all of them; however, this will slightly increase the management overhead of the environment.
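
One way to picture the multi-instance option is a fixed mapping from each reporting device to one Prometheus instance. The Python sketch below is only an illustration: the hash-based assignment, the function name `assign_instance`, and the instance and gateway names are all assumptions, not something the thread or Skyline prescribes.

```python
import hashlib
from collections import Counter
from typing import List


def assign_instance(device: str, instances: List[str]) -> str:
    """Pick one Prometheus instance for a device, deterministically.

    Uses a stable hash so the same device always maps to the same
    instance as long as the instance list does not change.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    digest = hashlib.sha256(device.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


if __name__ == "__main__":
    # Hypothetical instance and gateway names.
    instances = ["prometheus-1", "prometheus-2", "prometheus-3"]
    devices = [f"gw-{n:03d}" for n in range(700)]
    load = Counter(assign_instance(d, instances) for d in devices)
    for name in instances:
        print(name, load[name])
```

Each instance would then be added to Grafana as its own data source. Note that changing the instance list reshuffles most devices in this simple scheme, so it suits a layout that is decided once up front.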

Daniel1107
Explorer

Hello Erik,

Can you tell us how much capacity you currently use for your servers (Prometheus and Grafana)?

I have almost the same number of gateways so this would be very interesting for me.

 

Regards

Daniel

Elad_Chomsky
Employee

It varies, as we have multiple environments, but usually we use a server with the following spec:

4 Cores, 8-16 GB Memory, 50-100 GB Disk space.

That is usually enough for all of our internal lab needs. There is no specific recommendation, though: scaling should be done by monitoring and observing the current load on the system.
