Alexander_Wilke
Advisor

Skyline Custom Script - Maestro Chassis State and add additional processes to otlp monitoring

Hello,

I am running R81.20 + JumboHFA Take 99. I use the Skyline packages:

BUNDLE_CPVIEWEXPORTER_AUTOUPDATE Take: 67
BUNDLE_CPOTLPAGENT_AUTOUPDATE Take: 92
BUNDLE_CPOTELCOL_AUTOUPDATE Take: 179

I created my script based on this documentation, which is neither complete nor entirely correct:
https://sc1.checkpoint.com/documents/Appliances/Skyline/Content/Topics-AG/Custom-Metrics.htm#:~:text....

Idea:
Some existing metrics show the cluster_xl status, but in a Maestro environment they only report the status of the individual SGM. Because all SGMs are normally "ACTIVE", no metric indicates which chassis is the ACTIVE one, or whether a chassis is ACTIVE at all rather than down, admin down, standby, etc.

So I wrote a simple bash script that collects this information from "asg stat -v" and exposes it as an OTLP metric. Below are my script, the commands I used, and notes that helped me understand what to do.

 

### create the script which collects the information we need:
vi /config/skyline_custom_metrics/chassis_state.sh

____________________________________________________________________________________________________________________________________________________

#!/bin/bash
### Include Checkpoint environment variables
source /opt/CPshrd-R81.20/tmp/.CPprofile.sh
. /opt/CPotlpAgent/cs_data_handler_is.bash


## script part to check the values

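# run "asg stat -v" only once and keep the line that follows the "SGM ID" header
chassis_line=$(asg stat -v | grep -A1 "SGM ID" | grep -v "SGM ID")

# status chassis 1 (second column of that line)
chassis_1=$(awk '{print $2}' <<< "$chassis_line")

# status chassis 2 (third column of that line)
chassis_2=$(awk '{print $3}' <<< "$chassis_line")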

# reset variable "metric_value"
metric_value=0  

# check if both chassis are ACTIVE (split brain) and set metric_value
if [[ "$chassis_1" == *ACTIVE* && "$chassis_2" == *ACTIVE* ]]; then  
    metric_value=3  
# check chassis_1
elif [[ "$chassis_1" == *ACTIVE* ]]; then  
    metric_value=1  
# check chassis_2
elif [[ "$chassis_2" == *ACTIVE* ]]; then  
    metric_value=2  
# check if no chassis is active set 0
else  
    metric_value=0  
fi  



## Build the OTLP metric with its labels and values
     # value of the metric itself. if not gauge or counter set to 1 or something else
set_ot_object new value "${metric_value}"

     # "host_name" can be added as a label like this if it is needed for identification (commented out here)
     # set_ot_object last label host_name ${host_name}
     # label "chassis_1": its value is the chassis 1 state saved in the variable earlier in the script
set_ot_object last label chassis_1 "${chassis_1}"

     # label "chassis_2": its value is the chassis 2 state saved in the variable earlier in the script
set_ot_object last label chassis_2 "${chassis_2}"

### mandatory to quit the script
script_exit "Finished running" 0


____________________________________________________________________________________________________________________________________________________
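The parsing and the state mapping can also be tried without access to a Maestro system. The following is a minimal sketch, not the script registered with Skyline: the function names parse_chassis_states and chassis_metric_value are made up for illustration, the sample text is hand-written and only imitates the two lines the script relies on (the "SGM ID" header and the line after it), and the mapping uses "case" instead of "[[ ]]" so that it also runs in a plain POSIX shell.

```shell
#!/bin/bash
# Sketch only: function names and sample text are illustrative assumptions.

# Read "asg stat -v" style text on stdin and print "<state_1> <state_2>":
# columns 2 and 3 of the first line that follows the "SGM ID" header.
parse_chassis_states() {
    grep -A1 "SGM ID" | grep -v "SGM ID" | awk 'NR == 1 { print $2, $3 }'
}

# Same mapping as in the script:
# 0 = no chassis ACTIVE, 1 = chassis 1, 2 = chassis 2, 3 = both (split brain).
chassis_metric_value() {
    value=0
    case "$1" in *ACTIVE*) value=$((value + 1)) ;; esac
    case "$2" in *ACTIVE*) value=$((value + 2)) ;; esac
    echo "$value"
}

# Hand-written sample, NOT real "asg stat -v" output.
sample='| SGM ID        Chassis 1      Chassis 2      |
|               ACTIVE         STANDBY        |'

states=$(printf '%s\n' "$sample" | parse_chassis_states)
# intentionally unquoted: split "<state_1> <state_2>" into two arguments
set -- $states
echo "chassis_1=$1 chassis_2=$2 value=$(chassis_metric_value "$1" "$2")"
```

Piping real output into the first function ("asg stat -v | parse_chassis_states") is a quick way to check whether columns 2 and 3 are the right ones on a given system before the script is registered.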

### create the json file which we need for this script
/config/skyline_custom_metrics/chassis_state.json


### the documentation omits the "secured" parameter and does not explain its meaning, but sklnctl complains if it is missing

{
  "state" : "enabled",
  "command" : "/config/skyline_custom_metrics/chassis_state.sh",
  "desc" : "Chassis status in Maestro",
  "name" : "chassis.state",
  "type" : "Gauge",
  "unit" : "{bool}",
  "interval" : 15,
  "secured" : "false"
}
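
Since sklnctl rejects a descriptor with a missing parameter, it can help to check the file before copying it to all members. This is a minimal sketch under assumptions: check_descriptor is a made-up helper, and the key list is simply the set of keys used in the JSON above, not an official schema. It only looks for the key names and does not validate the JSON syntax or the values.

```shell
#!/bin/bash
# Sketch only: the key list is taken from the descriptor in this post,
# not from an official Skyline schema.

check_descriptor() {
    file=$1
    missing=""
    for key in state command desc name type unit interval secured; do
        # look for "key" followed by optional whitespace and a colon
        if ! grep -q "\"$key\"[[:space:]]*:" "$file"; then
            missing="$missing $key"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing keys:$missing"
        return 1
    fi
    echo "ok"
}
```

Example call: check_descriptor /config/skyline_custom_metrics/chassis_state.json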


____________________________________________________________________________________________________________________________________________________


### as a test, make the files executable and run the script directly; it returns JSON output that tells you whether the script worked:
chmod 775 /config/skyline_custom_metrics/*
/config/skyline_custom_metrics/chassis_state.sh


### copy script and json to all members of the Maestro Cluster
asg_cp2blades /config/skyline_custom_metrics/ -r


## --name is the name of the script, not the filename; the documentation is wrong here
## --path is the path to the json, not to the shell script (the json contains the path to the shell script); the documentation is wrong here
## "secured" must be present in the json; the documentation omits it
## "yes |" answers the confirmation prompt
gexec -b all -c 'yes | sklnctl otlp add --name /config/skyline_custom_metrics/chassis_state --path /config/skyline_custom_metrics/chassis_state.json'

## you need to enable the script first; this step is missing in the documentation
## --name is the name you defined earlier
## "script" means the entry is of type script. you can also enable and disable "collectors" with this command.
gexec -b all -c 'sklnctl otlp enable --name chassis_state script'


### restart the otlp and otelcol processes and wait for the metrics.
g_all /opt/CPotlpAgent/CPotlpagentCli.sh stop; sleep 2; g_all /opt/CPotlpAgent/CPotlpagentCli.sh start
g_all /opt/CPotelcol/CPotelcolCli.sh stop; sleep 2; g_all /opt/CPotelcol/CPotelcolCli.sh start
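
The same stop / sleep / start sequence is needed after every change, so it can be wrapped in a small function. This is a minimal sketch under assumptions: restart_skyline and run are made-up helper names, DRY_RUN is a hypothetical switch introduced here to print the commands instead of executing them, and g_all is assumed to be callable as a normal command in the shell where the function is defined.

```shell
#!/bin/bash
# Sketch only: helper names and the DRY_RUN switch are illustrative assumptions.

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restart the OTLP agent and the OpenTelemetry collector on all members,
# using the same two CLI scripts and the same 2 second pause as in the post.
restart_skyline() {
    for cli in /opt/CPotlpAgent/CPotlpagentCli.sh /opt/CPotelcol/CPotelcolCli.sh; do
        run g_all "$cli" stop
        run sleep 2
        run g_all "$cli" start
    done
}
```

Calling it with DRY_RUN=1 first ("DRY_RUN=1 restart_skyline") shows the six commands that would be run.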



### To add additional processes to the monitoring you may add these by the following commands

## add additional system processes to the monitoring
## To check if the process is monitored check this metric: "process_cpu_usage"
## I added pepd, pdpd and rsyslogd - I used "ps -ef | sort" to get a list on the system
gexec -b all -c 'sklnctl otlp process --add pepd,pdpd,rsyslogd'

### shows the list of all monitored processes
sklnctl otlp process --show

### restart the otlp and otelcol processes
g_all /opt/CPotlpAgent/CPotlpagentCli.sh stop; sleep 2; g_all /opt/CPotlpAgent/CPotlpagentCli.sh start
g_all /opt/CPotelcol/CPotelcolCli.sh stop; sleep 2; g_all /opt/CPotelcol/CPotelcolCli.sh start
7 Replies
David_Evans
Collaborator

Do we want to start a maestro only thread?

I have a Maestro view that lets me select, in Skyline, the firewall name that shows up in SmartConsole.

It first shows some averages / summaries of all the blades in that SG.

Then it walks through all the blades / members in that SG and shows CPU, memory, network, etc. for each blade, using the <smartconsolename01-01  01-02  01-03...> names to get the individual blade specs.

I think parts of your script above will help with that, as I have trouble figuring out when to "stop" and how to tell whether a specific blade number is down / missing because it has been pulled out at the SMO level or because it has an issue...


Alexander_Wilke
Advisor

@David_Evans I think I still have this. I created a dashboard variable for "host_name" and this allows me to view all my firewalls and SGMs in a separate "row". It's based on @Kaspars_Zibarts Dashboard which I found somewhere here in the forum.

 

I want to add missing metrics like the 64k/maestro specific metrics and commands.

asg stat -v
fwaccel stats -s
orch_stat -p
and others.

Building these scripts is the more or less easy step. I use AI and it builds the shell scripts for me.
My problem is that these metrics are not collected, or not sent correctly, or the otlpagent stops executing these scripts because of high CPU usage.

My "asg stat -v" panel looks like this. To use the panel json you need Grafana 12.0.0 or higher and the new layouts feature toggle enabled.

asg_stat_v.png

Grafana Panel json as attachment.

My complete CheckPoint Dashboard is added as an attachment.


In addition, I added scripts for:
fwaccel stats -s
asg stat -v
orch_stat -p (tx and rx)

If you run them individually, they work. If they are run via "sklnctl otlp add --name", they do not work, or only on some SGMs and not all. I have several tickets open with my Diamond team.


 

 

_Val_
Admin

Great job, @Alexander_Wilke 

Please look at my email, as we might have some plans for your script 🙂

Alexander_Wilke
Advisor

Hello,

I have a bunch of new and updated scripts. I think they work; at least all of them produce JSON output.
Unfortunately, the OTLP Agent, including version 103, has bugs with the cs_data_handle file locks that prevent multiple scripts from running. This issue should be solved in Take 114, but that take has not been released yet.

I am not sure whether CPotlpAgent version 114 will also fix the issue with scripts that run long (20 s) and generate (high) CPU load, e.g. on an MHO with only 2 CPU cores.
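
Independent of the agent version, a script can bound the run time of a slow command and lower its priority. This is a minimal sketch under assumptions: run_limited is a made-up helper, the 10 second limit and the nice level are arbitrary example values, the coreutils tools "timeout" and "nice" are assumed to be available on the system, and "echo" stands in for the real command. It does not address the agent-side file lock problem.

```shell
#!/bin/bash
# Sketch only: helper name, limit and nice level are illustrative assumptions.

# Run a command with lowered priority and a hard time limit.
# "timeout" exits with status 124 when the limit is reached.
run_limited() {
    limit=$1
    shift
    nice -n 10 timeout "$limit" "$@"
}

# "echo" is a stand-in for a slow command such as the ones listed in the thread.
if output=$(run_limited 10 echo "ACTIVE STANDBY"); then
    echo "got: $output"
else
    echo "command failed or timed out, reporting fallback value 0"
fi
```

Reporting a fallback value on timeout keeps the script short-lived, at the cost of a gap in the real data for that interval.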

PS:
I am only allowed to attach 20 files, so I attached the shell scripts. You may need to create the json files on your own.

 

 

 

Sven_Glock
Advisor

Nice work, thank you! 👍
I already stopped my work on custom scripts because of the stability issues you mention. I hope those are fixed soon.
One more for your list could be "asg stat vs all", which shows the HA state of VSX virtual systems in a Maestro cluster.
A few weeks ago I opened an RFE for this, but RFEs take more time than creating custom scripts 🙂

Alexander_Wilke
Advisor

Hi Sven,

I do not use any virtual systems or security groups, so (a) I don't know exactly what the output looks like with multiple VS and (b) I don't know which label:value pairs would be relevant or interesting in a metric.

 

To generate these scripts, I paste the Check Point custom metrics documentation and the exact output of the command into ChatGPT or another chat box and then iterate until I have a script that gives me JSON output.

I noticed that if I start with some simple tasks and get a working script, the AI can use it to create the follow-up scripts better.

However, as long as the scripts do not run reliably, this is useless. Hopefully Take 114 will solve these issues.

Sven_Glock
Advisor

If a kind fairy conjures up some free time for me, I would gladly try to delve into the topic of VSX/Security Group based on your scripts and would post it here.
