WiliRGasparetto
MVP Diamond

Maestro Troubleshooting in Practice

Operational runbook (MHO + SGMs + traffic + VPN) with field commands

If you run Quantum Maestro in production, you’ve probably seen the pattern: issues that “look like VPN” or “look like policy” often turn out to be Security Group health, a single divergent SGM, a physical/link problem (cable/port/optics), or an unstable uplink. The key to reducing MTTR is discipline: evidence + commands, without skipping layers.

Below is a practical “copy-and-run” runbook, with good vs bad interpretation.

 

1) Mental model 

  • MHO (Orchestrator): controls the Security Group (inventory, health, ports, fabric).

  • SGMs: run the dataplane (sessions, inspection, VPN, state).

  • Typical symptom patterns:

    • Unhealthy SG → everything becomes a symptom (policy/VPN/traffic).

    • Unhealthy single SGM → intermittent behavior (“sometimes it works”).

2) clish vs gclish (why this becomes a real incident)

clish

  • Local node context.

  • Useful for point inspection, but risky for configuration in Maestro because it can introduce drift (one member behaving differently).

gclish

  • Global Security Group context.

  • Operational rule:

    • use gclish when the intent is global consistency (uniform validation/collection/adjustment);

    • use clish only when you need to inspect/act on a specific member in a controlled way.

A recurring field root cause: a change made with clish on a single member → the SGM starts handling traffic differently → intermittent symptoms that are hard to reproduce.
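One way to make this kind of drift visible is to compare a per-member fingerprint of the same setting and flag whoever disagrees with the majority. The sketch below is only an illustration: the function name, the member IDs, and the idea of feeding it "member fingerprint" pairs (for example, a checksum of the same file taken on each member) are assumptions, not a Check Point tool.

```shell
# Sketch: flag members whose fingerprint differs from the majority.
# Input (stdin): one "<member-id> <fingerprint>" pair per line.
# How the pairs are collected is an assumption and is not shown here.
find_divergent_members() {
  awk '
    { fp[$1] = $2; count[$2]++ }
    END {
      best = ""; max = 0
      # Pick the most common fingerprint (ties are resolved arbitrarily).
      for (h in count) if (count[h] > max) { max = count[h]; best = h }
      # Print every member that does not match it.
      for (m in fp) if (fp[m] != best) print m
    }
  ' | sort
}

# Illustrative input: member 1_03 carries a different fingerprint.
printf '1_01 aaa\n1_02 aaa\n1_03 bbb\n1_04 aaa\n' | find_divergent_members
```

An empty result means all members agree on that setting; it does not prove the setting itself is correct.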

 

3) Fast triage: start with the Security Group

3.1 Global Maestro / Security Group health

On the MHO:

orch_stat -all

What this proves:

  • whether all SGMs are present/operational

  • whether any member is degraded/missing

  • signals of port/fabric issues

Good: all members OK, stable links, no critical port down.
Bad: missing/degraded member, unstable links → fix the foundation before analyzing VPN/policy.


3.2 Security Group sanity check

asg diag verify

What this proves: high-level SG consistency and quick integrity checks.
Bad: critical alerts → return to orch_stat -all and isolate the failing member/port.


3.3 Capacity before taking member-level actions

asg perf -v

What this proves: whether the SG has enough headroom (CPU/memory) to absorb load during isolation/actions.
Bad: SG near its limits → avoid disruptive actions.
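To put a number on "headroom", a rough estimate is to assume load is spread evenly and ask what each remaining member would carry if one SGM is taken out. The helper names and the 85% ceiling below are hypothetical, and real distribution is rarely perfectly even, so treat the result as a lower bound on the impact.

```shell
# projected_load <member_count> <avg_load_percent>
# Estimated per-member load (rounded up) after removing one member,
# assuming the total work is redistributed evenly.
projected_load() {
  local members=$1 load=$2
  if [ "$members" -le 1 ]; then
    echo "n/a"   # nothing left to absorb the load
    return 1
  fi
  echo $(( (members * load + members - 2) / (members - 1) ))
}

# safe_to_isolate <member_count> <avg_load_percent> <max_acceptable_percent>
# Succeeds only if the projected load stays at or below the ceiling.
safe_to_isolate() {
  local projected
  projected=$(projected_load "$1" "$2") || return 1
  [ "$projected" -le "$3" ]
}

# Example: 4 SGMs averaging 60% CPU, with a hypothetical 85% ceiling.
if safe_to_isolate 4 60 85; then
  echo "projected $(projected_load 4 60)% per member: isolation looks tolerable"
else
  echo "not enough headroom: avoid disruptive actions"
fi
```

With 4 members at 60%, the remaining 3 would sit at about 80% each. With 2 members at 50%, the survivor would be at 100%, which is the "avoid disruptive actions" case.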


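3.4 Full health-check run (use with care)

hcp -r all

Note: hcp is the HealthCheck Point (HCP) tool, and -r all runs every test and produces a report rather than repairing state. It is commonly used in playbooks, but it should not be the first “blind” step: run it after the targeted checks above, so you know what to look for in the report.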

 

4) Physical and link health (where most “bugs” actually start)

When you see intermittency, “traffic disappears,” or only some users/flows fail, first prove whether there is physical/L1–L2 instability.

4.1 Inventory/port-map quick reference

 

orch_stat -p

or

cat /etc/maestro.json

Use this to confirm interface/port mapping in the Maestro context.

4.2 Counters and drops (all members)

g_all netstat -ni

What to look for: increasing RX-ERR/TX-ERR/drops.
If these counters climb, they often explain VPN flapping, broken sessions, and “policy is OK but traffic fails.”
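A single reading of these counters says little; what matters is whether they grow between two readings. A minimal sketch, assuming you saved the standard Linux netstat -ni output of one member to two files some time apart (the function name and file names are hypothetical, and the per-member framing that g_all adds is not handled here):

```shell
# netstat_counter_deltas <before_file> <after_file>
# Prints "<iface> <counter> <old> -> <new>" for every error/drop counter
# that increased between the two snapshots. Columns are located by the
# header row, so the optional "Met" column does not matter.
netstat_counter_deltas() {
  awk '
    BEGIN { n = split("RX-ERR RX-DRP TX-ERR TX-DRP", names, " ") }
    $1 == "Iface" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    !("TX-DRP" in col) || NF < col["TX-DRP"] { next }   # skip non-data lines
    {
      for (k = 1; k <= n; k++) {
        key = $1 " " names[k]
        val = $(col[names[k]]) + 0
        if (NR == FNR) before[key] = val               # first file: remember
        else if ((key in before) && val > before[key]) # second file: compare
          print key, before[key], "->", val
      }
    }
  ' "$1" "$2"
}

# Usage (on one member):
#   netstat -ni > before.txt; sleep 60; netstat -ni > after.txt
#   netstat_counter_deltas before.txt after.txt
```

No output means none of the four counters moved during the interval. Any line that does appear points at the interface to investigate first.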

4.3 Per-interface physical errors (CRC/symbol errors)

ethtool -S <interfacename>

Good: no CRC/errors increasing.
Bad: CRC/symbol errors → treat as L1/L2 (cable/optics/port/switch) before focusing on VPN.
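The same "is it increasing?" question applies here. The sketch below compares two saved ethtool -S outputs for one interface and prints only the error-type counters that grew. The function name is hypothetical, and because counter names are driver-specific, the crc|symbol|err name filter is an assumption you may need to adjust for your NICs.

```shell
# ethtool_error_deltas <before_file> <after_file>
# Prints "<counter> +<increase>" for counters whose name contains
# crc, symbol, or err and whose value grew between the two snapshots.
ethtool_error_deltas() {
  awk -F': *' '
    NF != 2 || $2 !~ /^[0-9]+$/ { next }       # keep only "name: number" lines
    {
      name = $1
      gsub(/^[ \t]+/, "", name)                # drop leading indentation
      if (tolower(name) !~ /crc|symbol|err/) next
      if (NR == FNR) before[name] = $2 + 0
      else if ((name in before) && $2 + 0 > before[name])
        print name, "+" ($2 - before[name])
    }
  ' "$1" "$2"
}

# Usage:
#   ethtool -S <interfacename> > before.txt; sleep 60
#   ethtool -S <interfacename> > after.txt
#   ethtool_error_deltas before.txt after.txt
```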

4.4 Real link flap (carrier)

asg_ifconfig | grep carrier | grep -v "carrier: 0"

Bad: carrier oscillation → intermittent behavior is highly likely.

4.5 Hardware health (sensors)

g_all cpstat -f sensors os

What this proves: thermal/power/fan conditions can lead to instability and erratic behavior.

4.6 Maestro port state

show maestro port <port>

Confirms the port’s state/configuration in the Maestro domain.

 

5) The turning point: “no log” — does the traffic exist in the SG dataplane?

This step quickly separates “problem before the gateway” from “problem inside the gateway.”

5.1 Prove the session/connection on the SG

Example (intentionally generic IPs):

asg search -v 10.10.40.25 \* 203.0.113.50 443 tcp

Interpretation:

  • No output: traffic is likely not reaching the SG (or it’s taking a different path). Return to L1/L2/L3 and capture at the correct point.

  • Output present: traffic exists in the dataplane; you now have a basis to correlate with NAT, routing, policy, and VPN.

If the connection does not “exist” for the SG, changing policy/VPN is usually wasted effort.
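The layered order of sections 3 to 5 can be written down as a small decision helper. This is only a sketch of the reasoning: the function name and the three ok/stable/found inputs are hypothetical labels for what you concluded from the commands above, not something the tools print.

```shell
# next_step <sg_health: ok|bad> <links: stable|unstable> <session: found|none>
# Returns the first layer that still needs attention, in runbook order.
next_step() {
  if [ "$1" != "ok" ]; then
    echo "fix Security Group health first (section 3)"
  elif [ "$2" != "stable" ]; then
    echo "treat as L1/L2: cable/optics/port/switch (section 4)"
  elif [ "$3" != "found" ]; then
    echo "traffic is not reaching the SG: check the path and capture upstream (section 5)"
  else
    echo "traffic is in the dataplane: correlate with NAT, routing, policy, VPN"
  fi
}

# Example: SG healthy, links stable, but the connection search returned nothing.
next_step ok stable none
```

The point of the ordering is that a lower layer always wins: an unhealthy SG or a flapping link is reported before the session result is even considered.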

 

6) Single-SGM failure: how to investigate and restore consistency

Typical symptom: intermittent failures, “some flows drop,” “works after some time.”

6.1 Controlled action to reintegrate a suspected member (when needed)

On the suspected SGM:

clusterXL_admin down
clusterXL_admin up

Risk: medium (sessions anchored to that member can be impacted).
Pre-condition: confirm headroom with asg perf -v.

6.2 Check state and drift indicators

cphaprob list
tail $FWDIR/log/blade_config

What to look for:

  • cphaprob list: the registered critical devices (pnotes) and their state; any device reporting a problem signals an HA/cluster participation issue or inconsistency

  • blade_config: alerts and errors that indicate configuration drift

Closing

Maestro troubleshooting requires discipline: start with SG health, then validate physical stability, then prove the traffic exists in the dataplane, and only then go deeper. If you follow this sequence with objective commands, “phantom incidents” drop sharply, and troubleshooting becomes engineering, not guesswork.

15 Replies
israelfds95
MVP Gold

very good, very useful 

WiliRGasparetto
MVP Diamond

Thanks, Israel.

WiliRGasparetto
MVP Diamond

Throughout my career I’ve learned to start with the fundamentals, because 90% of problems are solved there. In Maestro’s case, it’s no different — most of the issues I’ve resolved happened because the analyst didn’t know how to differentiate between clish and gclish, which ended up causing misconfigurations.

the_rock
MVP Diamond

Awesome work. Btw, I could not agree more with what you said. I can't even count how many times I've been on calls with people where the fix turned out to be something very simple in the end.

Best,
Andy
"Have a great day and if its not, change it"
WiliRGasparetto
MVP Diamond

I’ve already seen troubleshooting cases that lasted for days turn out to be just a simple VLAN issue. Usually, people miss the basics, focus on the more complex aspects, and forget to check the fundamentals.

Serge_Wuethrich
Explorer

Good and useful guideline in general. I would just like to point out that a few of your commands (asg diag, asg perf, asg search) no longer exist in R82 and have been moved to insights or cluster-cli.

Check the release notes for more changes regarding Maestro:

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_RN/Content/Topics-RN/Software-Chan...

 

WiliRGasparetto
MVP Diamond

Thank you very much for the tip. I still haven’t had the opportunity to work with R82 on Maestro, so I’ll take a look. It will be good for me to understand the differences between them and, perhaps, even update the title of this topic to R81.20, since it may indeed be obsolete in R82.

Dom_Galvao
Explorer

very good content with practical examples.

emmap
MVP Gold CHKP

This guide seems to conflate the MHOs and the SMO. The SMO is not an orchestrator, it's an SGM. 

I don't think there's a file called /etc/maestro.json. For a port inventory at the MHO you would use orch_stat -p, or the MHO WebUI in R82+.

That 'last resort' of just deleting the security group with no follow up is terrible advice. What's going on there, you're just going to remove the group entirely and give up? Please review this and make sure you're not suggesting steps that will cause massive problems. There are many other things that can be attempted in a troubleshooting context before going nuclear here. 

_Val_
Admin

@emmap /etc/maestro.json file is mentioned in sk164712. 

About removing a security group, I agree that would be a very bad move in a production environment. 

My understanding is, @WiliRGasparetto is writing this based on his lab trials.

emmap
MVP Gold CHKP

Yep, you're right, that is a file. My mistake.

WiliRGasparetto
MVP Diamond

I’m going to remove that step, and I’ll look for better approaches. I included it only as a last-resort option when there was truly no solution and in coordination with Check Point TAC, but presenting it as a standard solution was a bad idea. Thank you very much for the feedback.

WiliRGasparetto
MVP Diamond

I also added the command `orch_stat -p` as the first option, followed by verification with `cat /etc/maestro.json`. I found your point very helpful, @emmap.

batata
Explorer

Nice

WiliRGasparetto
MVP Diamond

Thanks.